NVMe dropping out of vdev

The server in question is running TrueNAS 25.04.2.6.

One of the vdevs I use for VMs is four NVMe drives in a striped mirror.

Occasionally during a scrub, one random drive will drop out of the vdev. A reboot brings it back with no issues. This has happened four or five times over the past few months. It seems truly random; I have even moved the drives into different slots, and the failures don't track anything.

dmesg shows:

[891930.550148] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[891930.550580] nvme nvme0: Does your device have a faulty power saving mode enabled?
[891930.551003] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[891930.634105] nvme 0000:b3:00.0: Unable to change power state from D3cold to D0, device inaccessible

When I try to add the suggested parameters (which I assumed were sysctls), TrueNAS says they aren't in the kernel.

Any suggestions? Is there a way to add those parameters, or is that just for the stock Debian kernel and not supported by TrueNAS?
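For what it's worth, those are kernel boot parameters, not sysctls, which is why TrueNAS rejects them as sysctl entries. On a stock Debian system they would go on the kernel command line, something like the sketch below (illustrative only; TrueNAS manages its own boot environments, so hand edits like this may not survive updates):

```shell
# /etc/default/grub on a stock Debian box (illustrative only; TrueNAS
# manages its own boot environments, so manual GRUB edits may not persist)
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
# then regenerate the bootloader config and reboot:
# update-grub && reboot
```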

Managed Devices/Edit Disk shows HDD Standby = Always On and Advanced Power Management = Disabled, so I'm not sure what else to do.

The NVMe drives are all Samsung 990 Pro 4TB.

The power supply should be big enough to run double what I have, but I guess you never know; if nothing else I can swap it out when I get a chance, just to see.

Maybe a thermal issue?

I have some U.2 drives that dropped out under load and were fine once I rebooted (to initiate a rescan of the PCIe devices, since the box didn't support NVMe hot-swap).

That's a good idea.

I forgot to say the NVMe drives have the factory heatsink on them (they were cheaper :)).

And I added two fans in the PCIe slot next to them, blowing air directly on them; they never get very hot at all.

So I don't think that's it (but I originally did, hence the fans…).

I would bet you this is a thermal issue. NVMe drives generally run fairly cool unless under load, and a scrub is a significant load.

There is a reason Samsung sells this model with a heatsink installed.

You can look at the NVMe temps during a scrub: start a scrub, wait a few minutes, then run smartctl -x /dev/nvme0 to see the real temperature. Repeat the command several times during the scrub to get an idea of the temperature fluctuations.
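A minimal sketch of that polling loop, assuming the device name and 30-second interval as examples; it parses the `Temperature:` line that smartctl prints for NVMe devices:

```shell
#!/bin/sh
# Sketch: sample an NVMe drive's temperature while a scrub runs.
# /dev/nvme0 and the 30-second interval are just examples.

# Pull the first "Temperature:" value out of `smartctl -x` output.
parse_temp() {
    awk '/^Temperature:/ {print $2; exit}'
}

# Call this after starting the scrub; stop it with Ctrl-C when done.
monitor_temps() {
    dev=${1:-/dev/nvme0}
    while sleep 30; do
        printf '%s  %s C\n' "$(date +%T)" "$(smartctl -x "$dev" | parse_temp)"
    done
}
```

Run `monitor_temps /dev/nvme0` (and likewise for the other drives) in separate shells during the scrub and watch for spikes toward the drive's throttle point.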

If the NVMe drive does not exceed its thermal limits and you are still having the problem, there are other things to try.

Has it always been nvme0 that gets dropped? If yes, is it the same serial number? If so, I'd rotate the NVMe drives around one position: if you have a straight line of them, shift each one space to the right, with the one on the end going to the first slot. Now see if the problem comes back; if it does, note the drive by serial number and slot location. You could have a temperamental slot.

Do you know if your PCIe bus is running at PCIe 4.0 speeds? If so, see whether the motherboard lets you slow down to PCIe 3.0 speeds, and test again. Running the system slower may not be what you want, but it would be a good indicator for troubleshooting.
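One way to check the negotiated link speed is lspci against the controller's bus address (the b3:00.0 default below is taken from the dmesg output earlier in the thread; adjust per drive):

```shell
#!/bin/sh
# Sketch: report the negotiated PCIe link speed/width for one NVMe
# controller. The default address comes from the dmesg lines above.

# Extract "Speed ..., Width ..." from an lspci LnkSta line.
parse_lnksta() {
    sed -n 's/.*LnkSta:[[:space:]]*\(Speed[^,]*, Width[^,;]*\).*/\1/p'
}

link_status() {
    addr=${1:-0000:b3:00.0}
    lspci -vv -s "$addr" | parse_lnksta
}
```

A drive that has negotiated down to x1, or to a lower generation than the slot should support, would point at a slot, riser, or signal-integrity problem.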

Get creative in troubleshooting and rule out as many things as possible. It will likely not be a quick fix.

And while this may not be directly related to the problem you are asking about, have you run any stability tests? Prime95 CPU stress test for at least 4 hours, MemTest86+ for at least 5 complete passes?


No chance that they are missing firmware updates? That was a whole thing a couple of years back with Samsung. I personally had to pull a few and toss them into a Windows system to update them; it was a PITA.

I just checked, and they are on 4B2QJXD7; the latest is 7B2QJXD7.

Not sure what all is different, but that could be it, based on a build I did with 980 Pros a couple of years ago (not TrueNAS)…

I'm going to try to update them via a VM (removing one at a time and assigning it as a passthrough drive), and if that doesn't work, put it in a Windows box (ugh).

Thanks for the idea.
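If the passthrough VM route works, the same update can in principle be done in place with nvme-cli, assuming you can get the raw firmware image out of Samsung's updater; the image path and slot/action values below are illustrative and not verified for the 990 Pro:

```shell
#!/bin/sh
# Sketch: in-place NVMe firmware update with nvme-cli. The image path is
# hypothetical; Samsung ships firmware as a bootable ISO that the raw
# image would have to be extracted from.
FW_IMG=/root/990pro_7B2QJXD7.bin   # hypothetical extracted image

# True if the installed revision differs from the target revision.
needs_update() {
    [ "$1" != "$2" ]
}

update_fw() {
    dev=${1:?usage: update_fw /dev/nvmeX}
    # `nvme id-ctrl` prints the firmware revision on its "fr" line.
    current=$(nvme id-ctrl "$dev" | awk '/^fr / {print $3}')
    if needs_update "$current" "7B2QJXD7"; then
        nvme fw-download "$dev" --fw="$FW_IMG"
        # Commit to slot 0 and activate after the next reset (action 1).
        nvme fw-commit "$dev" --slot=0 --action=1
    fi
}
```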


I have four of these, two each on two PCIe adapter boards.

I have rotated the drives, their places on the boards, and the boards' PCIe slots; the failures don't track anything in particular, except that they happen during a scrub.

I first thought it might be temperature, so I put two fans blowing directly on them; the temps never get very hot, so I don't think that's it either.

It may just be the firmware timing out, something I haven't addressed yet. I'll give that a shot tomorrow.

Thanks for the help.

One thing I have heard affects NVMe reliability is power saving. Some NVMe drives can save power in a way that causes a delay when resuming full use. It is possible that this "feature" is enabled by default on the affected drives, AND that the delay is long enough for ZFS to decide that the drive failed.

When ZFS was originally written, power save was never considered. (Solaris is an Enterprise Data Center OS…)

Some of the power-saving and recovery features that hard disk drives have today absolutely affect ZFS. For example, TLER (Time Limited Error Recovery): if recovery runs too long (desktop HDDs generally take >60 seconds by default), ZFS may drop the entire drive for a bad block, even if spare blocks are available and RAID is used. Sometimes aggressive head parking, like on laptop drives, can also cause problems.

That said, there are NVMe-specific power-saving settings you might want to check to make sure they are not set too aggressively. For example, PCIe power saving is actually a thing…
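On the NVMe side specifically, the power-save mechanism is APST (Autonomous Power State Transitions, NVMe feature 0x0c), which is exactly what the `nvme_core.default_ps_max_latency_us=0` hint in the dmesg output targets. A hedged sketch for inspecting it and, as a test, disabling it on one drive (the set-feature change does not persist across a power cycle):

```shell
#!/bin/sh
# Sketch: inspect NVMe autonomous power state transitions (APST) and
# disable them on one drive as a test. /dev/nvme0 is an example device.

# APST is effectively off when the latency cap is 0 microseconds.
apst_disabled() {
    [ "$1" -eq 0 ]
}

show_apst() {
    dev=${1:-/dev/nvme0}
    # Kernel-wide cap on APST latency (the boot parameter from the dmesg hint).
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
    # The drive's APST configuration, feature 0x0c, human-readable.
    nvme get-feature "$dev" -f 0x0c -H
}

disable_apst() {
    dev=${1:-/dev/nvme0}
    # Value 0 turns APST off until the next power cycle.
    nvme set-feature "$dev" -f 0x0c -v 0
}
```

If the dropouts stop with APST disabled on all four drives, that would point squarely at the power-save path rather than thermals or the PSU.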


So, thanks for all the ideas; this is a new one for me.

I updated the firmware on all 4 drives, so I guess we will see if that helps.

As for power saving: Advanced Power Management is turned off in TrueNAS, but I'm not sure that does what it should.

I have Samsung 980 Pros in my Windows machine, and it seems that for those drives full power mode is disabled; that mode "prevents the SSD from going to a sleep or idle state," which may be the issue. But the Samsung Magician software won't let you change that setting unless the drive has a Windows-type partition on it, so I tried but failed to set that feature to anything other than the default, whatever that was. Ditto for TRIM status and over-provisioning; those are the only three parameters Magician lets you set.

So I guess we wait until the next scrub and see what happens.

About the only thing left is the PSU; I have about 50% headroom on that, but you never know…

Here's hoping it was the firmware :slight_smile:

Fairly certain that NVMe only uses 3.3V. I mean, you could likely monitor that from the IPMI, but I'm also very certain that whatever you have hooked up isn't such a hard draw on the 3.3V line that it's causing your NVMe to drop.

Come to think of it, I guess it is possible that the 3.3V is generated on the motherboard from 12V rather than drawn directly from the PSU, but we're getting to the point where I'd need to start probing with a multimeter.

Anyway, if it were a PSU issue, you'd see instability present itself in everything else with a higher power draw before it narrowed down to a single NVMe drive.

The firmware seems to have been the issue; once I updated the 4 NVMe SSDs, the drives stopped dropping out of the vdev. It might be a coincidence, but I have scrubbed, SMART-tested, and stressed everything I can think of, with no issues.

This is literally only the 2nd time in my 50 years of doing this that a firmware issue caused a problem for me. I would never have figured this out, so thanks!

But then I checked the logs and found another vdev in the same server was having a similar issue: three drives were randomly (one at a time) dropping out of the array, but resilvering so quickly that it went unnoticed. This, and the NVMe issue (which said "removed," so you couldn't miss it), both started happening after I put the two PCIe NVMe adapter boards into that one server. Until then, everything was fine.

So I spent hours removing one drive at a time from the live arrays, updating the firmware, and resilvering back into the arrays. I did that 31 times :slight_smile: and now all the vdev drives in these two servers have the latest firmware. I haven't seen any issues yet, but it's only been a couple of days. I have been stressing all of them, over and over, with scrubs, SMART tests, etc., trying to force a failure, and nothing yet.

Strangely, the exact same model drives in the 2nd server never had this issue, so maybe it was some interaction with the PCIe boards. In any event, fingers crossed…
