One of the vdevs I use for VMs is 4 NVMe drives in a striped mirror.
Occasionally when scrubbing, one random drive will drop out of the vdev. A reboot brings it back with no issues. This has happened 4 or 5 times over the past few months. It seems truly random; I have even moved the drives into different slots and nothing tracks.
dmesg shows:
[891930.550148] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[891930.550580] nvme nvme0: Does your device have a faulty power saving mode enabled?
[891930.551003] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[891930.634105] nvme 0000:b3:00.0: Unable to change power state from D3cold to D0, device inaccessible
When I try to add the suggested parameters (which I think are sysctl parameters), TrueNAS says they aren't in the kernel.
Any suggestions? Is there a way to add those parameters, or is that just from the Debian kernel and something TrueNAS doesn't support?
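For what it's worth, those are kernel boot parameters, not sysctls, which is why the TrueNAS sysctl UI rejects them. On a plain Debian install they would go on the kernel command line via /etc/default/grub (followed by update-grub); TrueNAS SCALE manages its own bootloader, so manual edits there may not survive an upgrade. A sketch of what the stock-Debian config line would look like:

```shell
# /etc/default/grub on a stock Debian install; run update-grub afterward.
# TrueNAS SCALE regenerates its boot config, so this may not stick there.
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
```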
Managed devices/edit disk shows HDD standby = always on and advanced power management = disabled, so I'm not sure what else to do.
The NVMe drives are all Samsung 990 Pro 4TB.
The power supply should be big enough to run double what I have, but I guess you never know; if nothing else I can swap it out when I get a chance just to see.
I have some U.2 drives that dropped out under load and were fine once I rebooted (to initiate a rescan of the PCIe devices, since the box didn't support NVMe hot-swap).
I would bet this is a thermal issue. NVMe drives generally run fairly cool unless under load, and a scrub is a significant load.
There is a reason Samsung sells this model with a heatsink installed.
You can watch the NVMe temps during a scrub: start a scrub, wait a few minutes, then run smartctl -x /dev/nvme0 to see the real temperature. Repeat the command several times during the scrub to get an idea of the temperature fluctuations.
If the NVMe drive does not exceed its thermal limits and you are still having the problem, there are other things to try.
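A minimal sampling loop, assuming the affected drive is /dev/nvme0 as in the dmesg output (adjust the device and interval to taste):

```shell
# Print a timestamp and the composite temperature once a minute
# while the scrub runs; Ctrl-C to stop.
while true; do
    printf '%s  ' "$(date +%T)"
    smartctl -x /dev/nvme0 | grep -m1 -i 'Temperature:'
    sleep 60
done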
Has it always been nvme0 being dropped? If yes, is it the same serial number? If yes, I'd rotate the NVMe drives around one location: if you have a straight line of them, shift each one to the right one space and move the one on the end to the first slot. Now see if the problem comes back; if it does, note the drive by serial number and slot location. You could have a temperamental slot.
Do you know if your PCIe bus is running at PCIe 4.0 speeds? If yes, see if the motherboard lets you drop down to PCIe 3.0 speeds, and test again. While running the system slower may not be what you want, it would be a good indicator for troubleshooting.
Get creative in troubleshooting. Rule out as many things as possible. It will likely not be a quick fix.
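You can check the negotiated link speed without rebooting; b3:00.0 is the PCIe address from the dmesg output above, so substitute your own devices:

```shell
# LnkCap = what the slot/device is capable of; LnkSta = what was
# actually negotiated. 8 GT/s = PCIe 3.0, 16 GT/s = PCIe 4.0.
# Run as root for the full capability listing.
lspci -vv -s b3:00.0 | grep -E 'LnkCap:|LnkSta:'
```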
And while this may not be directly related to the problem you are asking about, have you run any stability tests? Prime95 CPU stress test for at least 4 hours, MemTest86+ for at least 5 complete passes?
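Prime95 and MemTest86+ are Windows/bootable tools; if you'd rather test from the running TrueNAS SCALE (Debian-based) box, stress-ng covers similar ground. This assumes stress-ng is available, which it may not be by default:

```shell
# CPU stress on all cores for 4 hours (matrixprod exercises FPU+cache):
stress-ng --cpu 0 --cpu-method matrixprod --timeout 4h
# Memory stress with verification of written patterns:
stress-ng --vm 2 --vm-bytes 80% --verify --timeout 1h
```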
No chance that they are missing firmware updates? There was a whole thing a couple of years back with Samsung. I personally had to pull a few drives and toss them into a Windows system to update them; it was a PITA.
I just checked, and they are on 4B2QJXD7, while the latest is 7B2QJXD7.
Not sure what all is different, but that could be it, based on a build I did with 980 Pros a couple years ago (not TrueNAS)…
I'm going to try to update them via a VM (removing one at a time and assigning it as a passthrough drive), and if that doesn't work, put them in a Windows box (ugh).
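There may also be an on-box route that avoids the Windows detour entirely. This is a sketch, assuming you can extract the firmware image (firmware.bin here is a placeholder name) from Samsung's bootable update ISO; nvme-cli ships with TrueNAS SCALE:

```shell
# Push the image to the drive, commit it, and activate at next reset.
nvme fw-download /dev/nvme0 --fw=firmware.bin
nvme fw-commit /dev/nvme0 --slot=0 --action=1   # slot 0: controller picks
nvme fw-log /dev/nvme0                          # verify after the reboot
```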
I have 4 of these, two each on two PCIe adapter boards.
I have rotated the drives themselves, their positions on the boards, and the boards' PCIe slots; the failures don't track anything in particular, except that they happen during a scrub.
I first thought it might be temperature, so I put two fans blowing directly on them; the temps never get very hot, so I don't think that's it either.
It may just be the firmware timing out, something I haven't addressed yet. I'll give that a shot tomorrow.
One thing I have heard affects NVMe reliability is power saving. Some NVMe drives can drop into low-power states, which can cause a delay in resuming full use. It is possible that this "feature" is enabled by default on the affected drives, AND that the delay is long enough for ZFS to decide the drive failed.
When ZFS was originally written, power saving was never considered. (Solaris is an Enterprise Data Center OS…)
Some of the power save features that hard disk drives have today absolutely affect ZFS. For example, TLER (Time Limited Error Recovery): without it, desktop HDDs can spend more than 60 seconds retrying a bad block, and ZFS may drop the entire drive for that one block, even if spare blocks are available and RAID is used. Sometimes aggressive head parking, like on laptop drives, can also cause problems.
That said, there are NVMe-specific power saving settings you might want to check and make sure they are not set too aggressively. For example, PCIe power saving is actually a thing…
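You can inspect the drive's power states and whether APST (autonomous power state transitions, the mechanism the dmesg hint about default_ps_max_latency_us is aimed at) is enabled; device name is a placeholder:

```shell
# Does the controller support/advertise APST?
nvme id-ctrl /dev/nvme0 | grep -i apsta
# Current APST feature setting (feature id 0x0c), human-readable:
nvme get-feature /dev/nvme0 -f 0x0c -H
# Supported power states, with entry/exit latencies:
smartctl -c /dev/nvme0
```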
So thanks for all the ideas; this is a new one for me.
I updated the firmware on all 4 drives, so I guess we will see if that helps.
As for the power save: advanced power management is turned off in TrueNAS, but I'm not sure that does what it should.
I have Samsung 980 Pros in my Windows machine, and it seems for those drives full power mode is disabled; that mode "prevents the SSD from going to sleep or idle state", which may be the issue here. But the Samsung Magician software won't let you change that setting unless you have a Windows-type partition on the drive, so I tried but failed to set it to anything other than the default, whatever that was. Ditto for TRIM status and over-provisioning; those are the only three parameters Magician lets you set.
So I guess we wait until I do another scrub and see what happens.
About the only thing left is the PSU; I have about 50% headroom on that, but you never know…
Fairly certain that NVMe only uses 3.3 V. I'm sure you could monitor that from the IPMI, but I'm also very certain that whatever you have hooked up ain't such a hard draw on the 3.3 V line that it is causing your NVMe to drop.
Come to think of it, I guess it is possible that the 3.3 V is generated on the motherboard from 12 V rather than drawn directly from the PSU, but we're getting to the point where I'd need to start probing with a multimeter.
Anyway, if it was a PSU issue you'd see instability present itself in everything else with a higher power draw long before it narrowed down to a single NVMe drive.
The firmware seems to have been the issue; once I updated the 4 NVMe SSDs, the drives stopped dropping out of the vdev. Might be a coincidence, but I have scrubbed, SMART-tested, and stressed everything I can think of, with no issues.
This is literally only the 2nd time in my 50 years of doing this that a firmware issue caused a problem for me. I would never have figured this out, so thanks!
But then I checked the logs and found another vdev in the same server was having a similar issue: three drives were randomly (one at a time) dropping out of the array, but resilvering so quickly that it went unnoticed. This, and the NVMe issue (which said "removed", so you couldn't miss it), both started happening after I put the two PCIe NVMe adapter boards into the one server. Until then everything was fine.
So I spent hours removing one drive at a time from the live arrays, updating the firmware, and resilvering back into the arrays. I did that 31 times, and now all the vdev drives in these two servers have the latest firmware. I haven't seen any issues yet, but it's only been a couple of days. I have been stressing all of them, over and over, with scrubs, SMART tests, etc., trying to provoke a failure, and nothing yet.
Strange that the exact same model drives in the 2nd server never had this issue, so maybe it was some interaction with the PCIe boards. In any event, fingers crossed…
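For anyone repeating this, the rough shape of one cycle looks like the following; the pool name "tank" and the device labels are placeholders for your own:

```shell
zpool offline tank nvme0n1   # take the drive out of the live pool
# ...update the drive's firmware, then reboot/reset so it activates...
zpool online tank nvme0n1    # ZFS resilvers it back in automatically
zpool status tank            # confirm the resilver completes cleanly
```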