Here’s a reference to my old thread… I thought I had it fixed.
I have now had two more failures in the last month. Kind of lost as to what to check here. Failures always seem to occur within 30 days of a reboot; this one happened about 5 days after the last reboot. If the system can go over 30 days, the likelihood they fail is low. If it goes to 100+ days, I don't think they've ever had a failure in that case.
R730XD
ASUS M.2 4 Slot Card (PCIE Bifurcation enabled)
Samsung 990 Pro
My first SSD, a 2.5" SATA Samsung 1TB, had a bug in which data corruption could occur because of the multi-level cell read methodology. Basically, to store multiple bits in a single cell, the drive relies on an analog voltage level. With too-tight read tolerances, the bits could be lost.
The cells leak charge over time, hence the need for a range of voltages per cell to determine what the output bits should be.
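As a toy illustration of why those read tolerances matter (my own sketch with made-up numbers, not Samsung's actual algorithm): a 2-bit MLC cell is decoded by comparing its read voltage against a set of thresholds, and charge leakage slowly drifts the voltage toward a neighboring window:

```shell
# Toy sketch of threshold-based MLC reads (illustrative values, not real
# NAND voltages). Each 2-bit cell is decoded by comparing its analog
# read voltage (arbitrary units) against three thresholds.
decode_cell() {
  v=$1
  if   [ "$v" -lt 100 ]; then echo "11"   # lowest charge level
  elif [ "$v" -lt 200 ]; then echo "10"
  elif [ "$v" -lt 300 ]; then echo "01"
  else                        echo "00"   # highest charge level
  fi
}

decode_cell 250   # freshly written cell, lands in the "01" window
decode_cell 195   # same cell after charge leakage: now misread as "10"
```

With tighter (narrower) windows, less leakage is needed before a cell crosses into the wrong window, which is the failure mode the firmware update reportedly addressed.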
Anyway, Samsung released a firmware update that supposedly solved that problem. And close to 10 years later I am still using that SSD, (though not much, it’s in my old laptop).
Try some Kingston KC3000? I have moved to those for my main rig, but my TrueNAS does have 2 x 980 PRO 2TB NVMe drives running right off the mobo's main NVMe links, and no issues for over a year now…
All, I did check firmware; everything is up to date… I think I finally stumbled upon the issue. The drives are physically operating in power states 3 and 4.
States 3 and 4 are non-operational and used only for idle. I've now checked this multiple times and it is still true.
I am trying to force them to stay in power state 1. Hopefully they stay cool.
Not sure it was needed to do it both ways, but the drives are now in power state 0, and I've confirmed this via sudo nvme get-feature -f 0x0c -H /dev/nvme0
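For anyone following along, this is roughly the nvme-cli sequence I'd use (hedged: feature IDs come from the NVMe spec, 0x02 = Power Management and 0x0c = Autonomous Power State Transition; the device path and the sample output line are examples, not my actual drive's output):

```shell
# Commands commented out because they need real hardware and root:
# nvme get-feature /dev/nvme0 -f 0x0c -H    # is APST enabled, plus its table
# nvme get-feature /dev/nvme0 -f 0x02 -H    # current requested power state
# nvme set-feature /dev/nvme0 -f 0x02 -v 0  # request power state 0

# Pulling the current power state out of the get-feature output; the
# sample line below stands in for real nvme-cli output.
sample='get-feature:0x02 (Power Management), Current value:0x00000004'
ps=$(printf '%s\n' "$sample" | sed -n 's/.*Current value:0x0*\([0-9a-f]\+\).*/\1/p')
echo "current power state: $ps"
```

The other way I understand people disable autonomous transitions on Linux is globally, via the `nvme_core.default_ps_max_latency_us=0` kernel parameter, which forbids APST from entering any low-power state.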
Additionally, I dug through my SMART test reports that run via the multi_report tool available on the forums… idk how I missed this earlier, but the drives are almost always in power state 4… The one that failed last reported power state 0 in its final report. Seems like APST was not working correctly. I will update this thread when I get significant time on the drives.
It should not, under normal conditions, but this is likely a hardware or firmware issue, so… it could be. Waiting for @loca5790's data in order to validate any theory.
*RAIDZ! There are still multiple writes occurring across the drives in the RAIDZ setup, which to me seems like it could cause issues with power state requests, whether it's a hardware or a software issue. If it were a single drive writing data, I doubt there'd be the same effect.
It's been an additional 11 days; the drives are staying in power state 0 and functioning as expected. I will do another long-term update in a few months.
Come to think of it… I do have a 3x mirror, 2-wide array in one of my pools as well, to increase the SSD read/write speeds. Inherently, RAIDZ does not increase true write performance, if my memory serves correctly; it's more a data-redundancy storage layout.
I can confirm that everything is still working fine.
I will note that two of the drives do show this weird reporting:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 44 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 7,111,160 [3.64 TB]
Data Units Written: 38,276,183 [19.5 TB]
Host Read Commands: 357,561,116
Host Write Commands: 1,197,663,352
Controller Busy Time: 4,795
Power Cycles: 6
Power On Hours: 1,736
and the Controller Busy Time is updating at the same rate as Power On Hours (i.e. 100 hours of power-on equals 100 counts of controller busy time, which I believe is in minutes). This counter's growth is drastically lower since the power state change.
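To put a number on that, here's the arithmetic on the log above (my calculation, assuming Controller Busy Time is reported in minutes, per the NVMe SMART log definition):

```shell
# Controller Busy Time vs Power On Hours, using the SMART log above.
busy_min=4795   # Controller Busy Time (assumed to be minutes)
poh=1736        # Power On Hours
awk -v b="$busy_min" -v h="$poh" \
  'BEGIN { printf "%.2f busy-minutes per power-on hour\n", b/h }'
```

That cumulative average works out to about 2.76 busy-minutes per power-on hour; if the counter is now only ticking ~1 minute per hour as described, most of that busy time would have accrued before the power state fix.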