Samsung 990 Pro Early Failures (x4)

Here’s a reference to my old thread… I thought I had it fixed.

I have now had two more failures in the last month and am kind of lost as to what to check here. Failures always seem to occur within 30 days of a reboot; this one happened about 5 days after the last reboot. If the system can go more than 30 days without a reboot, the likelihood of a failure is low, and if uptime gets to 100+ days I don’t think the drives have ever failed.

R730XD
ASUS M.2 4 Slot Card (PCIE Bifurcation enabled)
Samsung 990 Pro

That nasty issue, yeah.
I would suggest changing drives.

Well, I would check for a firmware update.

My first SSD, a 2.5" SATA Samsung 1TB, had a bug in which data corruption could occur because of the multi-level cell read methodology. Basically, in order to store multiple bits in a single cell, the drive relies on analog voltage levels; with read tolerances that are too tight, the bits can be lost.

The cells leak charge over time, thus the need to have a range of voltages for each cell to determine what the output bits should be.

Anyway, Samsung released a firmware update that supposedly solved that problem. And close to 10 years later I am still using that SSD, (though not much, it’s in my old laptop).

2 Likes

Supposedly they run the latest firmware… at least according to the original thread, if I read it correctly.

1 Like

Try some Kingston KC3000? I have moved to those for my main rig, but my TrueNAS does have 2 x 980 PRO 2TB NVMe drives running right off the mobo’s main NVMe links, and no issues for over a year now…

I’ve read the old thread. Did you ever end up applying a newer firmware update?

All, I did check the firmware; everything is up to date… I think I finally stumbled upon the issue: the drives are physically operating in power modes 3 and 4.

Modes 3 and 4 are non-operational modes and used only for idle. I’ve now checked it multiple times and this is still true.
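
For anyone wanting to check this on their own system (assuming nvme-cli is installed and /dev/nvme0 is one of the drives), the power state table and the current power state can be read back with:

sudo nvme id-ctrl -H /dev/nvme0

sudo nvme get-feature -f 0x02 -H /dev/nvme0

The id-ctrl output lists each power state descriptor, with ps 3 and ps 4 marked non-operational; the get-feature query reads back the Power Management feature, though how faithfully it reflects autonomous transitions seems to vary by drive.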

I am trying to force them to stay in Power mode 1. Hopefully they stay cool :smiley:

1 Like

Interesting discovery. Keep us updated.

Ok…

Power states are now forced to 0 and persistent across boots. Had to disable APST via the following.

sudo nano /etc/default/grub.d/truenas.cfg

added nvme_core.default_ps_max_latency_us=0 to GRUB_CMDLINE_LINUX_DEFAULT
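
For reference, the edited line in that override file ends up looking something like this (illustrative only; the existing contents of truenas.cfg will differ, the point is just appending the parameter to GRUB_CMDLINE_LINUX_DEFAULT):

GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT nvme_core.default_ps_max_latency_us=0"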

sudo update-grub

then ran

midclt call system.advanced.update '{"kernel_extra_options": "nvme_core.default_ps_max_latency_us=0"}'
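
To double-check that the middleware setting stuck, something like this should echo it back (system.advanced.config is the read side of the same API; jq is only for readability and can be dropped):

midclt call system.advanced.config | jq .kernel_extra_options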

Not sure it was necessary to do it both ways, but the drives are now in power state 0, and I’ve confirmed this via sudo nvme get-feature -f 0x0c -H /dev/nvme0
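
Another quick sanity check after a reboot, to confirm the parameter actually made it onto the kernel command line:

grep -o 'nvme_core[^ ]*' /proc/cmdline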

Additionally, I dug through my SMART test reports that run via the multi_report tool available on the forums… I don’t know how I missed this earlier, but the drives are almost always in power state 4… The drive that failed most recently reported power state 0 in its last report. It seems like APST was not working correctly. I will update this thread when I get significant time on the drives.

2 Likes

Question: does the Advanced Power Management setting in Storage>Disks have anything to do with this issue/potential solution?

It should not under normal conditions, but this is likely a hardware or firmware issue, so… it could be. Waiting for @loca5790’s data in order to validate any theory.

1 Like

Drives are consistently staying in power state 0.

I will have to update again after some more time has been put on the drives.

I’m not sure why they would sit in those states, other than potentially an incompatibility with RAID.

You should not be using RAID with ZFS.

*RAIDZ! There are still multiple writes occurring across the drives in the RAIDZ setup, just as with RAID, which to me seems like it could cause issues with the power states being requested, whether it’s a hardware or a software issue. If it were a single drive writing data, I doubt there’d be the same effect.

It’s been an additional 11 days; power states are staying at 0 and the drives are functioning as expected. I will do another long-term update in a few months.

Come to think of it… I do have a 3x mirror, 2-wide array in one of my pools as well, to increase the SSD read/write speeds. Inherently, RAIDZ does not increase true write performance if my memory serves correctly; it’s more of a data-redundancy storage layout.
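
To illustrate the shape of that layout (hypothetical pool and device names, not my actual pool), it’s the equivalent of something like:

zpool create tank mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1 mirror nvme4n1 nvme5n1

i.e., three 2-way mirror vdevs striped together, which is what gives the read/write scaling; a RAIDZ vdev mostly buys redundancy and capacity efficiency rather than raw speed.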

I can confirm that everything is still working fine.

I will note that two of the drives do show this weird reporting:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 44 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 7,111,160 [3.64 TB]
Data Units Written: 38,276,183 [19.5 TB]
Host Read Commands: 357,561,116
Host Write Commands: 1,197,663,352
Controller Busy Time: 4,795
Power Cycles: 6
Power On Hours: 1,736

and the controller busy time is now updating at about the same rate as the power-on hours (i.e., 100 hours of power-on adds roughly 100 counts of controller busy time, which I believe is in minutes). That rate of increase is drastically lower than before the power state change.
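
For a rough sense of scale (assuming controller busy time really is in minutes, which is how the NVMe spec defines it): 4,795 minutes over 1,736 power-on hours works out to roughly 2.8 busy-minutes per hour averaged over the drive’s life, so a current rate of about 1 per hour is a noticeable drop.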

1 Like