Unexplained, one-time NVMe pool "error resulting in data corruption"

BackDatNASUp · August 14, 2025, 10:08pm

I was running TrueNAS Scale Electric Eel when the following happened on 2025-07-06. The pool described below is on a single drive (yes I know I will lose data if it fails).

I woke up to find 3 Critical errors (in chronological order):

Device: /dev/nvme0n1, failed to read NVMe SMART/Health Information.
Replication "[TASK NAME]" failed: resume token contents: nvlist version: 0 object = 0x75873 offset = 0x0 bytes = 0xb2e711c1c toguid = 0xbea8d7972631dd10 toname = [POOL NAME]/[DATASET NAME]@auto-2025-06-09_04-00 compressok = 1 rawok = 1 warning: cannot send '[POOL NAME]/[DATASET NAME]@auto-2025-06-09_04-00': Input/output error cannot receive resume stream: checksum mismatch or incomplete stream. Partially received snapshot is saved. A resuming stream can be generated on the sending system by running: zfs send -t 1 [...]
Pool [POOL NAME] state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

Observations:

All of my apps were down at this time
The pool in question was missing in the UI
I had a VM running on the same NVMe drive that was working without issue

After rebooting the system:

Everything was back to normal
I ran a scrub on the pool in question, which yielded no errors
I checked the SMART data for the drive in question, and it showed no errors
The UI shows no errors for this drive
The replication task described in Critical error #2 (above) completed successfully

That was now almost 6 weeks ago and:

I haven’t had any other issues
I see 0 evidence of corruption or performance issues

I’m really confused by this:

This is a good-quality drive (Samsung 990 PRO) that hasn’t been used very extensively
I don’t understand how TrueNAS can experience such a meltdown, yet not show any errors anywhere that I can see
It seems like, for whatever reason, TrueNAS was temporarily unable to read the SMART data on the drive, then assumed the drive had failed

Anyone have ideas?

swc-phil · August 14, 2025, 10:48pm

You should run a few rounds of Memtest86(+). Just in case.

joeschmuck · August 15, 2025, 12:31am

Did you perform a SMART Long test? You need to use nvme commands, TrueNAS will not test your NVMe drive yet.

It doesn’t work that way. Odds are lack of RAM.

You are asking for some help, even if it is informational, yet you did not list your hardware making up your system. This is critical data in providing you what we suspect it might have been and also any data to look at that might provide a clue, and any testing which might be prudent.

I agree with @swc-phil that you should run MemTest86+, run it for five (5) complete passes. Then run a CPU Stress test like Prime95 for at least 4 hours.

We can’t stress test every component but these two tests weed out a lot of things.

Without knowing, I’m going to say that you ran out of memory and the system crashed. If that happened or something similar, examine the Reports charts and scroll back to the day in question. Look to see how much RAM was free.

You are lucky. Most people have problems like this and lose data.

BackDatNASUp · August 23, 2025, 5:55pm

I performed multiple Long tests and see no errors. I have 94 GiB of RAM, which is typically allocated as follows:

Services: 15 GiB
Free: 6 GiB
Cache: 73 GiB

I actually realized that /dev/nvme0n1 (the drive it failed to read SMART data from) is my boot drive, not the drive with the apps. However, I’ve run Long SMART tests on this as well and see no errors. This is also a much newer Samsung 990 PRO drive.

joeschmuck · August 23, 2025, 6:21pm

How did you run the SMART Long test? TrueNAS GUI, CLI?

truenas-fan · August 23, 2025, 6:29pm

A proper OS deals with low memory properly, plus, if that was the issue, it should happen more than once, right?!

I doubt Memtest, always recommend by people that know some, but lack deep digital systems knowledge, will find anything, as true bad memory is evident and again, it would happen more often!

truenas-fan · August 23, 2025, 6:34pm

Wait…
Samsung 990 PRO is known to have issues…
Let’s check that first before the non-hardware people start with their Memtest and other waste of time.

BackDatNASUp · August 23, 2025, 6:43pm

I ran it via the GUI, then went into the CLI to check its progress. It was showing that it was running. I thought the NVMe issue with the GUI was it wound run the tests, but not show the results - is that incorrect? What is the nvme command to run SMART Long test? I didn’t see it in the man page, but may have missed it.

Running command sudo nvme self-test-log /dev/nvme0n1 shows several 0xf values for Self Test Results [N]:. Not sure what that means, but command sudo smartctl -x /dev/nvme0n1 shows 0 values for Media and Data Integrity Errors and Error Information Log Entries.

BackDatNASUp · August 23, 2025, 6:44pm

Do you have any info to cite that I could read about? I sure hope not - I recently installed this to replace a Kingston boot drive that failed on me, which I actually did see errors for in the CLI.

winnielinnie · August 23, 2025, 7:04pm

Eight different users on these forums resulting in failed memtest runs within a year of each other, which helped to steer them into fixing the issue that should have been “evident”… without them actually running the tests? ^[1] ^[2] ^[3] ^[4] ^[5] ^[6] ^[7] ^[8]

It takes very little time and effort to run a memtest to rule out bad RAM for a NAS server that writes and reads important data. It’s not the end of the world to run them. I even do so for my main and backup servers when I have to power them down or do an upgrade. Since I’ve already powered down, I might as well run a couple passes, in the same spirit as running extended SMART tests on my drives.

joeschmuck · August 23, 2025, 8:05pm

As far as scheduling a routine test, this is true.

I am very curious by what exact steps you took to run the Smart Test. I’m asking incase I am totally incorrect. I’d love it if I were in this case.

I have yet to see TrueNAS GUI using the SMART Tests under Data Protection actually run a SMART test on NVMe media.

You can run a SMART test a few ways right now, use the CLI to run smartctl -t long /dev/nvme0 and that will work and is the safer and easier method, or using the nvme command as you mentioned, which is a bit more involved. Before smartmontools 7.4 was included, you had to use the nvme commands to do any testing. TrueNAS (the programmers) is waiting for smartmontools 7.5 to reach the debian distro. It has the smartd which will fix the TrueNAS scheduling of SMART tests, or so the ticket says. This has been a problem for a while.

When you check the status, you run smartctl -a /dev/nvme0 and you compare the Power On Hours (ID 9) to the tests run at the bottom. The hour value should be when the test completed. NVMe drives do not take long to perform a Long test, my 4TB drives only take 20 minutes. If the Power On Hours values to not line up, then the test was not run.

The Media Integrity Errors being zero is a must! These values related to an Uncorrectable Sector Error for a HDD. The difference is, with NVMe, the problem could be anything within the electronics causing it. Bad memory, bad controller, or bad something. I have a single 4TB NVMe drive that has media errors. I stopped those from growing by reformatting the drive, which mapped the failing parts out of existence. I still have the errors listed, they live forever with the drive data. I am not using this drive for anything other than testing. The manufacturer shipped me a replacement and did not ask for the return of the failed item.

While this may be a “HOT” drive (most of the complaints I’ve read) and could be the cause, in your original post all got better. This leads us to cover the items which have been discussed. I do not agree with assuming it was just the NVMe drive. You test it, if it fails, okay, obvious failure. But if it passes, now we have an intermittent problem. This leads us to system stability / stress tests with the hope to force the failure to show itself, assuming there is an actual hardware failure.

Memtest is not a waste of time, especially if you have a problem and are having a difficult time locating it. However if you have run it for 5 complete passes, then you can call it good “for now”. An intermittent problem can test good for days and then pop up out of nowhere.

And you can read what my colleague posted directly above.

BackDatNASUp · August 23, 2025, 8:38pm

I ran the SMART Long tests on both of my NVMe drives via the following in the web UI: Storage → Disks → Select disk → Manual Test → Type: Long → Start

I’m running the smartctl -t long /dev/nvme0 command you provided (thank you) and confirm it is running via smartctl -a /dev/nvme0. Will report back when it finishes.

This drive is usually running at 36C and showing a max of 43C, so that should be good.

I will run a memtest when I get a good opportunity. I have to find the right time because I rely heavily on this system (NAS, NVR, home automation, etc.), which is an unfortunate situation.

BackDatNASUp · August 23, 2025, 9:16pm

Sorry I appear to have replied to the wrong person - see my previous message. The long SMART test completed without error, as evidenced by smartctl -a /dev/nvme0 on the entry with the matching Power On Hours.

joeschmuck · August 24, 2025, 1:56am

Not a problem here at all with the reply.

Glad the Long test passed.

If you wanted to run routine SMART tests on the NVMe drive, you have a few options:

Create a CRON Job to run a daily short test and a weekly long test. That is actually very easy and the simplest to do if you only have a few drives to do.
Manually run the tests as you see fit.
Use Multi-Report or possibly just Drive-Selftest, both scripts I maintain.

That is very good. Some of those drives get very hot, 60C + and remain there.

I understand and when you do find the time, test your system. I’m taking one of my systems down to do some cleaning (dusting) and then replacing all the thermal pads on my NVMe drives. Once I remove those drives I plan to run MemTest86+ for as long as it takes me to replace those thermal pads. I’m slow as I would rather not buy $300 NVMe drives for being in a rush.

Topic		Replies	Views
NVMe Short S.M.A.R.T. Test Returns "failed segments" For All Tests TrueNAS General SCALE , Hardware , SMART , NVMe	11	544	August 4, 2025
Really high I/O Wait time TrueNAS General	7	142	August 10, 2025
SMART tests fail but badblocks passes TrueNAS General Hardware , SAS , SMART	16	461	May 13, 2025
Help with ZFS errors TrueNAS General SCALE , ZFS , Troubleshooting	15	216	January 1, 2026
First failed drive, it looks healthy on paper, what to do, given it's under warranty? TrueNAS General Hardware	13	209	September 30, 2024

Unexplained, one-time NVMe pool "error resulting in data corruption"

Related topics