I was running TrueNAS Scale Electric Eel when the following happened on 2025-07-06. The pool described below is on a single drive (yes, I know I will lose data if it fails).
I woke up to find 3 Critical errors (in chronological order):
1. Device: /dev/nvme0n1, failed to read NVMe SMART/Health Information.
2. Replication "[TASK NAME]" failed: resume token contents: nvlist version: 0 object = 0x75873 offset = 0x0 bytes = 0xb2e711c1c toguid = 0xbea8d7972631dd10 toname = [POOL NAME]/[DATASET NAME]@auto-2025-06-09_04-00 compressok = 1 rawok = 1 warning: cannot send '[POOL NAME]/[DATASET NAME]@auto-2025-06-09_04-00': Input/output error cannot receive resume stream: checksum mismatch or incomplete stream. Partially received snapshot is saved. A resuming stream can be generated on the sending system by running: zfs send -t 1 [...]
3. Pool [POOL NAME] state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
Observations:
- All of my apps were down at this time
- The pool in question was missing in the UI
- I had a VM running on the same NVMe drive that was working without issue
After rebooting the system:
- Everything was back to normal
- I ran a scrub on the pool in question, which yielded no errors
- I checked the SMART data for the drive in question, and it showed no errors
- The UI shows no errors for this drive
- The replication task described in Critical error #2 (above) completed successfully
That was almost 6 weeks ago, and since then:
- I haven’t had any other issues
- I see no evidence of corruption or performance issues
I’m really confused by this:
- This is a good-quality drive (Samsung 990 PRO) that hasn’t been used very extensively
- I don’t understand how TrueNAS can experience such a meltdown, yet not show any errors anywhere that I can see after the reboot
- It seems like, for whatever reason, TrueNAS was temporarily unable to read the SMART data on the drive, then assumed the drive had failed
Anyone have ideas?