I had SMART errors (short test and extended offline) but no ZFS errors: replace or wait?

Hi,
my NAS (TrueNAS Scale) has 6 hard disks, and one of them recently showed errors. The SMART info for the drive showed both the Last Short Test and the Last Extended Offline Test as: FAILURE!
But the ZFS info for the drive showed NO errors!

I replaced the drive anyway and resilvered the new drive. But I have some questions:

  1. As long as there are no ZFS errors, would it be “safe” to keep using the drive until ZFS errors occur? My pool is a RAIDZ2, which means two of the drives can fail without data loss. In other words, should I replace a drive as soon as any errors occur, or only once ZFS errors occur?
  2. Is there anything I can do to test whether the old drive is still usable? Maybe format and retest, or something like that? Or should I throw it away?

That’s what redundancy is for: preserving data integrity even when the underlying media is not reliable.
But for ZFS to properly protect your data, you should still react to hardware failures and act in a timely manner, which you have done. When ZFS errors occur, you have lost data.
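
As a rough sketch of what reacting in time can look like, assuming a shell on the NAS and /dev/sdX as a placeholder for the suspect drive:

# show only pools that currently have problems; prints "all pools are healthy" otherwise
sudo zpool status -x

# quick overall SMART health verdict for the drive
sudo smartctl -H /dev/sdX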

You may use this resource to assess the condition of the failed drive:

But, essentially, if it has failed a long SMART test, the drive is due for RMA or for disposal.
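
To confirm what the drive itself reports, a minimal check (with /dev/sdX again a placeholder for the old drive) would be:

# print the drive's self-test log, including the failed short and extended tests
sudo smartctl -l selftest /dev/sdX

# full SMART report: health, attributes and error log in one go
sudo smartctl -a /dev/sdX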

Not necessarily. ZFS can show errors on one device in a redundant vdev with no loss of data. That’s, as you say, what redundancy is for. But:

I do not understand why so many people apparently believe there should be a connection between these two things.

Agreed.

Ok, thanks. Then I will throw it away.

you have lost data when a ZFS error occurs

No, you haven’t. At least, not necessarily. If you have a mirrored vdev, and one device shows a ZFS checksum error while the other is clear, you haven’t lost any data.
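
In that case ZFS can rewrite the bad copy from the good one. A minimal sketch, assuming a pool named tank (placeholder):

# read every block and repair anything that fails checksum using the healthy mirror side
sudo zpool scrub tank

# after the scrub completes with no new errors, reset the READ/WRITE/CKSUM counters
sudo zpool clear tank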

I appreciate the TrueNAS forum; you are really helping your community.

Then I don’t know how you define “ZFS errors”.
For me, a “ZFS error” is when zpool status -v ends with a list of files or metadata indexes. Not good…
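
In other words, the case where the output does not end with “No known data errors” but with a list of affected files; for example (pool name is just a placeholder):

# a healthy pool ends with "errors: No known data errors";
# permanent damage ends with a list of affected files or metadata objects instead
sudo zpool status -v poolname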

I think that’s an unduly narrow definition. I’d consider any non-zero value under READ, WRITE, or CKSUM to constitute a ZFS error, like this:

admin@nas[~]$ sudo zpool status software
  pool: software
 state: ONLINE
  scan: scrub repaired 0B in 00:04:17 with 0 errors on Sun Mar 30 00:04:18 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	software                                  ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    7403f6bc-8b82-48a8-85f5-9577ea321109  ONLINE       0     0     0
	    276fbdc8-621e-4450-8336-5f926fcc8452  ONLINE       0     0     8

errors: No known data errors

After the disk swap, the old drive passed an extended test. I also have to mention that I had unplugged the disk to clean the NAS case and then plugged it back in. Is there some kind of stress test I can perform on the drive to see if the errors come back?
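
For reference, one common way to stress a suspect drive, assuming it shows up as /dev/sdX (placeholder) and that nothing on it needs to be kept, is a full destructive write pass followed by another extended SMART test:

# destructive surface test: writes patterns to every sector and reads them back
# (this wipes the whole drive)
sudo badblocks -wsv /dev/sdX

# then run another extended self-test and check the results once it has finished
sudo smartctl -t long /dev/sdX
sudo smartctl -a /dev/sdX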