I had SMART errors (short test and extended offline) but no ZFS errors: replace or wait?

Hi,
my NAS (TrueNAS Scale) has 6 hard disks, and one of them recently showed errors. The SMART info for the drive showed both the Last Short Test and the Last Extended Offline Test as: FAILURE!
But the ZFS info for the drive showed NO errors!

I replaced the drive anyway and resilvered the new drive. But I have some questions:

  1. As long as there are no ZFS errors, would it be “safe” to keep using the drive until ZFS errors occur? My pool is a RAIDZ2, which means two of the drives can fail without data loss. In other words, should I replace a drive as soon as any errors occur, or only once ZFS errors occur?
  2. Is there anything I can do to test whether the old drive is still usable? Maybe format and retest, or something like that? Or should I throw it away?

That’s what redundancy is for: preserving data integrity even when the underlying media is not reliable.
But for ZFS to properly protect your data, you should still react to hardware failures and act in a timely manner, which you have done. When ZFS errors occur, you have lost data.
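
As a rough sketch of what reacting in time can look like, assuming a shell on the NAS and /dev/sdX as a placeholder for the suspect drive:

# show only pools that currently have problems; prints "all pools are healthy" otherwise
sudo zpool status -x

# quick overall SMART health verdict for the drive
sudo smartctl -H /dev/sdX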

You may use this resource to assess the condition of the failed drive:

But, essentially, if it has failed a long SMART test, the drive is due for RMA or for disposal.
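
To confirm what the drive itself reports, a minimal check (with /dev/sdX again a placeholder for the old drive) would be:

# print the drive's self-test log, including the failed short and extended tests
sudo smartctl -l selftest /dev/sdX

# full SMART report: health, attributes and error log in one go
sudo smartctl -a /dev/sdX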

Not necessarily. ZFS can show errors on one device in a redundant vdev with no loss of data. That’s, as you say, what redundancy is for. But:

I do not understand why so many people apparently believe there should be a connection between these two things.

Agreed.

Ok, thanks. Then I will throw it away.

you have lost data when a ZFS error occurs

No, you haven’t. At least, not necessarily. If you have a mirrored vdev, and one device shows a ZFS checksum error while the other is clear, you haven’t lost any data.
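
In that case ZFS can rewrite the bad copy from the good one. A minimal sketch, assuming a pool named tank (placeholder):

# read every block and repair anything that fails checksum using the healthy mirror side
sudo zpool scrub tank

# after the scrub completes with no new errors, reset the READ/WRITE/CKSUM counters
sudo zpool clear tank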

I appreciate the TrueNAS forum; you are really helping your community.

Then I don’t know how you define “ZFS errors”.
For me, a “ZFS error” is when zpool status -v ends with a list of files or metadata indexes. Not good…
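
In other words, the case where the output does not end with “No known data errors” but with a list of affected files; for example (pool name is just a placeholder):

# a healthy pool ends with "errors: No known data errors";
# permanent damage ends with a list of affected files or metadata objects instead
sudo zpool status -v poolname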

I think that’s an unduly narrow definition. I’d consider any non-zero value under READ, WRITE, or CKSUM to constitute a ZFS error, like this:

admin@nas[~]$ sudo zpool status software
  pool: software
 state: ONLINE
  scan: scrub repaired 0B in 00:04:17 with 0 errors on Sun Mar 30 00:04:18 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	software                                  ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    7403f6bc-8b82-48a8-85f5-9577ea321109  ONLINE       0     0     0
	    276fbdc8-621e-4450-8336-5f926fcc8452  ONLINE       0     0     8

errors: No known data errors

After the disk swap, the old drive passed an extended test. I also have to mention that I had unplugged the disk to clean the NAS case and then plugged it back in. Is there some kind of stress test I can perform on the drive to see if the errors come back?
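
For reference, one common way to stress a suspect drive, assuming it shows up as /dev/sdX (placeholder) and that nothing on it needs to be kept, is a full destructive write pass followed by another extended SMART test:

# destructive surface test: writes patterns to every sector and reads them back
# (this wipes the whole drive)
sudo badblocks -wsv /dev/sdX

# then run another extended self-test and check the results once it has finished
sudo smartctl -t long /dev/sdX
sudo smartctl -a /dev/sdX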