Hi,
my NAS (TrueNAS Scale) has 6 hard disks and one of them recently showed errors. The SMART Info for the drive showed: Last Short Test and Last Extended Offline Test both with: FAILURE!
But the ZFS info for the drive showed NO errors!
I replaced the drive anyway and resilvered the new drive. But I have some questions:
As long as there are no ZFS errors, would it be “safe” to keep using the drive until ZFS errors occur? My pool is a RAIDZ2, which means two of the drives can fail without data loss. In other words, should I replace the drive as soon as any errors occur, or only once ZFS errors occur?
Is there anything I can do to test whether the old drive is still usable? Maybe format and retest, or something like that? Or should I throw it away?
That’s what redundancy is for: preserving data integrity even when the underlying media is not reliable.
But for ZFS to properly protect your data, you should still react to hardware failures promptly, which you have done. When ZFS errors occur, you have lost data.
You may use this resource to assess the condition of the failed drive:
But, essentially, if it has failed a long SMART test, the drive is due for RMA or for disposal.
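If you want to check the self-test history from a script rather than by eye, here is a minimal Python sketch. It assumes the text layout that `smartctl -l selftest /dev/sdX` typically prints for ATA drives; the sample log below is made up for illustration (it is not from the drive in this thread), and the function name `self_tests_not_passed` is hypothetical. SCSI and NVMe drives report self-tests in a different format, so treat this as a starting point only.

```python
import re

# Illustrative sample only, modeled on the usual ATA self-test log layout.
SAMPLE_LOG = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     20514         123456789
# 2  Short offline       Completed: read failure       90%     20510         123456789
# 3  Extended offline    Completed without error       00%     19840         -
"""


def self_tests_not_passed(log_text):
    """Return the entry numbers of self-tests that did not finish cleanly.

    Any entry whose status is not "Completed without error" is reported,
    so aborted or still-running tests show up here as well.
    """
    flagged = []
    for line in log_text.splitlines():
        # Log entries start with "# <number>"; header lines do not match.
        match = re.match(r"#\s*(\d+)\s+(.*)", line)
        if not match:
            continue
        if "Completed without error" not in match.group(2):
            flagged.append(int(match.group(1)))
    return flagged


if __name__ == "__main__":
    print(self_tests_not_passed(SAMPLE_LOG))
```

On a real system you would feed it the captured output of `smartctl -l selftest` for the drive in question instead of `SAMPLE_LOG`.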
No, you haven’t. At least, not necessarily. If you have a mirrored vdev, and one device shows a ZFS checksum error while the other is clear, you haven’t lost any data.
I think that’s an unduly narrow definition. I’d consider any non-zero value under READ, WRITE, or CKSUM to constitute a ZFS error, like this:
admin@nas[~]$ sudo zpool status software
  pool: software
 state: ONLINE
  scan: scrub repaired 0B in 00:04:17 with 0 errors on Sun Mar 30 00:04:18 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        software                                  ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            7403f6bc-8b82-48a8-85f5-9577ea321109  ONLINE       0     0     0
            276fbdc8-621e-4450-8336-5f926fcc8452  ONLINE       0     0     8

errors: No known data errors
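To apply that broader definition automatically, here is a rough Python sketch that scans `zpool status` text and reports every row with a non-zero READ, WRITE, or CKSUM counter. The function name `devices_with_errors` is hypothetical, and the parsing assumes the usual five-column config table shown in the output above; other layouts (spares, log devices, status notes after the counters) may need extra handling.

```python
# The zpool status output quoted above, used here as test input.
SAMPLE_STATUS = """\
  pool: software
 state: ONLINE
  scan: scrub repaired 0B in 00:04:17 with 0 errors on Sun Mar 30 00:04:18 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        software                                  ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            7403f6bc-8b82-48a8-85f5-9577ea321109  ONLINE       0     0     0
            276fbdc8-621e-4450-8336-5f926fcc8452  ONLINE       0     0     8

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Map device name -> (READ, WRITE, CKSUM) for rows with any non-zero counter.

    Counters are kept as strings because large values may be printed in
    abbreviated form rather than as plain integers.
    """
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":        # header row of the config table
            in_config = True
            continue
        if fields[0] == "errors:":     # summary line that follows the table
            in_config = False
            continue
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        if any(counter != "0" for counter in (read, write, cksum)):
            flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, counters in devices_with_errors(SAMPLE_STATUS).items():
        print(name, counters)
```

For the pool above this flags only the second mirror member, even though the pool is ONLINE and the summary line reports no known data errors.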
After the disk swap, the old drive passed an extended test. I should also mention that I had unplugged the disk to clean the NAS case and then plugged it back in. Is there some kind of stress test I can perform on the drive to see if the errors come back?