I am running a virtualized TrueNAS SCALE instance with an HBA in PCI passthrough. The root disks are virtual disks provided by the host, while the data disks are connected to the HBA.
I have 4 data disks in total, organized into 2 mirror VDEVs. I run a pool scrub daily and a SMART short test daily on all disks; long tests run weekly, Monday to Thursday, one disk per day. I do not run any checks on the boot pool, as it is virtualized and backed up anyway.
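For context, the schedule is set up in the TrueNAS UI, but in cron terms the stagger is roughly equivalent to this (times, pool name, and device names are placeholders):

```
# daily scrub and daily short tests on the four data disks
0 1 * * *  zpool scrub tank
0 2 * * *  for d in sdb sdc sdd sde; do smartctl -t short /dev/$d; done
# one long test per disk, Monday through Thursday
0 3 * * 1  smartctl -t long /dev/sdb
0 3 * * 2  smartctl -t long /dev/sdc
0 3 * * 3  smartctl -t long /dev/sdd
0 3 * * 4  smartctl -t long /dev/sde
```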
Yesterday I got an alert about /dev/sde failing the smartctl long test. I examined the logs and found one reallocated sector, plus four ATA errors.
I manually scrubbed the pool and cleared the errors, since the scrub successfully fixed everything up. During the process I got another ATA error (five in total) and another reallocated sector.
I finally ran another long test on sde, but it failed again at around 60% remaining; the LBA appears to always be the same, at least judging by the ATA errors.
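For reference, these are the kinds of checks I have been running to gather the above (assuming the disk stays enumerated as /dev/sde; the pool name is a placeholder):

```
smartctl -l error /dev/sde      # ATA error log, where the repeating LBA shows up
smartctl -l selftest /dev/sde   # self-test history (pasted further down)
smartctl -A /dev/sde            # attribute table: Reallocated_Sector_Ct and friends
zpool status -v tank            # pool-side view of read/write/checksum errors
```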
I am a bit at a loss about how to investigate/debug this further; my impression is that I have to monitor this disk closely and see what happens. It could still be a failing cable, but for some reason it does not look like a cable problem to me.
Here is the info I think is relevant:
```
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%       6152        -
# 2  Short offline       Completed without error       00%       6139        -
# 3  Extended offline    Completed: read failure       60%       6138        -
# 4  Short offline       Completed without error       00%       6125        -
# 5  Short offline       Completed without error       00%       6101        -
# 6  Short offline       Completed without error       00%       6077        -
# 7  Short offline       Completed without error       00%       6053        -
# 8  Short offline       Completed without error       00%       6029        -
# 9  Short offline       Completed without error       00%       6005        -
#10  Short offline       Completed without error       00%       5981        -
#11  Extended offline    Aborted by host               30%       5978        -
#12  Short offline       Completed without error       00%       5956        -
#13  Short offline       Completed without error       00%       5932        -
#14  Short offline       Completed without error       00%       5908        -
#15  Short offline       Completed without error       00%       5884        -
#16  Short offline       Completed without error       00%       5860        -
#17  Short offline       Completed without error       00%       5836        -
#18  Extended offline    Completed without error       00%       5815        -
#19  Short offline       Completed without error       00%       5788        -
#20  Short offline       Completed without error       00%       5764        -
#21  Short offline       Completed without error       00%       5740        -
```
I have now run a total of three long tests on sde, but it keeps stopping at 60% remaining. I still have a single offline uncorrectable and no more ATA errors, so the total stands at three; the reallocated sector count is now up to three as well.
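For what it's worth, these are the raw values I keep an eye on between tests (attribute names as smartmontools reports them on this drive; just a sketch, not the full output):

```
smartctl -A /dev/sde | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
smartctl -l selftest /dev/sde | head -n 8    # just the most recent test results
```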
I think I can assume this disk is toast, or soon will be. Does anyone else have any ideas? It seems to be degrading slowly, but I am not sure I can accept a disk that simply cannot complete a SMART long test.
I wouldn’t let that drive be used in any pool. Consider it failed or failing.
Don’t rely on ZFS to always get you out of this jam without replacing the drive.
How old is the drive? According to the Power_On_Hours, it has been running for less than a year. Surely it’s covered by warranty? Even Seagate “recertified” drives have a 1-year warranty.
Thanks @winnielinnie, I appreciate you taking the time to answer me.
I bought 4 identical drives off Amazon this March, all refurbished of course. Do you happen to know how the RMA process would work? Seagate tells me to contact Amazon, apparently - I guess I have to buy another drive in the meantime, though.
From what I understand, refurbished drives are covered only for 90 days, while manufacturer recertified are covered anywhere from 6 months to 2 years, depending on the seller.
EDIT: To make matters more confusing, “recertified” drives listed on Amazon may in fact be refurbished or “seller recertified”, but not manufacturer recertified.
I have a hunch your drives are only covered for 90 days.
That’s good news! Is Amazon requiring you to return the drive first, before issuing a refund or exchange, or will they reimburse you and let you return the drive later?
At least if you can keep this drive in the pool until getting a replacement, you might be able to prevent the pool from falling into a degraded state.
No, they want the drive first and I get the money after. I ordered a new one; I will do the resilver with both drives in, then offline and detach the failing one, and then ship it back to them.
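Roughly along these lines, if I am reading the zpool man pages right (TrueNAS does this from the UI; pool and device names here are placeholders):

```
zpool attach tank sde sdf     # add the new disk as a third member of the mirror
zpool status tank             # wait until the resilver completes
zpool offline tank sde        # then drop the failing disk out of the vdev
zpool detach tank sde
```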
Yes, I typically order a new replacement drive… and then RMA the failing one (if in warranty), as it’s better to have a failing drive than a failed or removed drive.
The same applies when replacing: if possible, leave the old drive connected while performing the replacement.
Now I am curious though - this drive seems to have just a single failing sector, which I would normally expect to be reallocated at some point, especially since it is in a mirror VDEV.
My understanding is that if this sector contained data, it would be overwritten during a scrub and therefore reallocated. Since this has not happened, I assume the sector currently does not contain any data; in fact, my pool is only 20% full.
Now, is there any way I can try to write something to that specific sector? I would like to see whether it gets reallocated, and whether a long test subsequently passes.
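Something like this is what I had in mind - only a sketch, and the LBA below is a placeholder for whatever the error log actually reports (writing the sector is obviously destructive for that sector, so only on a disk that is about to be detached anyway):

```
# read the suspect sector; a pending/uncorrectable sector will typically error out here
hdparm --read-sector 123456789 /dev/sde
# force a write to the same LBA so the firmware gets a chance to remap it
hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sde
```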
What I also do not understand is why a long test just stops and gives up: if this is a single bad sector, then surely the long test could flag it and carry on? Or does this mean that everything after that sector is FUBAR?
Thanks to both - that was actually my assumption about ZFS in a mirror VDEV:

- a scrub is started, and data and checksums are read “section by section”* from both disks
- if they match, all good
- if they do not match, ZFS tries to determine the good copy via the checksum
- if that works, the data is rewritten over the bad copy → this would trigger a reallocation
- if that fails, you got unlucky and the error is probably uncorrectable
I have to say I did have a warning for the whole pool when this all started; I triggered a scrub and it went away, but the offline uncorrectable remained. Perhaps the offline uncorrectable sector does not contain any data, and therefore ZFS never touches it during a scrub…
* I use the word “section” as in “chunk of data”; I am not sure whether ZFS works by sector, block, or whatever - not really relevant in this case, I would say.
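For what it's worth, this is roughly how I check whether a scrub actually repaired anything on the pool side (pool name is a placeholder):

```
zpool status -v tank   # the "scan:" line reports repaired data; per-disk READ/WRITE/CKSUM counters follow
zpool events tank      # lower-level event log, including checksum/IO errors seen during the scrub
```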