Re‑using old HDDs that ZFS may have incorrectly flagged as errored

I have had a bit of trouble with ten 10TB IronWolf HDDs. When they were about four years old, I transferred them into a TrueNAS system built from an old PC and then started getting a couple of checksum errors.

Over the next couple of months, six of the drives degraded with errors.

Two of the HDDs were clicking and powering up and down, so I know those drives are dead. The rest of them seemed to be fine according to SMART data.

I was using long 60cm SAS cables to connect them to my HBA in IT mode. Could that be the cause of the checksum errors?

I ask because, since I moved the disks to a new purpose-built Mini-ITX TrueNAS system, none of the remaining drives have had any issues at all, and it has been a couple of years. The only differences are much shorter SAS cables and no HBA (I did need to add an NVMe-to-SATA adapter to get two extra SATA ports).

I am currently running long SMART tests on the errored drives.

The four drives that were not clicking have sat on a shelf for two years waiting to be disposed of, but given the current price of replacement drives, I want to see if they are OK and keep them as cold spares.

I purchased two more HDDs a few years ago to replace one errored drive, and kept one new, unused 10TB IronWolf as a spare.

Is this madness?

UDMA CRC errors are usually associated with bad cabling, connector oxidation, or similar issues.

I would suggest you take the drives in question and put them through a SMART long test, a full badblocks pass, and another SMART long test, using known-good cabling. Then make a decision.
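In concrete terms, something like this (a sketch, assuming the drive shows up as /dev/sda; adjust to suit):

```sh
sudo smartctl -t long /dev/sda           # start a long self-test (runs inside the drive)
sudo smartctl -a /dev/sda                # check progress, then results once it finishes
sudo smartctl -A /dev/sda | grep -i crc  # UDMA_CRC_Error_Count never resets, so a value
                                         # that has stopped climbing points at old cabling
```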


Sounds like a good plan.

I haven’t heard of the badblocks command before; I think I will use the destructive write test.

Look for threads here that describe how to test a drive prior to using it in a NAS. Some folk call it provisioning, but searching for badblocks and tmux should do the trick. Good luck.
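As a rough sketch (assuming /dev/sda; badblocks -w destroys all data on the disk, so double-check the device name first):

```sh
tmux new -s burnin            # a full pass takes days; run it inside tmux so a
                              # dropped SSH session does not kill the test
sudo badblocks -wsv /dev/sda  # destructive test: -w write patterns and verify,
                              # -s show progress, -v verbose
# detach with Ctrl-b d; reattach later with:
tmux attach -t burnin
```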

60 cm should be within spec even for SATA drives (which I assume is the case since you could move the drives to another system without a SAS HBA), but shorter is better. And you may have a bad cable—or insufficient cooling on the HBA.
It is sensible to re-test the drives with different hardware. The error flag is correct, but the fault quite possibly does NOT lie with the drives.

Here is a shell script I was not aware of that rolls all the usual disk burn-in tests into one script, courtesy of @dan (who made me aware of it) and @dak180, who wrote it. I have never used it, but it looks comprehensive. It should be part of the GUI for TrueNAS CE.

Voted, thanks!

I would vote for it, if I had votes left. Need to delete those… be right back

Trying to run badblocks, but it is throwing an error.

sudo badblocks -wsv /dev/sda
badblocks: Value too large for defined data type invalid end block (9766436864): must be 32-bit value

Any ideas what I am doing wrong?

Yes, your disk has too many blocks. Badblocks is an old tool, and can only use 32-bit numbers to count them. The workaround is to add the -b flag to tell it to use a larger “block” size: badblocks -wsv -b 16384 /dev/sda.
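To put numbers on it (the byte figure below is back-calculated from the block count in your error message):

```sh
# badblocks defaults to 1 KiB blocks, so a 10 TB drive is counted as
#   10,000,831,348,736 bytes / 1024  = 9,766,436,864 blocks   (> 2^32 - 1: overflow)
# with -b 16384 the same drive becomes
#   10,000,831,348,736 bytes / 16384 =   610,402,304 blocks   (fits in 32 bits)
sudo badblocks -wsv -b 16384 /dev/sda
```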


Thank you, it's working now :)

The first long SMART test came through with no errors on the two disks I was testing overnight.

badblocks -c 2048 -b 4096 -wvs /dev/sda is what I run; -c speeds it up a bit. I don't see much of an increase beyond 2048, though.


-b 4096 (4k blocks) is the minimum to handle large drives.
Doing multiple blocks together (-c) speeds things up… but only as far as the drive can actually handle in one go. My limited testing concurs that anything between -b 4096 -c 1024 and -b 4096 -c 2048 is the practical plateau; going higher might even be slightly detrimental.
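If anyone wants to find the plateau on their own drive without committing to a multi-day pass, a quick read-only timing run over just the first 10 GiB should do (a sketch, assuming /dev/sda; the trailing number is the last 4 KiB block to test):

```sh
# 2,621,439 is the last 4 KiB block of the first 10 GiB (2,621,440 x 4096 bytes)
time sudo badblocks -sv -b 4096 -c 1024 /dev/sda 2621439
time sudo badblocks -sv -b 4096 -c 2048 /dev/sda 2621439
time sudo badblocks -sv -b 4096 -c 4096 /dev/sda 2621439
```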


badblocks is 55 hours in, and both drives have had no errors after the first R/W pass.

Can I assume they are okay? At this rate, it will take seven more days to complete the scan.

I guess I am not desperately needing the drives at the moment, and given their age, letting the full scan finish probably would be wise.

What is the consensus?

If it were me, I’d run them through all four passes.
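For anyone following along: badblocks -w means four full write-and-verify passes, one per pattern. If time ever matters more than thoroughness, the -t flag restricts the run to the patterns you name (a trade-off, not a recommendation):

```sh
# default -w run: four write+verify passes (0xaa, 0x55, 0xff, 0x00)
sudo badblocks -wsv -b 4096 -c 2048 /dev/sda
# single-pattern run, roughly a quarter of the time:
sudo badblocks -wsv -b 4096 -c 2048 -t 0xaa /dev/sda
```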


Thank you all for your responses.

I ran badblocks for 220+ hours and it found no errors. Strangely, after badblocks finished, both drives made a lot of disk-seeking noises for about 45 minutes. AI suggests the disks were performing some internal housekeeping after such a long period of activity.

After the disks settled down, I tried to run conveyance and short SMART tests, but they got stuck at 90% on both disks.

I think it was my USB docking station throwing a wobbly, since both disks were experiencing the same problem. I know using a USB docking station is not ideal, but it was the easiest way of doing the testing with the hardware I have available.

Anyway, after a quick reboot and a power cycle of my USB hub, the short SMART, conveyance, and extended SMART tests all completed without error.
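For reference, on the command line that battery of tests looks something like this (assuming /dev/sda; let each test finish before starting the next, since starting a new one aborts the previous):

```sh
sudo smartctl -t conveyance /dev/sda  # quick check for transport/handling damage
sudo smartctl -t short /dev/sda       # ~2 minute electrical/mechanical check
sudo smartctl -t long /dev/sda        # full surface scan; hours on a 10 TB drive
sudo smartctl -l selftest /dev/sda    # view the self-test log afterwards
```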

So, after over a week of testing, both disks are just fine. To think I was going to destroy them! I am going to test the last two errored disks and then do the same with my old cold spare.

I now understand the value of rigorous testing on disks.


Two more HDDs have passed the test!

What would be the best option: keeping five HDDs as cold spares, or building a second NAS to replicate the first one?

Too personal; it depends on your budget and available parts.

If you can swing it, though, I don't see a reason not to have an actual backup. Then again, RAM is worth more than its weight in gold these days…
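If you do go the second-NAS route, ZFS replication is the natural fit. A minimal sketch with hypothetical pool and host names (tank, backup, backup-nas); TrueNAS can also set this up as a scheduled replication task in the GUI:

```sh
zfs snapshot -r tank@weekly        # recursive snapshot of the source pool
zfs send -R tank@weekly | ssh backup-nas zfs receive -duF backup   # push it to the second box
```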

As it happens, I have 64GB of DDR4 sitting unused in a drawer, and another 32GB from another unused system (the one that caused all this trouble to begin with), all made up of 16GB DIMMs.

I do already have a backup: a USB HDD to which I take a robocopy mirror, plus a Macrium Reflect files-and-directories image of the NAS, every Monday.
