Looking for some advice please.
I’m in the process of burning-in 2x brand new Seagate EXOS X18 18TB drives. Prior to beginning the burn-in I upgraded them both to the latest firmware (SN06), changed the logical sector format from 512B to 4KB and disabled head parking using SeaChest. They have TLER set to 10s by default.
My normal burn-in process is:
1. SMART short + conveyance + long tests
2. Four-pass (default) badblocks destructive test (using “-wsv -b 8192” parameters)
3. SMART long test
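For anyone wanting to reproduce the sequence, here is a rough Python sketch that only builds the command lines for the steps above; it does not run anything. The device path `/dev/sdX`, the function name and the optional `bad_block_log` argument are placeholders of mine. The `-o` option (which tells badblocks to write the list of bad blocks to a file) is not part of my normal parameters, I've only included it as an option because I lost the output last time.

```python
import shlex


def burn_in_commands(device, bad_block_log=None):
    """Return the burn-in steps as argv lists, in the order they are run.

    device        -- placeholder path such as /dev/sdX (hypothetical)
    bad_block_log -- optional file for badblocks' -o option (my addition)
    """
    def smart_test(kind):
        # smartctl only *starts* the self-test; each test has to finish
        # on the drive before the next step is started.
        return ["smartctl", "-t", kind, device]

    # Destructive write-mode test with the parameters quoted above.
    badblocks = ["badblocks", "-wsv", "-b", "8192"]
    if bad_block_log is not None:
        badblocks += ["-o", bad_block_log]
    badblocks.append(device)

    return [
        smart_test("short"),
        smart_test("conveyance"),
        smart_test("long"),
        badblocks,
        smart_test("long"),
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in burn_in_commands("/dev/sdX", bad_block_log="badblocks-sdX.log"):
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: the badblocks step wipes the drive, so the device path should be checked by hand first.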
The first drive is about to complete #2 with no errors reported, so looking like it’s all fine.
The second drive, however, is giving me trouble, and that’s where I’m looking for some advice.
After completing #1 the SMART data showed a handful of UDMA_CRC errors, but otherwise all good.
However, it then failed the badblocks test in #2, reporting a massive number of errors during the “reading and comparing” phase of the first (0xaa) pass. Stupid me didn’t record the output, but I believe they were all read failures, and there were enough errors that badblocks stopped itself (I can at least say that the list of bad block IDs exceeded my scrollback window).
I then ran another SMART long test on it in preparation for an RMA - only for SMART to show no disk issues. Another handful of new UDMA_CRC errors, but otherwise fine.
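Since the UDMA_CRC count is the one thing that keeps changing, a small sketch like the one below could track how much it grows between runs. It assumes the usual `smartctl -A` attribute table layout, where the raw value is the tenth whitespace-separated column. The function name and both sample lines are made up for illustration and are not my drive's actual output.

```python
def udma_crc_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


# Hypothetical sample rows (values invented for the example).
SAMPLE_BEFORE = (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000"
    "    Old_age   Always       -       4"
)
SAMPLE_AFTER = (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000"
    "    Old_age   Always       -       9"
)

before = udma_crc_count(SAMPLE_BEFORE)
after = udma_crc_count(SAMPLE_AFTER)
print("new CRC errors during this run:", after - before)
```

Saving the real `smartctl -A` output before and after each badblocks pass and comparing the two would show whether new CRC errors line up with the passes that reported read failures.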
So I tried another badblocks pass - this time it reported a small number (six) of read errors during the initial (0xaa) pass, and then no additional errors during the subsequent two complete passes (0x55 and 0xff). I stopped badblocks during the fourth pass so I could remove / re-seat the drive.
Right now it’s running another badblocks pass following the remove / re-seat - no errors yet, but it’s only a few percent into the first pass.
I’ll also add that the drive appears to be typically performant (i.e. it’s not running slow - badblocks is taking about as long to run a pass on this drive as it does on a completely healthy/normal drive).
Any thoughts on whether this is likely a bad drive, or if it’s more likely that I experienced some sort of strange TrueNAS / firmware / etc. glitch? It feels a bit more like the latter to me, given that these were all read errors, and that the extremely large number on the first run was followed by a handful on the second and then none after that, combined with no drive issues reported by SMART.
In an ideal world I’d just RMA the drive, but with SMART data reporting no issues I’m going to have an uphill battle on my hands with the retailer. Especially if I can’t get badblocks to produce consistent results showing failures.
Thanks.