Badblocks read errors / no SMART errors

Looking for some advice please.

I’m in the process of burning-in 2x brand new Seagate EXOS X18 18TB drives. Prior to beginning the burn-in I upgraded them both to the latest firmware (SN06), changed the logical sector format from 512B to 4KB and disabled head parking using SeaChest. They have TLER set to 10s by default.

My normal burn-in process is:

  1. SMART short + conveyance + long tests
  2. Four-pass (default) badblocks destructive test (using “-wsv -b 8192” parameters)
  3. SMART long test
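
For anyone wanting to script this, here is a rough Python sketch of the three steps above. It only builds the command lines and does not run anything. The device path `/dev/sdX` and the function name `burn_in_commands` are placeholders, not something from my actual setup; the badblocks parameters are the ones quoted in step 2, and the SMART tests use smartctl's `-t short` / `-t conveyance` / `-t long` selectors.

```python
from typing import List


def burn_in_commands(device: str) -> List[List[str]]:
    """Return the burn-in command lines in order, without running them."""
    # Step 1: SMART short, conveyance and long self-tests.
    commands = [["smartctl", "-t", test, device]
                for test in ("short", "conveyance", "long")]
    # Step 2: destructive write/read test (default four patterns).
    # WARNING: -w overwrites everything on the device.
    commands.append(["badblocks", "-wsv", "-b", "8192", device])
    # Step 3: a final SMART long self-test.
    commands.append(["smartctl", "-t", "long", device])
    return commands


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the real device before use.
    for cmd in burn_in_commands("/dev/sdX"):
        print(" ".join(cmd))
```

Note that `smartctl -t` only starts a self-test in the background, so a real script would have to wait for each test to finish before issuing the next command; the sketch leaves that out.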

The first drive is about to complete #2 with no errors reported, so looking like it’s all fine.

The second drive I’m having trouble with however, and where I’m looking for some advice.

After completing #1 the SMART data showed a handful of UDMA_CRC errors, but otherwise all good.

It then however failed the badblocks test in #2, reporting a massive number of errors during the “reading and comparing” phase of the first (0xaa) pass. Stupid me didn’t record the output, but I believe they were all read failures, and it was enough errors that badblocks stopped itself (I can at least say that the list of bad block IDs exceeded my scrollback window).

I then ran another SMART long test on it in preparation for an RMA - only for SMART to show no disk issues. Another handful of new UDMA_CRC errors, but otherwise fine.

So I tried another badblocks pass - this time it reported a small number (six) of read errors during the initial (0xaa) pass, and then no additional errors during the subsequent two complete passes (0x55 and 0xff). I stopped badblocks during the fourth pass so I could remove / re-seat the drive.

Right now it’s running another badblocks pass following the remove / re-seat - no errors yet, but it’s only a few percent into the first pass.

I’ll also add that the drive appears to be typically performant (i.e. it’s not running slow - badblocks is taking about as long to run a pass on this drive as it does on a completely healthy/normal drive).

Any thoughts on whether this is likely a bad drive, or if it’s more likely that I experienced some sort of strange TrueNAS / firmware / etc. glitch? It feels a bit more like the latter to me, given these were all read errors + the extremely large number on the first pass followed by a handful on the second followed by none after that, combined with no drive issues reported by SMART.

In an ideal world I’d just RMA the drive, but with SMART data reporting no issues I’m going to have an uphill battle on my hands with the retailer. Especially if I can’t get badblocks to produce consistent results showing failures.

Thanks.

UDMA_CRC is >90% chance of wiring fault/controller fault imo.

Reseating cables, replacing the SATA data cable (with a shorter one when possible), ensuring proper cooling on the HBA, etc. are basically the first troubleshooting steps for this.
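
To tell whether a reseat or cable swap actually helped, one option is to compare the raw UDMA_CRC_Error_Count (SMART attribute 199) before and after a test run. The counter is cumulative, so what matters is whether it keeps growing. Below is a minimal Python sketch that pulls the raw value out of `smartctl -A` style text; the function name `udma_crc_count` and the sample line are made up for illustration, and it assumes the usual ten-column attribute table with a plain integer raw value.

```python
from typing import Optional


def udma_crc_count(smartctl_attrs: str) -> Optional[int]:
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Attribute rows have ID in the first column and RAW_VALUE in the tenth.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


if __name__ == "__main__":
    # Illustrative sample line only, not output from the drive in this thread.
    sample = ("199 UDMA_CRC_Error_Count    0x003e   200   200   000    "
              "Old_age   Always       -       12")
    print(udma_crc_count(sample))
```

If the value is the same before and after a full badblocks pass, the link is probably clean; if it keeps climbing, look at the cable, backplane or HBA again.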

You’ve now done THREE consecutive correct things; you burned in the drives, upon discovering a fault you took steps to mitigate, and you’ve started tests again after the mitigation steps.

IMO; drive is very likely fine. Wait for the full run of badblocks, but you prolly fixed it on the reseat.

I guess so.

Contact support and try to RMA. In my experience, Seagate support is very inconsistent. Sometimes they simply accepted the RMA, sometimes I had to show them the output of SeaTools.

SMART is only an OK tool for estimating drive failure. Just because there are no SMART errors does not mean that the drive is healthy.

IMHO it does not have to be consistent; the fact that you had errors in both runs is reason enough to RMA. Especially considering you don’t have this issue with another drive on the same system (and probably same controller?).
