We’re bringing some SMART options back

Jumping in on the BMS/background scan conversation.

Background Media Scan (BMS) is more common on SAS drives - or at least more commonly controllable there - but it’s also present on some SATA devices. BMS is an always-running background job, like a low-priority patrol read. It resembles a SMART long self-test in that it checks the entire disk surface, not just allocated data, so it can catch a potentially marginal or bad sector. If the sector is correctable with on-platter ECC, it’ll just be rewritten (if allocated) and tick up Hardware_ECC_Recovered on the disk. If there’s no data there and the sector can’t be refreshed in place, you might see it as Reallocated_Event_Count without a matching Reported_Uncorrect.

BMS runs in firmware on the drive. You can exercise a degree of control with sdparm and by probing specific drive pages, but it’s largely left to the drive firmware to handle - ZFS doesn’t get involved here.
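
As a rough illustration of the counter pattern described here, this Python sketch parses text in the column layout that `smartctl -A` prints for ATA attributes and flags the case of reallocation events with no reported uncorrectable reads. The sample table is made up for the example, and the helper names `parse_raw_values` and `quiet_reallocations` are hypothetical. Not every drive exposes these attribute names, so treat it as a starting point rather than a monitoring tool.

```python
# Illustrative sample only: these values are invented, not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x001a   080   064   000    Old_age   Always       -       1234567
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2
"""


def parse_raw_values(text):
    """Map attribute name -> integer raw value from `smartctl -A` style rows."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():  # skip raw values that are not plain integers
                values[fields[1]] = int(raw)
    return values


def quiet_reallocations(values):
    """True when sectors were remapped but no uncorrectable read was reported."""
    return (values.get("Reallocated_Event_Count", 0) > 0
            and values.get("Reported_Uncorrect", 0) == 0)


attrs = parse_raw_values(SAMPLE)
print(attrs["Hardware_ECC_Recovered"], quiet_reallocations(attrs))
```

On a real system you would feed it the captured output of `smartctl -A` for the device instead of `SAMPLE`.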

  1. Fairly common on modern SAS and SATA drives.
  2. Runs constantly in the background whenever the drive has been idle for more than a vendor-defined period. Brand-new drives use an accelerated schedule and scan aggressively after even single-digit milliseconds of idle time; once that’s done, the requirement is more typically 500 ms of idle time.
  3. Sometimes this can be fished out of the drive’s control page for the job specifically - otherwise, it’s just an always-on job with the results showing up as increases in the aforementioned counters.
  4. It’s basically just walking the entire LBA range, so it will “resume” from where it was and then loop around again.
  5. It’s testing each LBA to see if it’s readable. No write testing is performed.
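
Points 2, 4 and 5 above can be pictured with a toy Python model. This is not how any vendor's firmware is written: the class name `BackgroundScanModel`, the `budget` parameter and the 500 ms default are assumptions made for the sketch. It only shows the behaviour described: scan when idle long enough, read-only checks, resume from the last position, and wrap around at the end of the LBA range.

```python
class BackgroundScanModel:
    """Toy model of an idle-triggered, read-only patrol scan (not real firmware)."""

    def __init__(self, total_lbas, idle_threshold_ms=500):
        self.total_lbas = total_lbas
        self.idle_threshold_ms = idle_threshold_ms  # assumed; vendor-defined in practice
        self.next_lba = 0          # resume point, kept across interruptions
        self.passes_completed = 0
        self.suspect = []          # LBAs that failed the read check

    def on_idle(self, idle_ms, budget, is_readable):
        """Scan up to `budget` LBAs if the drive has been idle long enough."""
        if idle_ms < self.idle_threshold_ms:
            return 0               # host I/O is too recent, do nothing
        for _ in range(budget):
            lba = self.next_lba
            if not is_readable(lba):   # read test only, nothing is written
                self.suspect.append(lba)
            self.next_lba += 1
            if self.next_lba == self.total_lbas:
                self.next_lba = 0      # loop around and start another pass
                self.passes_completed += 1
        return budget


bad = {7}                              # pretend LBA 7 is unreadable
readable = lambda lba: lba not in bad

scan = BackgroundScanModel(total_lbas=10)
scan.on_idle(idle_ms=100, budget=4, is_readable=readable)  # not idle long enough
scan.on_idle(idle_ms=600, budget=6, is_readable=readable)  # scans LBAs 0-5
scan.on_idle(idle_ms=600, budget=6, is_readable=readable)  # scans 6-9, wraps, then 0-1
print(scan.next_lba, scan.passes_completed, scan.suspect)
```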

Further to BMS there’s Media Pre-Scan, which keeps a table tracking the allocated sectors to determine whether they’re being written for the first time. If a sector is getting its first write, a drive with Media Pre-Scan turns it into a “write and verify” by immediately reading the data back.
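
A minimal Python sketch of that first-write behaviour, assuming the simplified description above. `PreScanModel` is a hypothetical name, and a dict stands in for the platter, so the verify step can never actually fail here; a real drive would hit the physical medium and could remap the sector if the read-back failed.

```python
class PreScanModel:
    """Toy model: the first write to an LBA becomes write-and-verify."""

    def __init__(self):
        self.media = {}             # lba -> stored bytes (stand-in for the platter)
        self.written_once = set()   # tracking table of LBAs already written
        self.verified_writes = 0

    def write(self, lba, data):
        self.media[lba] = data
        if lba not in self.written_once:
            # First write to this LBA: read it straight back and compare.
            self.verified_writes += 1
            if self.media[lba] != data:
                raise IOError(f"verify failed at LBA {lba}")
            self.written_once.add(lba)


drive = PreScanModel()
drive.write(100, b"first")    # first write to LBA 100: verified
drive.write(100, b"second")   # rewrite of LBA 100: plain write
drive.write(101, b"other")    # first write to LBA 101: verified
print(drive.verified_writes)
```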

We’ve always advocated for burn-in testing of drives before they’re put into use in a TrueNAS system - we do this ourselves for our own Enterprise gear as part of the build process. This means that the entire drive surface will have gotten a pass through Media Pre-Scan and any known marginal sectors would be mapped out and reallocated before data hits them.

A sector can still “go bad” after passing a self-test and a BMS pass - but even in the scenario where it accepts a write and then can’t be read back, the redundancy of ZFS will protect the data there.
