We’re bringing some SMART options back

Couldn’t have said it better.

I’m extremely concerned about iXsystems staff’s tone in this thread. They’re coming off as “Stupid users don’t know that SMART isn’t required and they’re noisy. Let’s placate them passive-aggressively and move on.”

Instead, they could summon the courage and effort to document precisely why SMART isn’t required and articulate their thinking, but they haven’t, and they continue to seem averse to doing so (God only knows why).

What’s easily given is easily taken away.

4 Likes

Please allow disk spindown again :sleepy_face:

Jumping in on the BMS/background scan conversation.

Background Media Scan is more common on SAS drives - at least more commonly controllable there - but it’s also present on some SATA devices. BMS is an always-running background job, like a low-priority patrol read - similar to a SMART long self-test in the sense that it checks the entire disk surface, not just allocated data. This lets it catch a potentially marginal or bad sector.

If the sector is correctable with on-platter ECC it’ll just be rewritten (if allocated) and tick up Hardware_ECC_Recovered on the disk. If there’s no data there, and the sector can’t be refreshed in place, then you might see it as a Reallocated_Event_Count increase without a matching Reported_Uncorrect.

BMS runs in-firmware on the drive, and while there’s a degree of control you can exercise with sdparm by probing specific drive pages, it’s largely left to drive firmware to handle - ZFS doesn’t get involved here.

  1. Fairly common on modern SAS and SATA drives.
  2. Runs constantly in the background once drives have been idle for longer than a vendor-defined period. Brand-new drives will be on an accelerated schedule and aggressively scan after even single-digit milliseconds of idle time; once that first full pass is done, something like 500ms of idle time is more typically required.
  3. Sometimes this can be fished out of the drive’s control page for the job specifically - otherwise, it’s just an always-on job with the results showing up as increases in the aforementioned counters.
  4. It’s basically just walking the entire LBA range, so it will “resume” from where it was and then loop around again.
  5. It’s testing each LBA to see if it’s readable. No write testing is performed.
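On SAS drives, smartmontools can read the drive’s background scan results log with `smartctl -l background /dev/sdX`. As a rough sketch, here’s how one might pull the status and progress out of that log text - note the sample output below is illustrative only, and the exact wording varies by drive firmware and smartmontools version:

```python
import re

# Sample text in the style of `smartctl -l background` on a SAS drive.
# This is an illustrative mock-up, not captured from a real device.
SAMPLE = """\
Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 12000:30 [720030 minutes]
    Number of background scans performed: 51,  scan progress: 31.52%
    Number of background medium scans performed: 51
"""

def parse_bms_log(text):
    """Extract scan status, completed pass count and progress from the log text."""
    status = re.search(r"Status:\s*(.+)", text)
    scans = re.search(r"Number of background scans performed:\s*(\d+)", text)
    progress = re.search(r"scan progress:\s*([\d.]+)%", text)
    return {
        "status": status.group(1).strip() if status else None,
        "scans_done": int(scans.group(1)) if scans else None,
        "progress_pct": float(progress.group(1)) if progress else None,
    }

print(parse_bms_log(SAMPLE))
```

If your drive doesn’t expose this log at all, that’s a reasonable hint it isn’t running BMS (or at least isn’t reporting it).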

Further to BMS is Media Pre-Scan - which is a table that’s tracking the allocated sectors to determine if it’s the first time they’re being written to. If a sector is getting a write for the first time, then a drive with Media Pre-Scan will turn it into a “write and verify” by immediately reading the data back.

We’ve always advocated for burn-in testing of drives before they’re put into use in a TrueNAS system - we do this ourselves for our own Enterprise gear as part of the build process. This means that the entire drive surface will have gotten a pass through Media Pre-Scan and any known marginal sectors would be mapped out and reallocated before data hits them.

A sector “going bad” after passing a successful self-test, and after a BMS pass is possible - but even in the scenario where it can write OK but not read, the redundancy of ZFS will protect the data there.

5 Likes

This is the most substantive answer we’ve gotten so far. I’d still like a bit more detail, but I wish we’d had this information back in November.

1 Like

That’s an awesome description of how BMS helps scan for errors on some drives.

But we need a solution that works with most, if not all, drives.

Ideally a solution where used and unused sectors get scanned for uncorrectable errors and reallocations - signals (just as badblocks provides) that a drive may no longer be ready for prime time.

For decades, ZFS + SMART short/long self-tests did that for us, warts and all. If something truly better comes along, I suggest the team document the improvement in detail - not with anecdotes about drives getting kicked out prematurely, but with hard data.
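For reference, that long-standing workflow was typically expressed as a smartd.conf line like the following - a sketch only, where `/dev/ada0` is a placeholder device and the regex is smartd’s `T/MM/DD/d/HH` scheduled self-test syntax:

```
# Monitor all SMART attributes; run a short self-test daily at 02:00
# and a long self-test every Saturday at 03:00.
# /dev/ada0 is a placeholder - substitute your own device.
/dev/ada0 -a -s (S/../.././02|L/../../6/03)
```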

Your enterprise customers are presumably sophisticated enough to review the costs and benefits of the new TrueNAS defaults versus decades of dogma about setting up a FreeNAS / TrueNAS. If you’re going to sell it as a cost-saving feature, some documentation would go a long way toward justifying it.

In the meantime, thank you for bringing GUI SMART scheduling back in future versions of TrueNAS. After the reaction from management in the feature request, the closing of the thread, etc., I didn’t think that was possible.

3 Likes

@HoneyBadger
Thank you for the detailed explanation. Reading your description of the platters being checked (more or less) constantly reminded me of a feature I used to see in DVR hard drives: a sweep of the heads every few minutes to keep platter wear even across the entire surface. That was a long time ago and I don’t know if any drives still do it - it’s not really part of this conversation, it just brought back a memory.

A quick AI search (don’t Boo me) had this to say:

Now I’m digging into some more stuff out of curiosity.

Enterprise customers buy Enterprise support from us, and after that all drives are replaced free of charge. However, drive failures and the performance impact of resilvering are still unattractive. Our aim is to minimize disruptions and make the experience very smooth. Nearly all customers have hot spares… or Z2/Z3 layouts.

For smaller users, we assume the preference is that drive replacements are minimized for cost and convenience reasons. However, data needs to be reliably stored.

For small deployments, drive replacement might be the preferred option if the drive is still under warranty. Thus any feature that masks or delays warnings about potential issues - warnings that would have resulted in a successful RMA - could be seen as counterproductive.

However, for Enterprise contracts where you have to replace the drives at your own cost, I understand the rationale of trying to minimise disruption and delay the inevitable.

6 Likes

I invite everyone, but especially Chris to join me over at Is my drive running background media scans (BMS)? How often? - TrueNAS General - TrueNAS Community Forums

That is a good point. TrueNAS should aim to simplify that process.

Replacement drives will often be refurbished, so they also need to be validated more carefully.

We do recommend burning in drives before use, but understand that many home users have neither the time nor the resources to do that easily.

Is there anything on the roadmap to perform drive burn-in testing right within the UI?

That would be a feature request… or a Community script contribution. It has to be done carefully to avoid impacting system performance.

At TrueNAS, we use an entirely different piece of test software to do this for our Enterprise appliances. It tests CPU, RAM, NICs, and drives, and looks for corner cases we discover. We run this test software first (for about 2 days) and then load TrueNAS… so the test software can evolve separately (and rapidly) from the TrueNAS software, and it isn’t responsible for keeping the NAS system functional while testing.

To test drives, we just load them into an old system and run the test software.

This process is time consuming, but necessary for Enterprise quality.

There are already community scripts to do this (e.g., dak180/disk-burnin-and-testing on GitHub), but it would be very good for this to be incorporated into the GUI.

1 Like

Feature request is best.

I would like to hear more stories from people using the script and whether they’re satisfied with it. If so, they can also vote.

The bar is high in that any feature like this needs to have low impact on a running system/pool.

I assume the desire is for a burn-in capability… not this script specifically?

I have thoughts on this (and a great deal many other things) but am waiting for the March changes to release first.

26 will no longer release in March… According to the T3 podcast, the first beta is planned for March/April, with release somewhere in the fall.

From the OP…

Announcing plans is not releasing new features…

As far as I’m concerned, yes. I’m not even concerned that it be done in the same way, so long as

  • It reads and writes every block on the disk, at least once
  • It exercises the disk heavily for an extended period of time (I’d say about a week as a minimum; I recall our formerly-resident grinch arguing for closer to a month)
  • It will report any block errors or other failures (which SMART attribute monitoring would ordinarily take care of).
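The first and third bullets can be sketched in a few lines: write a pattern to every block, read it all back, and report any block that doesn’t verify. This is a toy pass against an ordinary file so it’s safe to run - a real burn-in (e.g. `badblocks -wsv` or the dak180 script) targets the raw device, cycles through several patterns, and runs for days alongside SMART self-tests:

```python
import os

def burn_in_pass(path, size, block=4096, pattern=0xAA):
    """One destructive write+verify pass over every block of `path`.

    Returns a list of byte offsets that failed verification (empty = clean),
    similar in spirit to badblocks reporting failing block numbers.
    """
    bad = []
    data = bytes([pattern]) * block
    with open(path, "wb") as f:
        for off in range(0, size, block):
            f.write(data[: min(block, size - off)])
        f.flush()
        os.fsync(f.fileno())          # ensure the pattern actually hit storage
    with open(path, "rb") as f:
        for off in range(0, size, block):
            chunk = f.read(min(block, size - off))
            if chunk != data[: len(chunk)]:
                bad.append(off)       # record the offset of the failing block
    return bad

# One pass over a 1 MiB scratch file; an empty list means every block verified.
print(burn_in_pass("scratch.bin", 1 << 20))
```

A real tool would repeat this with multiple patterns (0xAA, 0x55, random) and track SMART attribute deltas between passes, which is roughly what the community scripts automate.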

I know it’s only anecdotal, but I’ve never noticed any performance degradation while burning in a disk. It certainly wouldn’t have any effect on any pools, as a disk you’re burning in wouldn’t yet be part of a pool. Though it would consume I/O bandwidth for the system, of course.

3 Likes

You’re taking my comments too literally.