Include disk burn-in in TrueNAS GUI

Problem/Justification
Infant mortality of hard disks. We’re all familiar with the so-called “bathtub curve,” which illustrates that there’s a fairly significant risk of device (in this case, disk) failure early in its operational life. Aggressive testing before putting disks into production weeds out marginal disks before data is committed to them.

Impact
User-created guides (e.g., fester:hvalid_hdd [Dan's Wiki])[1] have long recommended a burn-in regimen be performed at the shell before putting disks into production. Some users have developed scripts to partially automate this process (e.g., GitHub - dak180/disk-burnin-and-testing: Shell script for burn-in and testing of new or re-purposed drives). Including it in the TrueNAS GUI both simplifies the user experience and elevates this recommendation to “official” status.

The obvious drawback is that if the user does this before putting disks into production, it will delay putting those disks into production. But that isn’t a bug, it’s a feature. It shouldn’t affect pool performance in any way, because disks being burned in aren’t yet part of a pool. It will consume system I/O resources, but I wouldn’t expect that to affect overall system performance on anything but the lowest-spec systems.

User Story
I can imagine a couple of possible ways this could appear in the TrueNAS UI:

  • (Preferred) include a checkbox when creating a pool, adding a vdev to the pool, or replacing a disk in a pool, to burn in the disk(s) first. This would include a prominent notification indicating how long it’s expected to take, and that the disk/pool will not be available until this is complete. On pool creation, this should probably include the option to exclude SSDs.
  • As an item in a separate “Utilities” menu. In this case, the user would run this routine manually before creating the pool or adding the disk(s) to the pool. This would be nice to see in addition to the above, but if only one is implemented, I’d strongly prefer the above.

The “traditional” method of burn-in testing has been a long SMART self-test, followed by a full badblocks run (which writes, then reads, every block on the disk four times), followed by another long SMART self-test. I’m not married to this exact process, but I’d think something similarly rigorous would be required.
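
For reference, something like this is what that process looks like at the shell today (the device name is a placeholder, the 4K block size is an assumption for modern drives, and the whole thing is destructive):

```
# 1. Long SMART self-test; check the result once the drive reports it has finished
smartctl -t long /dev/sdX
smartctl -a /dev/sdX

# 2. Destructive write/read test of every block (badblocks defaults to four patterns)
badblocks -b 4096 -wsv /dev/sdX

# 3. Second long SMART self-test, to see whether the write passes shook anything loose
smartctl -t long /dev/sdX
smartctl -a /dev/sdX
```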

@Captain_Morgan suggested this feature request, so here it is. Let your votes be heard, everybody!


  1. I feel the need to point out that while I host Fester’s guide, I didn’t write it, though I haven’t seen Fester around for a long time. ↩︎

11 Likes

As promised, here’s my vote

1 Like

I would add a vote, but I have 0 left, and some of the accepted or already implemented feature requests hog votes and are not getting released :frowning:

That’s kind of surprising–perhaps the threads weren’t closed? If not, you should be able to “take back” your votes on those threads.

Yeah, I tried to take it back, but it wouldn’t let me on closed features

My votes are counting correctly. Have you tried checking the number of votes on an active FR you voted for? Click the VOTE button and it should tell you used / total and give you the option to take back a vote.
Closed ones are only showing who voted, but the votes are not taken up

weird, cleared browser cache and now it worked, added my vote!

1 Like

I’m new to FRs on this forum but I find the vote system stupid. If one can’t get “vote credits” beyond the hard limits, then they’re a super precious commodity and I don’t want to squander them (could argue that’s working as intended) but … it’s just dumb. Never seen this on any other FR voting system.

It’s using a price system in a situation where a price system doesn’t make sense, because everything has the same price.

I think there’s merit to this FR. I have opinions, but probably not worth detailing here (yet). I will say, though, that I much prefer the second option: that this should be in the Utilities menu.

That is pretty much working as designed–the stated intent is to force users to prioritize their votes. And it makes some sense: if people can vote for an unlimited number of feature requests, the votes don’t show how important a given FR is, or to how many people. It’s kind of a crude way of prioritizing, but it’s something.

2 Likes

Works fine on every other FR system I’ve seen.

I’d argue that it’d need a lot of training wheels - checks to prevent it from being run on disks in active pools, standard warnings about all data on the drive being gone, warnings on the time it’d take, etc. etc.

I’m worried about badblocks being run, via a GUI checkbox, by someone who can’t/won’t put in the five minutes of effort it takes to run it in tmux. Seatbelts will be required. I hope this doesn’t come across as gatekeeping or elitism or whatever, but with improper limits this can be equivalent to removing the French language pack.
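
For anyone wondering, the manual route I’m talking about is only a couple of commands (session and device names are placeholders):

```
tmux new -s burnin               # keeps the run alive if your SSH session drops
badblocks -b 4096 -wsv /dev/sdX  # destructive! detach with Ctrl-b d, reattach with: tmux attach -t burnin
```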

Concerns aside - voted!

1 Like

My preferred version would handle that automatically, because it would only be available for disks that are being added to a pool, whether to create a new pool, to add a vdev, or to replace a disk. The system already doesn’t show disks that are part of an existing pool for any of those purposes, and it already warns that all data on added disks will be lost. But yes, it’d need a fairly prominent notice that this will take some time, and ideally it would give a reasonable estimate of how much time that would be.

If it’s something that’s just available in a “Utilities” menu, then yes, more seat belts/warnings would need to be coded.

2 Likes

ZFS is always great at giving time estimates :smiley:

Edit: I agree with Dan’s response - this was just meant to be a joke

But it wouldn’t be a ZFS thing at all. Assuming they implement the “normal” process of long SMART/badblocks/long SMART again, the disk itself will report an estimate of how long the SMART test will take, and you can pretty closely estimate how long the badblocks run will take based on the disk’s capacity and rotational speed.
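
As a back-of-the-envelope illustration (the 20 TB size and ~200 MB/s average throughput are just assumed round numbers):

```
20 TB ÷ 200 MB/s ≈ 100,000 s ≈ 28 hours per full pass
badblocks default = 4 write + 4 read passes ≈ 8 × 28 h ≈ 9–10 days
```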

1 Like

As I said in the original thread discussing this, I think a non-destructive option (as with the badblocks -n switch) would be good, to allow tests on existing pools. But I agree that the most important one would be the optional burn-in when adding a disk. And yes, I also think it would require a lot of information/warnings for the user. Maybe even a bit more than the “type to confirm” prompt used when deleting datasets.
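
For clarity, that’s the same tool run in its non-destructive mode (device name is a placeholder):

```
badblocks -nsv /dev/sdX   # non-destructive read-write mode: restores the original data after each pattern
```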

First of all, I voted for this Feature Request with my valuable votes. I only vote on things that I can see would provide a real benefit to everyone, not just a small handful of people. This is one of those where everyone would benefit.

I don’t agree with this specific statement. The reason is that when you are adding a disk to a pool, you have added a huge time constraint, which could be well over a week for those huge 24+ TB drives. Let’s not forget that 32+ TB drives are out there as well. I personally would not like it, and any corporation would hate it. And if this were a drive that the system is automatically replacing due to a drive failure, who wants to wait for the spare drive to pass a lengthy test before it could resilver in as a replacement?

How did I come up with weeks?

  1. A SMART Long/Extended test can take well over 64 hours to complete, and much longer for those huge drives that more and more people are purchasing these days. For argument’s sake, let’s just call it 64 hours (just over 2 1/2 days).
  2. badblocks is not the same as a SMART test: it uses the computer/CPU to write a pattern to the drive, then read the pattern back. There are 4 patterns, so that’s 8 full passes over the disk, which will take just over 8 days to complete, barring any problems or bottlenecks.
  3. The grand total:
    a: SMART Extended test = 2.5 days
    b: badblocks test = 8 days
    c: Hypothetical time on a 24TB drive = 10.5 days

My solution: do not make it a fully automatic process:

  1. Have a manual option in the GUI to “Run Burn-In Test”, where the list of drives would only include drives that are not assigned to an active pool. This means that an old drive that was previously in a pool could be tested. It would also mean a person could destroy a pool drive that they had only placed offline, but it would warn you with the name of the pool it came from and ask if you are sure you want to do this. You can’t solve every issue, but you can put reasonable safety measures in place.
  2. When the system recognizes a newly installed drive that does not belong to a pool, a notification informs the user that a new drive is available for a burn-in test. If the newly recognized drive was previously part of a pool, same thing as above: list the pool name and give the user a warning.
  3. A nice touch would be if, after a burn-in test succeeds, the drive were marked by writing some type of data to it, maybe a single small file stating the date/time the test was completed. Think of it as a “README.txt” file that TrueNAS could read to check whether the drive was tested (a made-up example follows this list).
  4. Assume a readme.txt file is on two spare drives in the system and you have two other spare drives that have not been tested. TrueNAS could now select a tested drive to replace a failing drive, and leave the untested drives alone, unless they are the only drives left.
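
To illustrate item 3, the marker could be as simple as a tiny text file; the format and field names below are entirely made up, just to show the idea:

```
# burnin-result.txt -- hypothetical marker written after a successful burn-in
serial: ABC12345
completed: 2025-06-01 14:32 UTC
tests: SMART long, badblocks (4 patterns), SMART long
result: PASS
```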

I am only laying this out as something I think might work and cause little interference for someone who needs to replace a drive fast, while also allowing a person to burn in a drive. If TrueNAS could tag the drive as already tested, that would be a nice touch, but I doubt that will happen.

But I do think having an easy method to run these tests would be beneficial to everyone, even if I sort of complicated things with my opinions.

2 Likes

This would be a great option to have. It performs a read of the sector, writes a test pattern, reads the test pattern back, then writes the original data back to the sector. It is not as thorough as a full-blown burn-in test, but if you offer it up to people, they may skip the complete burn-in.

But it would be a nice option; it’s better than nothing if a user had to choose between the two due to time constraints.

1 Like

From what I know, the non-destructive test takes ages, so I don’t know if it would be a time saver. But I have never compared the same disk using different methods.

I was going to wait until the announcements around SMART were made, but seeing this feature request is still “hot” I will make my comments here and now.

Utility vs Replacement Pipeline

First, I think this should be a utility and not something that is done during disk replacement. KISS. These are different operations. A replacement is done to replace a disk. A burn-in is done to burn in a disk. Separate the two functions out for now.

Once we have the burn-in feature? Yeah, then I could see a “pipeline” of disk burn-in and then auto-replacement, but I think we should decouple these functions for now until such a function is proven. After all, we’re debating the merits of a full write test which is a destructive action.

Methodology

I’ve put (some) thought into this. Here’s my pitch. Again, I’m an idiot. I’m not a storage engineer. Just one perspective. Here’s one “algorithm” I’d recommend discussing (with a rough shell sketch of the core steps after the list).

When I write “job” below, think “burn-in test”.

  1. User manually calls a burn-in test on a specific disk. User consents to all warnings/disclaimers.

  2. TN creates a job/database entry somewhere recording that the disk with that S/N is blocked from other tasks/use and is subject to a burn-in test.

  3. TN creates a single-drive stripe vdev and ZFS pool (foo) with that drive.

  4. TN creates a dataset in the foo pool called bar and gives it a reserved amount of space (not massive, a few MB ought to do).

  5. TN creates a zvol in the foo pool called baz: sparse, forced past the 80% capacity warning, and the same size as the raw disk (if engineers want to fuss over the exact size of baz to avoid error conditions, for reasons that will become clear later, go ahead).

  6. TN tracks the above creations in the job/database entry.

  7. TN starts a script/process whose core function is to run ddrescue -f /dev/urandom /dev/zvol/foo/baz /mnt/foo/bar/mapfile

  8. TN waits for one of several things:

    1. (A) the ddrescue process to complete (error out due to no free space) or…
    2. (B) ZFS to alert of issues on the foo pool or…
    3. (C) SMART monitoring to trigger alerts
  9. If there is evidence of issues in 8B or 8C, the job terminates early in a failure state, interrupts the ddrescue process if applicable, and alerts the administrator. The administrator has the option to ‘clean up’ the foo pool within the TN UI.

  10. If 8A completes successfully (out of space condition), TN automatically begins a scrub of the foo pool.

  11. If the foo scrub encounters read/checksum/etc errors, TN interrupts/cancels the scrub (to not waste further resources/time), terminates the job, and alerts the administrator. The administrator has the option to ‘clean up’ the foo pool within the TN UI.

  12. If the foo scrub completes successfully (no checksum/read errors), TN completes the job and alerts the administrator (just like any other scrub completion). The foo pool, bar dataset, and baz zvol are all automatically deleted. The S/N is removed from the block list in step 2.
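
To make steps 3–7 and 10 a bit more tangible, here’s a rough shell sketch. The device path and sizes are placeholders, the real thing would obviously go through the middleware, and /mnt assumes TrueNAS’s usual altroot:

```
zpool create -R /mnt foo /dev/disk/by-id/ata-PLACEHOLDER          # step 3: single-drive stripe pool
zfs create -o reservation=16M foo/bar                             # step 4: small dataset to hold the mapfile
zfs create -s -V 20T foo/baz                                      # step 5: sparse zvol roughly the size of the raw disk
ddrescue -f /dev/urandom /dev/zvol/foo/baz /mnt/foo/bar/mapfile   # step 7: fill the zvol with random data
zpool scrub foo                                                   # step 10: read everything back and verify checksums
```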

Benefits

  1. Zero reliance on SMART. No need to trust the vendors.

  2. It uses ZFS. We benefit from mathematical checksums. We know if the disk is lying. For the purposes of a burn-in test, we don’t care about reconstructing data; we just need to detect faults.

  3. It’s a nearly full write and read test of the disk. (Yeah there’s a small amount missed, but then you’re squabbling over incredibly small portions of disks.)

  4. It’s random data and more sophisticated than a simple 0x00 or 0xFF write.

  5. It’s fully interruptible/resumable, even in the event of power loss.

    1. Scrubs can be paused/resumed on graceful shutdown/startup. Scrubs also checkpoint their progress in the event of an ungraceful shutdown.
    2. Ddrescue’s mapfile allows for the same “checkpointing” mechanism in the event of ungraceful shutdown. How often it updates the mapfile is configurable (--mapfile-interval).
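
To be concrete about the resumability claim: if the run is interrupted, re-running the exact same command against the same mapfile picks up where it left off (paths as in step 7 above):

```
ddrescue -f /dev/urandom /dev/zvol/foo/baz /mnt/foo/bar/mapfile   # the mapfile is the resume state
```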

Downsides

  1. This is a write and read test which some users may not like.

  2. It’s computationally expensive as we’re fetching random data and there’s no rate limiting.

    1. I’m sure the smart people here could think of 100 different ways to rate limit this. The “KISS” approach could simply be to let ddrescue “let 'er buck” and if TN observes system pressure, it can interrupt the ddrescue process and restart it later. Again, it’s fully interruptible.
      1. Edit: I just realized ddrescue has a built-in --max-read-rate parameter. That might work.
  3. This will take a very long time, depending on disk size.

  4. I give no consideration above to “multiple passes”. This could perhaps be a UI option when the job is created.


inb4 “AI slop post” - markdown existed looooong before genAI.

2 Likes

Yes, it could easily be over a week (possibly over two weeks), which is part of why I suggest it be an option when adding disks to a pool. If you’ve already tested the disks, or you otherwise just don’t want to, you don’t have to–and it should be clear to the user that the burn-in, if chosen, will take $ESTIMATED_TIME, and the pool/disk won’t be available until it’s finished.

If the system is automatically replacing a disk, the replacement is already part of the pool as a hot spare. In my imagined scenario, it would (or could) have been burned in before being added to the pool. In no case would the system automatically burn in a disk without user input.

As to the mechanics of how it’s done, that’s probably a bit beyond my expertise. What I regard as essential for this FR is (in rough order of priority):

  • TrueNAS ship with some form of burn-in utility
    • The FR is titled “in TrueNAS GUI,” but heck, even an included script that does the job would be an improvement.[1] An option on the console menu would be better. But I’d very strongly prefer it be somewhere in the GUI.
  • The test needs to be a write test, with at least two patterns[2]–it will write a pattern to every block on the disk, then read back every block, raising an error if there’s any mismatch, and then do the same with a second pattern. badblocks (by default) does four passes with four patterns (0x00, 0xff, 0xaa, and 0x55, IIRC), and I’d lean that way, but I’m open to counterarguments (example invocations follow this list).
  • However it’s done, any failures need to raise errors in the GUI and otherwise through the alerts system.
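
To put the pattern question in concrete terms, here’s roughly what the two invocations would look like (the device is a placeholder, and both are destructive):

```
badblocks -b 4096 -wsv /dev/sdX                   # default write-mode test: four patterns (0xaa, 0x55, 0xff, 0x00)
badblocks -b 4096 -wsv -t 0xaa -t 0x55 /dev/sdX   # limited to two explicit patterns, roughly half the runtime
```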

Less important is that it be an option when adding a disk to a pool, but I think it’s a very logical thing to offer. And offering it there means the user already understands (in theory) that it’s going to be destructive (because adding a disk to a pool always is), and TrueNAS already prevents you from selecting disks that are part of an active pool when creating a new pool/expanding or adding a vdev to an existing pool/replacing a disk. But I agree there’s value in being able to do this at times other than when adding a disk to a pool.


  1. So long as it’s a comprehensive burn-in script, and it’s documented. @dak180’s script qualifies IMO. The combination of smartctl and badblocks that ship with TrueNAS does not. ↩︎

  2. In the event there’s some kind of glitch in the drive that causes it to always return the same contents for a block, this will catch that sort of error. ↩︎

1 Like