Why?
For the reason I state in the footnote that immediately follows that statement.
My B, somehow didn’t catch that footnote. I don’t understand what you’re trying to get at with a drive returning the same data. I’d also suggest that perfection is the enemy of the good on that one: simply allowing the user to run a second burn-in test (maybe with a dropdown of which pattern to test) probably makes more sense and helps address the “burn in testing takes too long on big disks” crowd.
The bottom line to me is that’s how badblocks tests, and that’s the reason I’ve seen stated for it: if a defect were to cause a disk to “read” 0x00 in a block, no matter what was actually there, a write test writing 0x00 won’t catch that, but a write test writing anything else would.
Is this likely? I don’t have any idea. I assume so, since they designed the tool that way, for that reason, but I don’t have any data to address it one way or the other.
The other point is that “taking longer” is a feature, not a bug. Part of the point of this procedure is to stress the disk for an extended period of time. How long is a matter of some debate, but “less than a day” probably isn’t long enough.
Why test at all with 0x00 then? Why not 0x55 alone or 0xff then? Or does the same problem persist regardless of pattern?
If badblocks can’t reliably catch a fault by a disk, why would we use it?
This is a very good discussion.
I hope this feature request is approved and results in something similar to what is being discussed. I’d like to see “options” in the GUI so a user can define the type of testing to be performed, but lacking those options, I’d lean towards a default of the four badblocks patterns as a minimum.
And @dak180 has had this script posted on GitHub for a while, so it has been tested and seems to be trusted by many. I personally have not used it yet, but only because I’m moving away from spinning drives. However, I read the script last year and it looked really good to me. If I do purchase a spinning drive, I will test the script then. But implementing similar features would probably be the best outcome.
Keep in mind that this FR may conflict with the iXsystems development path. They are thinking more about Corporate (who pays the bills) usage. A corporation that purchases HDDs by the case is more than likely not going to use this kind of feature. It requires time and, as we all know, time is money. Just an opinion.
If it’s possible that a disk could return 0x00 regardless of what’s actually on the disk, isn’t it equally possible that it could return something else? The failure mode I’m concerned with is one where the disk always returns the same data for one or more blocks, regardless of what’s actually written there. Whether it returns 0x00, 0xff, or something else, if what it returns happens to match the test pattern, then it will pass the test, despite actually failing.
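To make that failure mode concrete, here is a minimal sketch in Python. The `StuckBlockDisk` class and `write_verify_pass` function are hypothetical names invented for this illustration; this is a toy in-memory model, not how badblocks is implemented.

```python
class StuckBlockDisk:
    """Toy in-memory disk where some blocks always read back one fixed byte."""

    def __init__(self, num_blocks, block_size, stuck_blocks, stuck_value):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks
        self.stuck_blocks = set(stuck_blocks)
        self.stuck_value = stuck_value

    def write(self, index, data):
        # The write "succeeds" even on a stuck block; the fault only shows on read.
        self.blocks[index] = data

    def read(self, index):
        if index in self.stuck_blocks:
            return bytes([self.stuck_value]) * self.block_size
        return self.blocks[index]


def write_verify_pass(disk, pattern):
    """Write `pattern` to every block, read everything back, return failing indexes."""
    expected = bytes([pattern]) * disk.block_size
    for i in range(len(disk.blocks)):
        disk.write(i, expected)
    return [i for i in range(len(disk.blocks)) if disk.read(i) != expected]


disk = StuckBlockDisk(num_blocks=8, block_size=16, stuck_blocks=[3], stuck_value=0x00)
print(write_verify_pass(disk, 0x00))  # pattern matches the stuck value: nothing flagged
print(write_verify_pass(disk, 0xFF))  # a different pattern exposes block 3
```

Whatever value the block is stuck at, a second pass with a different pattern cannot match it too, which is the argument for at least two passes.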
Because it tests in the way I propose, for the reason I propose, which is why I propose it.
Yeah, I don’t really have a whole lot of faith in the whole FR system at this point, but Morgan specifically asked for one, so here we are. I’m honestly a little surprised it hit 10 votes within a few hours, and the current tally indicates there’s a decent amount of community interest in the future.
If only there were a filesystem that could tell us every time when a disk wasn’t returning the data it was originally given regardless of what data was written… /s
Would you be open to a hybrid of your badblocks idea + my pitch of using a temporary pool/zvol for “containing” the test? I’m not sold on random data/patterns being required myself. Again, not a storage engineer.
Come to think of it I’m not certain a zvol is strictly required, I think ddrescue is fine to operate on files as well, I just have no experience using it like that.
I’m not married to any particular method of doing this, so long as it includes as an absolute minimum writing every block of the disk and reading it back to confirm it’s correct (and a strong preference to do it at least twice with different patterns). If I’m doing it from the shell, I’m using badblocks, but there are no doubt dozens of ways it could be done; I’d leave it to someone more up on the relevant technology to determine the most appropriate way to implement it.
This is exactly the kind of addition to TrueNAS that I find useful and would actually use. I’ve been running servers for quite some time and currently operate two systems with 12 HDDs each. I’ve already been bitten twice by the bathtub curve, so having a built-in disk burn-in workflow in the TrueNAS GUI would be very helpful. I’ve added my vote — thanks to everyone working on this and supporting the request.
Reminder: badblocks was NOT designed to burn-in drives.
Badblocks is a legacy from the time when drives were dumb devices and could not even detect when they had grown bad sectors and manage the issue by themselves. It has been repurposed for burning in, despite being long past its “Best use before…” date.
I know people who buy used/refurbished hard drives which have a 3 to 5 year warranty and they do have failures. Might be within a few hours or a few days. It would be smart to perform some active test before putting them online for use. So, I agree with you.
Agreed - though I do like it for its new purpose, even if it is only because I’m too lazy to learn anything else or too stupid to imagine & implement a better solution.
Agreed.
We burn-in all our own disks. We generally find statistical approaches work with any current drives. We prefer the user (ourselves) can specify the time the tests will run for. Most people have finite time to get tasks done.
We recommend a crawl-walk-run approach. First, let’s find a script and algorithm that the Community agrees is a good, efficient test that DOES catch issues on single drives more often than SMART. Ideally, the script doesn’t care whether it’s an HDD or SSD. It should provide a simple report on test coverage and drive health. (My recommendation is that the user can specify the duration of the test in hours.)
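As a rough illustration of a duration-limited test with a coverage report, here is a minimal Python sketch. The `timed_verify` function and its parameters are hypothetical, the "disk" is just a list of byte strings, and a real tool would read from the block device instead.

```python
import time


def timed_verify(read_block, expected_block, num_blocks, budget_seconds,
                 clock=time.monotonic):
    """Verify blocks in order until finished or the time budget is used up."""
    deadline = clock() + budget_seconds
    checked = 0
    bad = []
    for i in range(num_blocks):
        if clock() >= deadline:
            break  # out of time: report partial coverage instead of failing
        if read_block(i) != expected_block(i):
            bad.append(i)
        checked += 1
    coverage = checked / num_blocks if num_blocks else 0.0
    return {"checked": checked, "total": num_blocks,
            "coverage": coverage, "bad_blocks": bad}


# Toy usage: 1000 blocks of zeros with one corrupted block, budget given in hours.
blocks = [bytes(512) for _ in range(1000)]
blocks[42] = b"\x01" + bytes(511)
hours = 1
report = timed_verify(lambda i: blocks[i], lambda i: bytes(512),
                      len(blocks), hours * 3600)
print(report)
```

The `clock` parameter exists only so the time source can be swapped out when testing the sketch.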
It’s another challenge to determine that the test is at all reliably predictive of future failures.
Our own test software doesn’t run in TrueNAS and is instead its own application on a dedicated OS.
If the testing is unrelated to ZFS, then any Linux developer can create a solution.
A long time ago, I thought about writing a random block reader or writer, tester program. This was intended for HDDs.
In read only mode, it would randomly select a group of sectors to read. Then perform the read and log any failure to read. Next, record this group of sectors in a bit map of the HDD such that we don’t bother re-reading that same group of sectors. Last, randomly select another group of sectors. If a sector group comes up that has already been read, simply use either the next sector group or prior sector group, until there are no more sector groups to read.
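The selection logic described above could be sketched like this in Python. The function names are hypothetical, a `bytearray` stands in for the bit map, and the fallback tries later groups before earlier ones, which is one possible reading of "next or prior".

```python
import random


def _nearest_unvisited(visited, start):
    """Walk forward from `start`, then backward, to find a group not yet read."""
    for i in range(start + 1, len(visited)):
        if not visited[i]:
            return i
    for i in range(start - 1, -1, -1):
        if not visited[i]:
            return i
    raise ValueError("no unvisited groups left")


def random_coverage_order(num_groups, seed=None):
    """Yield every sector-group index exactly once, in a randomized order."""
    rng = random.Random(seed)
    visited = bytearray(num_groups)  # one flag per sector group
    remaining = num_groups
    while remaining:
        pick = rng.randrange(num_groups)
        if visited[pick]:
            pick = _nearest_unvisited(visited, pick)
        visited[pick] = 1
        remaining -= 1
        yield pick


order = list(random_coverage_order(20, seed=1))
print(order)
print(sorted(order) == list(range(20)))  # every group visited exactly once
```

The linear neighbour scan gets slow near the end of a very large disk; a real implementation would probably want a smarter structure, but the coverage guarantee is the same.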
In write then read mode, it would randomly select a sector group to write. The data written would be based on the sector group’s initial offset, and otherwise unique but repeatable data. Log any write errors. Then record the write in a bit map of the sector groups. When fully written, perform the read, similar to the read only mode above, but comparing the actual data to what is expected.
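A minimal Python sketch of the write-then-read idea follows, assuming the "unique but repeatable" data comes from hashing the group's byte offset. That choice, and the function names, are assumptions made for this example. An in-memory `io.BytesIO` stands in for the device.

```python
import hashlib
import io
import random


def expected_data(group_offset, group_size):
    """Repeatable data for a sector group, derived only from its byte offset."""
    seed = group_offset.to_bytes(8, "little")
    return hashlib.shake_128(seed).digest(group_size)


def write_groups(dev, order, group_size):
    """Write each group's expected data, in the given (random) order."""
    for g in order:
        offset = g * group_size
        dev.seek(offset)
        dev.write(expected_data(offset, group_size))


def verify_groups(dev, order, group_size):
    """Re-read each group and return the groups whose data does not match."""
    bad = []
    for g in order:
        offset = g * group_size
        dev.seek(offset)
        if dev.read(group_size) != expected_data(offset, group_size):
            bad.append(g)
    return bad


group_size, num_groups = 4096, 16
dev = io.BytesIO(bytes(group_size * num_groups))
order = random.Random(0).sample(range(num_groups), num_groups)
write_groups(dev, order, group_size)

# Flip one byte in group 5 to mimic a device returning the wrong data.
dev.seek(5 * group_size)
original = dev.read(1)
dev.seek(5 * group_size)
dev.write(bytes([original[0] ^ 0xFF]))

print(verify_groups(dev, sorted(order), group_size))  # [5]
```

Because the data depends on the offset, a group that lands at the wrong address (the translation-table fault mentioned for SSDs) would also fail verification, which a single repeated pattern would not reveal.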
Back when I thought of this, this was intended to test both the sectors AND head seeking of HDDs. However, with SSDs / NVMes we have a flash translation table which could, given bad firmware or internal RAM problems, write or return the wrong data. So random writes with known data and later checking could still be a valid test.
Now is this better than badblocks?
Probably not. I had the idea a LONG time ago. However, badblocks does not really exercise seeking. I mean it does seek to the next block to read or write, but does not do so extensively.
I think @dak180’s script, linked in my OP, does this. I don’t know how well-known it is, but it implements the same process that is pretty widely-recommended: short SMART, long SMART, badblocks, long SMART. A possible refinement (accounting for your goal of specifying the time the test will run for) would be to let the user specify fewer passes of badblocks than the default of four.
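For illustration only, here is a Python sketch that builds that sequence of commands with a selectable set of badblocks patterns. It is not @dak180’s script; the `burn_in_plan` function is a hypothetical helper and `/dev/sdX` is a placeholder. It only prints the commands and runs nothing, since `badblocks -w` destroys all data on the target.

```python
import shlex

# The four patterns badblocks uses by default in write mode.
DEFAULT_PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def burn_in_plan(device, patterns=DEFAULT_PATTERNS):
    """Return the burn-in steps as argv lists; nothing is executed here."""
    badblocks = ["badblocks", "-w", "-s"]  # -w: destructive write test, -s: progress
    for pattern in patterns:
        badblocks += ["-t", pattern]       # one -t per pattern, tested in order
    badblocks.append(device)
    return [
        ["smartctl", "-t", "short", device],
        ["smartctl", "-t", "long", device],
        badblocks,
        ["smartctl", "-t", "long", device],
    ]


# Two patterns instead of four, to shorten the run.
for step in burn_in_plan("/dev/sdX", patterns=("0xaa", "0x55")):
    print(shlex.join(step))
```

Note that `smartctl -t` only starts a self-test and returns immediately, so a real script has to wait for each test to finish before moving to the next step.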
This is a little simplistic; if this process catches any errors, they would surface through SMART monitoring. Badblocks will report its own errors, but if it hits a bad sector, SMART will notice it and flag a pending/uncorrectable sector. But this process will force reads and writes to every sector (ideally, repeatedly), making sure SMART is able to evaluate every sector. It will also stress the disk for a while, pushing a marginal disk toward failure.
Not the goal, at least as I see it. The goal is to aggressively weed out bad or marginal disks before they’re put into production.
I’d guess it is a similar reason as to why memtest does different test patterns; hard to predict what kind of failure you’re going to get.
It should matter some. SSD/NVMe drives have finite write endurance to consider, so writing to every sector/block probably would not be good. This would mean that BadBlocks would not be the way to go. In this situation I’d just run a SMART Extended test; however, I’m curious whether that is the consensus. Even in Non-Destructive Mode, BadBlocks writes data; it just puts back what was originally in the sector.
Agreed.
How does this test SSDs? Just curious and if you use this self-made tool, would it be proper to include it? I would like to know exactly what it does when it tests SSDs.
I agree with @dan here, we should be looking for Infant Mortality in my opinion.
Again, I stress the importance of a different set/combination of tests for HDDs vs. SSDs.
But I like the Crawl aspect. I’m certain that @Captain_Morgan can find several volunteers here in the forums to test the heck out of whatever is proposed, before it is implemented into TrueNAS. Just provide us some Linux stuff and let a few of us have fun. It might be a good idea to generate a specific list of testing criteria, for example:
- Obtain a baseline: run `smartctl -x /dev/sdX > baseline_serialnumber.txt`
- Run test #1 and record the results.
- Run test #2 and record the results.
Of course, a fully automated script would likely be best to keep everything identical.
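As one small piece of such automation, a script could compare a few SMART counters from the baseline against the values recorded after each test. The Python sketch below assumes the counters have already been extracted into plain dictionaries (parsing the smartctl output is not shown), and the sample numbers are made up for illustration.

```python
# Attribute names as they commonly appear in smartctl output for ATA drives.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def smart_regressions(baseline, after, watched=WATCHED):
    """Return {attribute: (before, after)} for watched counters that increased."""
    changes = {}
    for name in watched:
        before = baseline.get(name, 0)
        now = after.get(name, 0)
        if now > before:
            changes[name] = (before, now)
    return changes


# Made-up example values, not from a real drive.
baseline = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0,
            "Offline_Uncorrectable": 0}
after_test_1 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 8,
                "Offline_Uncorrectable": 0}

print(smart_regressions(baseline, after_test_1))
```

Which attributes to watch, and whether any increase at all should fail the drive, would be part of the testing criteria to agree on.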
You have us, use us.
Agreed. Now, a single tool that would do both, choosing whichever is more appropriate for the devices in question, would be great.
It can catch read/write errors and places a lot of stress on the drive; this is why we use it. It may pass fine, and then 600 hours later you find out you have an Infant Mortality issue. This is one of the best ways to try to weed out these problems, but nothing is 100%. If it were, the largest rocket would have had no small issues, and you know there is quality control there. You do your best; that is all you can realistically ask for.
@Fleshmauler is exactly correct. The four test patterns are specifically chosen to test the magnetic properties of every bit. It is not just about writing and reading a word; it validates every bit of data.
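A quick way to see what the four default write-mode patterns do at the bit level is to print them in binary and check that every bit position gets written as both 0 and 1. The Python below is only an illustration of the bit arithmetic; `covers_both_states` is a hypothetical helper, and the sketch says nothing about the physics of the media.

```python
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)

for p in PATTERNS:
    print(f"0x{p:02x} = {p:08b}")
# 0xaa = 10101010
# 0x55 = 01010101
# 0xff = 11111111
# 0x00 = 00000000


def covers_both_states(patterns):
    """True if every bit position is written as both 0 and 1 across the patterns."""
    return all({(p >> bit) & 1 for p in patterns} == {0, 1} for bit in range(8))


print(covers_both_states(PATTERNS))  # True
print(covers_both_states((0x00,)))   # False: no bit is ever written as 1
```

0xaa and 0x55 additionally put every bit next to neighbours of the opposite value, while 0xff and 0x00 put every bit next to neighbours of the same value.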