Which drive is offline?

WDC WUH721414ALE604 9JHDHHLT Currently sda - Looks fine
WDC WUH721414ALE604 9JGJ0WVT Currently sdb - Looks fine
TOSHIBA MG07ACA14TEY Y840A0AJFFHG Currently sdc - 1 error in the log
TOSHIBA MG07ACA14TEY Y810A05JFFHG Currently sdd - 1 error in the log
WDC WUH721414ALE604 9JHMBEJT Currently sde - Looks fine
WDC WUH721414ALE604 9JGJ3UJT Currently sdf - Looks fine
TOSHIBA MG07ACA14TEY Y880A0C1FFHG Currently sdg - 1 error in the log
TOSHIBA MG07ACA14TEY Y890A08UFFHG Currently sdh - Looks fine
WDC WUH721414ALN6L4 9RH9DH5C Currently sdi - Looks fine Edit: Actually has 8 uncorrectable errors.
TOSHIBA MG07ACA14TEY Y880A07TFFHG Currently sdj - 148 UDMA CRC Errors, 149 errors recorded in total
WDC WUH721414ALN6L4 9RHUA20C Currently sdl - 3 Reallocated sectors, 65536 read errors, may be literal. Looks dodgy, investigate further.
TOSHIBA MG07ACA14TEY Y870A03ZFFHG Currently sdm - 1 error in the log
WDC WUH721414ALN6L4 9JH45NBT Currently sdn - Looks fine
WDC WUH721414ALN6L4 9RHPHNMC Currently sdo - 15 read errors, 29 pending sectors, 16 errors in the log

  • The drives with 1 error in the log are probably fine, at least unless that starts to climb. I suspect a single event (maybe a power failure?) caused that.
  • sdj may have a cable or connector issue; watch whether the CRC errors continue to climb.
  • sdl especially, but also sdo, are suspect. Read errors are a big warning sign, as is a growing number of reallocated sectors.
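As a rough illustration of that triage, here is a minimal Python sketch. The field names, the labels, and the rule that any nonzero media counter marks a drive as suspect are my own assumptions for illustration, not anything smartctl or TrueNAS defines. The counts would be copied by hand from each drive's SMART report.

```python
from dataclasses import dataclass


@dataclass
class DriveCounters:
    """Counters copied from a SMART report (field names are hypothetical)."""
    logged_errors: int = 0
    udma_crc_errors: int = 0
    read_errors: int = 0
    reallocated_sectors: int = 0
    pending_sectors: int = 0
    uncorrectable_sectors: int = 0


def triage(c: DriveCounters) -> str:
    """Return a coarse label; the cut-offs are illustrative assumptions."""
    # Media-level problems are treated as the strongest warning sign.
    if (c.read_errors or c.reallocated_sectors
            or c.pending_sectors or c.uncorrectable_sectors):
        return "suspect"
    # CRC errors usually point at the link (cable, connector, backplane).
    if c.udma_crc_errors:
        return "check cabling"
    # A logged error with nothing else: fine unless the count climbs.
    if c.logged_errors:
        return "watch"
    return "ok"


# Values taken from the list above.
print(triage(DriveCounters(udma_crc_errors=148, logged_errors=149)))  # sdj
print(triage(DriveCounters(reallocated_sectors=3, read_errors=65536)))  # sdl
```

A label like this is only a starting point; the trend over time matters more than any single snapshot.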

I’ll also note that it would be a good idea to run a long SMART test, especially on the suspect drives. It will take around a day to complete. A SMART test uses the drive’s built-in self-diagnostic features and can help catch some issues before they result in a drive failure.

If a long test is reported as having failed, it’s usually grounds for an RMA if the drive is under warranty.

Ideally you set up SMART tests to run regularly on a schedule. A long test once a month could be a good starting point, although some prefer to run them more frequently. A short test takes a minute or two but is also very limited in what it actually tests.
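If you would rather start a test by hand than wait for a schedule, the sketch below shows one way to do it from Python. It assumes smartmontools is installed and that you run it as root; the device paths are placeholders, and the `run` parameter exists only so the function can be exercised without touching real disks.

```python
import subprocess


def long_test_command(device: str) -> list[str]:
    """Build the smartctl invocation that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def start_long_tests(devices, run=subprocess.run):
    """Start a long self-test on each device.

    The test itself runs inside the drive; smartctl returns right away,
    so this loop does not wait for the tests to finish.
    """
    for device in devices:
        # check=False: smartctl uses its exit status as a bit mask,
        # so a nonzero value is not necessarily a failure to start.
        run(long_test_command(device), check=False)


if __name__ == "__main__":
    # Placeholder device names; substitute the ones on your system.
    start_long_tests(["/dev/sdl", "/dev/sdo"])
```

Once a test has finished, `smartctl -l selftest /dev/sdX` prints the self-test log with the result.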

Thanks for summarizing all that.

So it seems like you think I might not need to swap out a drive yet based on the SMART data. Is there other data I should look at too, or is that the primary source?

Two other times I had notifications like this, I replaced the drives, and the companies replaced them under warranty, and I purchased cold spares.

How often do the letter assignments (sda, sdo, etc) change? Seems weird they would change often.

Won’t a test on a potentially problematic drive potentially cause a failure?

The SMART report gives you a tally; at this point I would keep an eye on it. If the errors continue to climb on the same drives, they are failing.
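To make "keep an eye on it" concrete, here is a small Python sketch that compares two snapshots of the counters and reports only what went up. The snapshot layout (serial number mapped to a dict of counter names and values) is an assumption of mine, and the "later" numbers in the example are made up purely to show the output shape.

```python
def climbing(previous: dict, current: dict) -> dict:
    """Return {serial: {counter: (old, new)}} for counters that increased.

    Drives with no earlier snapshot are skipped, since there is
    nothing to compare them against yet.
    """
    changes = {}
    for serial, now in current.items():
        if serial not in previous:
            continue
        before = previous[serial]
        grown = {
            name: (before.get(name, 0), value)
            for name, value in now.items()
            if value > before.get(name, 0)
        }
        if grown:
            changes[serial] = grown
    return changes


# First snapshot uses counts from the list above; the second is invented.
earlier = {"Y880A07TFFHG": {"udma_crc": 148}, "Y840A0AJFFHG": {"logged": 1}}
later = {"Y880A07TFFHG": {"udma_crc": 152}, "Y840A0AJFFHG": {"logged": 1}}
print(climbing(earlier, later))
```

Keying the snapshots by serial number rather than by device name keeps the comparison valid even when the sdX letters shuffle between boots.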

It can happen every time the server boots. The fact that the notifications use these volatile device names to identify potential problems is unfortunate.

It depends on how you look at it. A test may stress a drive such that it chokes. But if a SMART test manages that, isn’t it better to know early so you can get it replaced?

SMART tests are routine and you are, as a sensible data hoarder, expected to use them as one of many tools to keep tabs on the health of your drives.

I think you’re the first netizen to call me sensible in almost a decade. What are you doing later? :kissing_heart: :rofl:

Getting back to the topic at hand, I guess I could put this back in the equipment closet and kick off some SMART tests. SDL and SDO first. Maybe I’ll leave it on my workbench and run the test so it’s easier with the names.

Do that long test on all the drives just to see where you’re at right now.
If a drive fails the long test, RMA if still available.

After that it would be good to schedule recurring SMART tests. Do so under Data Protection > Periodic S.M.A.R.T. Tests.

For extra points, look into using something like the most excellent Multi-Report to help you keep tabs on the tests and on whether your SMART attributes are going up in a bad way. There have been a few issues with Multi-Report in SCALE 24.10.1 specifically, but its developer joeschmuck has been hard at work resolving those and a new release is set to go stable in the near future. You can follow the development of the newest version in this thread.

Well, SDO failed an extended test, but the report doesn’t say much as to why.

Remaining: 0.9
Lifetime: 11334
Error: 3418095502

SDL is still going.

They don’t interfere if you run 2 at the same time, do they?

They can run concurrently just fine. A reboot would interrupt it, but that’s about it.

Reallocated sectors should be enough. If these were my drives, sdl and sdo would be either out to RMA or in the bin.

Potentially any time you reboot. Always track by serial number. (Complaints should be addressed to the Linux kernel maintainers…)
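For what it's worth, one way to look up which letter a serial number currently has is to parse the JSON that `lsblk -J -d -o NAME,SERIAL` prints. The helper below is a minimal sketch of that; the function name is mine, and the sample string is trimmed to a few of the serials mentioned in this thread.

```python
import json


def serial_to_device(lsblk_json: str) -> dict:
    """Map serial number -> current /dev path from `lsblk -J -d -o NAME,SERIAL` output."""
    data = json.loads(lsblk_json)
    return {
        dev["serial"]: "/dev/" + dev["name"]
        for dev in data["blockdevices"]
        if dev.get("serial")  # skip devices that report no serial
    }


# Sample shaped like lsblk's JSON output.
sample = """
{"blockdevices": [
  {"name": "sdi", "serial": "9RH9DH5C"},
  {"name": "sdl", "serial": "9RHUA20C"},
  {"name": "sdo", "serial": "9RHPHNMC"}
]}
"""
mapping = serial_to_device(sample)
print(mapping["9RHPHNMC"])
```

On a live system you would feed it the real command output, for example via `subprocess.run([...], capture_output=True, text=True).stdout`, and rerun it after every reboot before pulling a drive.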

SDL passed its extended SMART test. I will replace SDO, then let it rebuild. Should I run a test on anything else? Or wait till it rebuilds and then replace SDL?

SDI is currently 9RH9DH5C
SDO is currently 9RHPHNMC
SDL is currently 9RHUA20C

Also, I have the following notifications:

Device: /dev/sdo [SAT], Self-Test Log error count increased from 0 to 1.
Jan 10, 2025 15:04:44 (America/New_York)
Device: /dev/sdo [SAT], 29 Currently unreadable (pending) sectors.
Jan 11, 2025 11:04:44 (America/New_York)
Device: /dev/sdi [SAT], 8 Currently unreadable (pending) sectors.
Jan 11, 2025 11:04:44 (America/New_York)
Device: /dev/sdi [SAT], 8 Offline uncorrectable sectors.
Jan 11, 2025 11:04:44 (America/New_York)

Schedule regular SMART tests on all drives.

That sdi is also showing bad sectors is worrying. Unfortunately, I misread that SMART report: there were actually 8 uncorrectable errors in the report you posted.

You are fast approaching a pool failure. The unavoidable resilvers may push you over the edge; it depends on whether the data on your “working” drives is readable or not.

I replaced SDO with one of my cold spares, resilvered, reached out to the seller for an RMA, and sent off the drive.

I ran a long SMART test on 9RH9DH5C (which was SDI above, but is now SDJ) and it failed rather quickly, just like SDO above. Time to do it all over again.

Remaining: 0.9
Lifetime: 11429
Error: 3417497693