Got a faulted disk message. The drive shows 28 read failures in the web UI. I went ahead and ordered a new one (I should have had spares on hand all along), but in the meantime, since the pool already took the drive offline, I ran an offline extended SMART test. Unless I'm misreading this, the drive looks fine:
root@Fr33dan-NAS[~]# smartctl -a /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, [Link Removed So I Could Post]
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFZX-68AWUN0
Serial Number: WD-WX82DA1PFA11
LU WWN Device Id: 5 0014ee 214ce1f73
Firmware Version: 81.00B81
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Feb 25 12:27:01 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (43440) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 462) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 219 219 021 Pre-fail Always - 4050
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 56
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 25635
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 55
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 30
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 85
194 Temperature_Celsius 0x0022 123 105 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 25623 -
# 2 Short offline Completed without error 00% 25480 -
# 3 Short offline Completed without error 00% 25312 -
# 4 Extended offline Completed without error 00% 25282 -
# 5 Short offline Completed without error 00% 25144 -
# 6 Short offline Completed without error 00% 24976 -
# 7 Short offline Completed without error 00% 24808 -
# 8 Short offline Completed without error 00% 24640 -
# 9 Extended offline Completed without error 00% 24539 -
#10 Short offline Completed without error 00% 24473 -
#11 Short offline Completed without error 00% 24305 -
#12 Short offline Completed without error 00% 24137 -
#13 Short offline Completed without error 00% 23969 -
#14 Short offline Completed without error 00% 23801 -
#15 Extended offline Completed without error 00% 23796 -
#16 Short offline Completed without error 00% 23634 -
#17 Short offline Completed without error 00% 23466 -
#18 Short offline Completed without error 00% 23298 -
#19 Short offline Completed without error 00% 23130 -
#20 Extended offline Completed without error 00% 23076 -
#21 Short offline Completed without error 00% 22962 -
Selective Self-tests/Logging not supported
The above only provides legacy SMART information - try 'smartctl -x' for more
Mainly I want an opinion on whether I should be worried that something else is faulty and investigate further, or just replace the drive and move on.
Good tip about the formatting; I'm not too familiar with this style of forum. I really wish I could figure out how to see a preview before I post.
Yeah, I know I shared barely any info; I was just trying to see if the fridge smells bad from where y'all are standing to decide if I should clean it out. I'll extend that metaphor and interpret your response as "can't tell from here, what's in it?", which I should have seen coming.
Anywho, enough rambling. I'm running SCALE Dragonfish-24.04.2.5 with a total of 12 drives in 2 RAIDZ1 VDEVs. It was originally set up as one VDEV with 8 drives because I didn't know what I was doing, and later expanded with a second 4-drive VDEV. At the time of the expansion the drives were moved into a Dell MD1200, and it has been running in this configuration since January 2024. The faulted drive is one of the initial 8 and has been running since March 2022. This is getting lengthy and I think that's everything relevant, so I'll stop here, but if more details are needed let me know.
Your SMART test looks decent. Drive hours are getting up there a bit, and with RAIDZ1 VDEVs you can only tolerate a single drive failure per VDEV before you lose the entire pool.
Your setup has a bit more complexity with the cables and the MD1200, and you didn't mention the HBA: is it a plain HBA, flashed to IT Mode, or a RAID controller? I hope it's not RAID. Also make sure all drive models use CMR recording technology, not SMR.
Do you have regular SMART Long tests scheduled? You could also consider setting up and running the Multi-Report script that a lot of users run.
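If you ever want to kick a long test off by hand and check the result afterwards, it's just the following (a rough sketch, assuming sdf is still the right device letter):
smartctl -t long /dev/sdf       # start an extended self-test; it runs in the drive's background
smartctl -l selftest /dev/sdf   # check the self-test log once the recommended polling time has passed
smartctl -H /dev/sdf            # quick overall health verdict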
I can't find the details from when I ordered the HBA, but it's in the most do-nothing mode possible. IT Mode, I guess? I hadn't heard the term, but from googling it that sounds like what I'm using: each drive is presented directly to the OS. I'm not trying to deal with a hardware RAID headache; I'd rather let the software handle the RAID. I did learn enough before purchasing to know to get CMR drives. Long tests run every month. I'll check out that Multi-Report!
zpool status:
pool: fr33dan-raid
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: [Link Removed Again So Can Post]
scan: scrub repaired 0B in 11:26:32 with 1 errors on Sun Feb 2 11:27:06 2025
config:
NAME STATE READ WRITE CKSUM
fr33dan-raid DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
987c9bf6-56b2-4b30-9c12-55109955bb55 ONLINE 0 0 6
9f34b8c6-1841-4840-925e-3920fbe820de ONLINE 0 0 6
ea31863d-b313-4776-940e-be6f730a8116 ONLINE 0 0 6
48b88771-4980-4d67-acb2-36c98b09f69a ONLINE 0 0 6
4e3dd146-a7ac-4f6d-a7b6-036004da98b7 ONLINE 0 0 6
5bbcbc88-a406-4c6b-b391-a4202cebfc6b ONLINE 0 0 6
989c8c80-be08-4267-abc3-f5646dd06de7 FAULTED 28 0 6 too many errors
bace6a63-bc18-4657-a8bf-e7fdb05afa37 ONLINE 0 0 6
raidz1-1 ONLINE 0 0 0
59eab28c-20b1-4311-948e-4ee246067284 ONLINE 0 0 0
45a83ba4-6121-49c9-b79a-0be2f58b3082 ONLINE 0 0 0
e3f0750f-2388-47dd-811b-71e185fb1a15 ONLINE 0 0 0
513179e8-2bd2-482f-b2cc-b20f20ea3b95 ONLINE 0 0 0
Oh, right, the checksum errors. I'm pretty sure those are a different issue, but I guess I should explain if I'm going to claim that.
They have occurred since very early on, even when the drives were connected through a PCI add-in card that adds SATA ports. One of the things I store is an active copy of my entire Steam library, kept on an SMB share and managed by a Windows PC. These checksum errors sometimes show up when Steam updates a game while a scrub is running. I think it first occurred during my second regularly scheduled scrub (35-day threshold, Sundays only), and it happens about every third scrub or so. I've learned to ignore it, even though I probably shouldn't.
Here’s the output of that command anyway though:
root@Fr33dan-NAS[~]# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved
Adapter Selected is a LSI SAS: SAS2008(B2)
Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:03:00:00
SAS Address : 500605b-0-0628-c9d0
NVDATA Version (Default) : 14.01.00.07
NVDATA Version (Persistent) : 14.01.00.07
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9200-8e
BIOS Version : 07.39.02.00
UEFI BSD Version : 07.27.01.01
FCODE Version : N/A
Board Name : SAS9200-8e
Board Assembly : H3-25260-02C
Board Tracer Number : SP31519350
Finished Processing Commands Successfully.
Exiting SAS2Flash.
It certainly looks in-depth. At a glance it's very dense and intimidating, but that's just a formatting critique. The bit of it I've followed so far makes a lot of sense, and the content and actions it outlines seem useful. I'll take a deeper look later and let you know if I have more feedback.
I completely understand. I plan to break it up into a few more pages in the next version. But it should be as simple as answering the questions.
EDIT: Out of curiosity, how did you determine the drive to run the SMART test on? I try not to assume anything.
I also think you need to run a zpool scrub and, if you have no file issues, run a zpool clear, then monitor it. The flowchart should tell you that on the first chart.
Seems you're on the latest firmware. If you want, you can run a scrub and then zpool clear to clear out the errors. If they keep coming back on that drive, it might be worth reseating the connections… Issues still persist? Then yeah, I guess a swap is due.
zpool status shows only the drive UUID, but the web UI marks it as sdf.
I decided to pop the drive out to take a peek and reseat it. Then I cleared the pool and started a scrub. It pretty much immediately jumped to 58 errors and re-faulted/offlined the drive.
The replacement will be here Thursday. I got two, so even after this I'll have a cold spare on hand.
Geez, that was so easy. I forget about the easy stuff and go straight to the command line instead.
It is odd that the drive has thrown errors a second time when nothing is obviously wrong with it.
Before replacing the drive, let's prove it really is the drive. Please do not take the way I am addressing this as a sign that you are clueless. I do my best never to assume we are communicating correctly, so I try hard to minimize that kind of error. I have no idea what you know or think you know, or whether, if I say "do this", you would do it the same way I would, so it is best to spell out each step as if you do not know. I do assume you can get to a Shell window.
On the CLI (Shell Window) enter zpool status -v and then post the entire output (yes all of it) using the preformatted text style.
Next enter lsblk -o +PARTUUID,NAME,LABEL,SERIAL and post that output as well. This cross-references the drives to their partition UUIDs and includes the serial numbers (good stuff to have).
That gives us the data we need to start with. Now let's run through it:
In the zpool status -v output, does it list any files as corrupt? This is important. If it does, then you must delete those files before going any further; they are corrupt. Then run zpool scrub fr33dan-raid, let it finish, grab the zpool status output again, and make sure it does not list any corrupt files. If you still have corrupt files listed (not the same thing as Read/Write/Cksum errors), delete those files and scrub again. The goal is to have no corrupt files listed.
Once there are no corrupt files listed, then run zpool clear fr33dan-raid and then zpool status -v and ensure all the errors are gone.
Last step if we get this far… Run zpool scrub fr33dan-raid one last time and then check it once completed with zpool status -v.
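To recap those steps as plain commands you can paste one at a time (a sketch; wait for each step to look clean before moving on):
zpool status -v fr33dan-raid   # note and delete any corrupt files listed under "errors:"
zpool scrub fr33dan-raid       # let it finish, then check zpool status -v again
zpool clear fr33dan-raid       # only once no corrupt files remain
zpool scrub fr33dan-raid       # one last pass to confirm the errors stay gone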
Like I said, the values you report for the drive look very good. You could also post smartctl -x /dev/sdf (assuming sdf is still the drive identifier; it can and does change based on your hardware and which drive reports ready first, as best I understand it).
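If you're not sure sdf still points at the same disk, you can address it by serial number instead; the /dev/disk/by-id/ names are stable across reboots (a sketch, the exact name on your system may differ):
ls -l /dev/disk/by-id/ | grep WX82DA1PFA11
smartctl -x /dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WX82DA1PFA11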
As I mentioned, I already ran a zpool clear, so any errors I had are already cleared and zpool status isn't going to show much. Before I cleared it, the only reported error was a file in the active Steam installs; see post 9 in this thread. To add a bit more detail: at first I would always verify the game files in Steam with the intention of restoring any corrupted file(s), but it would report no errors. Then I'd do a zpool clear followed by a re-scrub that would come out fine. Eventually I started skipping the verify, which led to skipping the re-scrub, until I finally just stopped doing anything about it and a random zpool status would show 6 CKSUM errors.
I did run it with -v before the clear, though. There was only one actual file with an error, reported 3 times because it was the file plus its 2 snapshots (I intentionally keep a low snapshot count on this dataset because of its frequent updates, large size, and the relative unimportance of the data). It was a Rocket League file, a common culprit of this error because it gets updated all the time.
zpool status (with and without -v) shows basically the same as before, except with 58 errors now as mentioned, the scan line updated to reflect the scrub I started, and the boot pool info:
root@Fr33dan-NAS[~]# zpool status -v
pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:03:15 with 0 errors on Fri Feb 21 03:48:17 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sdm3 ONLINE 0 0 0
errors: No known data errors
pool: fr33dan-raid
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub in progress since Tue Feb 25 21:46:40 2025
7.45T / 30.3T scanned at 758M/s, 7.40T / 30.3T issued at 753M/s
1020K repaired, 24.41% done, 08:52:13 to go
config:
NAME STATE READ WRITE CKSUM
fr33dan-raid DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
987c9bf6-56b2-4b30-9c12-55109955bb55 ONLINE 0 0 0
9f34b8c6-1841-4840-925e-3920fbe820de ONLINE 0 0 0
ea31863d-b313-4776-940e-be6f730a8116 ONLINE 0 0 0
48b88771-4980-4d67-acb2-36c98b09f69a ONLINE 0 0 0
4e3dd146-a7ac-4f6d-a7b6-036004da98b7 ONLINE 0 0 0
5bbcbc88-a406-4c6b-b391-a4202cebfc6b ONLINE 0 0 0
989c8c80-be08-4267-abc3-f5646dd06de7 FAULTED 58 0 0 too many errors
bace6a63-bc18-4657-a8bf-e7fdb05afa37 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
59eab28c-20b1-4311-948e-4ee246067284 ONLINE 0 0 0
45a83ba4-6121-49c9-b79a-0be2f58b3082 ONLINE 0 0 0
e3f0750f-2388-47dd-811b-71e185fb1a15 ONLINE 0 0 0
513179e8-2bd2-482f-b2cc-b20f20ea3b95 ONLINE 0 0 0
errors: No known data errors
lsblk was a nice way to confirm which drive the UI was pointing at.
Oh! I guess when it was empty and I didn't know what that space was, I clicked the arrow to collapse it. Fantastic!
I still don't see any issues with the drive itself, so this is very strange. I try to look at it from a manufacturer's perspective: what are the drive's failing values? Replacing the drive may fix the issue, but what is the issue?
You said you have had these problems for a very long time, but I didn't see you state that it was always specifically this one drive (or I missed it). I assume that is the case, but as I said before, I try not to make assumptions. It is unlikely to be the HBA, since you had this problem before the HBA.
You really have an odd problem, and I am trying to figure out what the cause is and prove it. One reason I am digging into this is that, being the author of Multi-Report, if there is a different failure value to check, I'd like to include it in the script.
Can you post the output of smartctl -x /dev/sdf? It shows additional data about the drive. This is presenting more like an SMR drive, but you said you purchased the drive knowing about that issue, and I verified that the provided drive part number is in fact CMR, so that isn't it. Again, there are no drive-specific errors, so the -x output may identify an issue, but I'm not holding my breath.
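If you want to double-check the model and serial of every drive in the shelf in one pass, a small loop does it (a sketch; adjust the glob if your device names go past single letters):
for d in /dev/sd?; do
  echo "== $d =="
  smartctl -i "$d" | grep -E 'Device Model|Serial Number|Rotation Rate'
done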
If that doesn’t show anything obvious, you can try this:
Enter smartctl -l scterc /dev/sdf and it will tell you your current SCT values. I expect it to tell you that they are set to Read and Write values of 70 (7.0 seconds) since WD Red drives have this set by default.
If they are "Disabled", then enter smartctl -l scterc,70,70 /dev/sdf to enable the built-in SCT error recovery. Then run the first command again to ensure the drive retained the values.
If you desire, you can increase the write value to 100 (10.0 seconds), as it appears the write operation is the problem (corrupt data).
This setting is normally there to keep a drive from dropping out of the RAIDZ, so I have no idea if it will help here.
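For reference, on many drives the SCT ERC setting does not survive a power cycle, so if you do change it, re-check it after the next reboot. Applying it to every drive in one pass is just a small loop (a sketch, assuming single-letter sdX names; drives that don't support SCT ERC will simply report an error):
for d in /dev/sd?; do
  smartctl -l scterc,70,70 "$d"    # 7.0 second read/write error recovery timeouts
done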
If none of that works, I really hope replacing the drive solves the problem so you can put this all behind you.
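When the replacement arrives, the swap itself is straightforward. The SCALE web UI has a Replace option under the pool's device list that handles the partitioning for you; the CLI equivalent is roughly the following (a sketch; the new device name is a placeholder):
zpool offline fr33dan-raid 989c8c80-be08-4267-abc3-f5646dd06de7   # the faulted member
zpool replace fr33dan-raid 989c8c80-be08-4267-abc3-f5646dd06de7 /dev/sdX
zpool status fr33dan-raid                                         # watch the resilver progress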
On to a slightly different topic: once you have the problem fixed, do you have any plans to back up all your data and recreate the pool as a single RAIDZ2 or RAIDZ3, or some other layout, to adjust (I can't say fix) the pool and VDEVs? Maybe you like it this way, which is perfectly fine as well, and I would not do this until after the issue at hand is fixed.
So there are (what I think are) two separate issues that you may be mentally processing as one. The checksum errors are a longstanding issue that predates the HBA. I've always attributed them to some kind of race condition where Steam updates a file in the middle of the scrub checking that specific file.
The read errors are new and are what produced the faulted drive message.
Here is the extended SMART output. This does show some errors, but I don't know what to make of the error info it's giving me:
root@Fr33dan-NAS[~]# smartctl -x /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFZX-68AWUN0
Serial Number: WD-WX82DA1PFA11
LU WWN Device Id: 5 0014ee 214ce1f73
Firmware Version: 81.00B81
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Feb 26 10:54:19 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (43440) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 462) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 218 218 021 - 4083
4 Start_Stop_Count -O--CK 100 100 000 - 57
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 065 065 000 - 25636
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 56
192 Power-Off_Retract_Count -O--CK 200 200 000 - 31
193 Load_Cycle_Count -O--CK 200 200 000 - 88
194 Temperature_Celsius -O---K 125 105 000 - 25
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 78 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 17
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 17 [16] occurred at disk power-on lifetime: 25623 hours (1067 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 01 4d 5c 24 08 40 00 Error: UNC at LBA = 0x14d5c2408 = 5592851464
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 07 f8 00 00 00 01 4d 5c 24 08 40 00 00:12:22.959 READ FPDMA QUEUED
60 07 f8 00 00 00 01 4d 5c 1c 10 40 00 00:12:22.951 READ FPDMA QUEUED
60 07 e8 00 00 00 01 4d 5c 14 28 40 00 00:12:22.944 READ FPDMA QUEUED
60 08 00 00 30 00 01 4d 5c 0c 28 40 00 00:12:22.937 READ FPDMA QUEUED
60 00 28 00 08 00 01 c8 2b 35 98 40 00 00:12:22.937 READ FPDMA QUEUED
Error 16 [15] occurred at disk power-on lifetime: 25609 hours (1067 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 2b aa f9 80 40 00 Error: UNC at LBA = 0x2baaf980 = 732625280
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 2b aa f9 80 40 00 14d+09:17:37.749 READ FPDMA QUEUED
60 00 28 00 00 00 00 ec d8 40 40 40 00 14d+09:17:37.605 READ FPDMA QUEUED
60 00 28 00 00 00 01 49 a2 5f 98 40 00 14d+09:17:37.499 READ FPDMA QUEUED
60 00 28 00 00 00 01 5c b2 52 48 40 00 14d+09:17:37.324 READ FPDMA QUEUED
60 00 28 00 00 00 01 5c b2 52 20 40 00 14d+09:17:37.324 READ FPDMA QUEUED
Error 15 [14] occurred at disk power-on lifetime: 25608 hours (1067 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 01 c7 b1 f4 d0 40 00 Error: UNC at LBA = 0x1c7b1f4d0 = 7645295824
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 08 00 01 c7 b1 f4 d0 40 00 14d+09:10:38.909 READ FPDMA QUEUED
60 01 58 00 00 00 01 c7 b1 a8 90 40 00 14d+09:10:38.909 READ FPDMA QUEUED
60 01 30 00 10 00 01 c7 b1 a6 f0 40 00 14d+09:10:38.906 READ FPDMA QUEUED
60 01 08 00 00 00 01 c7 b1 a5 98 40 00 14d+09:10:38.906 READ FPDMA QUEUED
60 00 28 00 10 00 01 c7 b1 a6 08 40 00 14d+09:10:38.899 READ FPDMA QUEUED
Error 14 [13] occurred at disk power-on lifetime: 25607 hours (1066 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 01 69 cd e6 d8 40 00 Error: UNC at LBA = 0x169cde6d8 = 6070068952
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 08 00 01 69 cd e6 d8 40 00 14d+07:45:14.083 READ FPDMA QUEUED
60 00 20 00 00 00 01 69 cd e5 b8 40 00 14d+07:45:14.083 READ FPDMA QUEUED
60 00 28 00 00 00 01 69 cd e5 90 40 00 14d+07:45:14.083 READ FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 00 00 14d+07:45:13.991 FLUSH CACHE EXT
61 00 18 00 00 00 00 af 8f 59 58 40 00 14d+07:45:13.991 WRITE FPDMA QUEUED
Error 13 [12] occurred at disk power-on lifetime: 25603 hours (1066 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 74 d0 3b 20 40 00 Error: UNC at LBA = 0x74d03b20 = 1959803680
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 28 00 00 00 00 74 d0 3b 20 40 00 14d+04:09:04.694 READ FPDMA QUEUED
60 00 28 00 00 00 01 3f a3 a5 d8 40 00 14d+04:09:04.622 READ FPDMA QUEUED
60 00 28 00 00 00 01 3f a3 54 40 40 00 14d+04:09:04.608 READ FPDMA QUEUED
60 00 28 00 00 00 01 a6 3e f9 f0 40 00 14d+04:09:04.537 READ FPDMA QUEUED
60 00 28 00 08 00 01 a6 3e f9 80 40 00 14d+04:09:04.537 READ FPDMA QUEUED
Error 12 [11] occurred at disk power-on lifetime: 25603 hours (1066 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 a3 19 c6 b0 40 00 Error: UNC at LBA = 0xa319c6b0 = 2736375472
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 a3 19 c6 b0 40 00 14d+03:30:09.407 READ FPDMA QUEUED
60 00 28 00 00 00 00 a3 19 c6 88 40 00 14d+03:30:09.249 READ FPDMA QUEUED
60 00 20 00 00 00 00 e5 ca 91 70 40 00 14d+03:30:09.170 READ FPDMA QUEUED
60 00 28 00 08 00 01 8a 7f b3 a8 40 00 14d+03:30:08.869 READ FPDMA QUEUED
60 00 20 00 00 00 01 8a 7f b5 00 40 00 14d+03:30:08.869 READ FPDMA QUEUED
Error 11 [10] occurred at disk power-on lifetime: 25601 hours (1066 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 6c c3 57 90 40 00 Error: UNC at LBA = 0x6cc35790 = 1824741264
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 6c c3 57 90 40 00 14d+02:05:24.445 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 57 40 40 00 14d+02:05:24.439 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 57 68 40 00 14d+02:05:24.433 READ FPDMA QUEUED
60 00 20 00 00 00 00 6c c3 56 f8 40 00 14d+02:05:24.420 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 56 d0 40 00 14d+02:05:24.413 READ FPDMA QUEUED
Error 10 [9] occurred at disk power-on lifetime: 25599 hours (1066 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 42 7e d6 08 40 00 Error: UNC at LBA = 0x427ed608 = 1115608584
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 42 7e d6 08 40 00 13d+23:55:21.329 READ FPDMA QUEUED
60 00 28 00 00 00 01 94 30 10 f8 40 00 13d+23:55:21.322 READ FPDMA QUEUED
60 00 28 00 00 00 00 db 3f bf 48 40 00 13d+23:55:21.313 READ FPDMA QUEUED
60 00 28 00 00 00 01 94 30 10 10 40 00 13d+23:55:21.166 READ FPDMA QUEUED
60 00 28 00 00 00 01 ae 2d ef b8 40 00 13d+23:55:21.152 READ FPDMA QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 25625 -
# 2 Extended offline Completed without error 00% 25623 -
# 3 Short offline Completed without error 00% 25480 -
# 4 Short offline Completed without error 00% 25312 -
# 5 Extended offline Completed without error 00% 25282 -
# 6 Short offline Completed without error 00% 25144 -
# 7 Short offline Completed without error 00% 24976 -
# 8 Short offline Completed without error 00% 24808 -
# 9 Short offline Completed without error 00% 24640 -
#10 Extended offline Completed without error 00% 24539 -
#11 Short offline Completed without error 00% 24473 -
#12 Short offline Completed without error 00% 24305 -
#13 Short offline Completed without error 00% 24137 -
#14 Short offline Completed without error 00% 23969 -
#15 Short offline Completed without error 00% 23801 -
#16 Extended offline Completed without error 00% 23796 -
#17 Short offline Completed without error 00% 23634 -
#18 Short offline Completed without error 00% 23466 -
Selective Self-tests/Logging not supported
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
Device State: Active (0)
Current Temperature: 25 Celsius
Power Cycle Min/Max Temperature: 24/28 Celsius
Lifetime Min/Max Temperature: 22/45 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/65 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (295)
Index Estimated Time Temperature Celsius
296 2025-02-26 02:57 24 *****
... ..(144 skipped). .. *****
441 2025-02-26 05:22 24 *****
442 2025-02-26 05:23 25 ******
... ..( 46 skipped). .. ******
11 2025-02-26 06:10 25 ******
12 2025-02-26 06:11 26 *******
... ..( 94 skipped). .. *******
107 2025-02-26 07:46 26 *******
108 2025-02-26 07:47 25 ******
... ..(181 skipped). .. ******
290 2025-02-26 10:49 25 ******
291 2025-02-26 10:50 24 *****
... ..( 3 skipped). .. *****
295 2025-02-26 10:54 24 *****
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 56 --- Lifetime Power-On Resets
0x01 0x010 4 25636 --- Power-on Hours
0x01 0x018 6 40795032900 --- Logical Sectors Written
0x01 0x020 6 1534462615 --- Number of Write Commands
0x01 0x028 6 253134394217 --- Logical Sectors Read
0x01 0x030 6 1956678072 --- Number of Read Commands
0x01 0x038 6 2095286784 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 25515 --- Spindle Motor Power-on Hours
0x03 0x010 4 25490 --- Head Flying Hours
0x03 0x018 4 120 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 31 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 17 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 26 --- Current Temperature
0x05 0x010 1 27 --- Average Short Term Temperature
0x05 0x018 1 26 --- Average Long Term Temperature
0x05 0x020 1 44 --- Highest Temperature
0x05 0x028 1 22 --- Lowest Temperature
0x05 0x030 1 43 --- Highest Average Short Term Temperature
0x05 0x038 1 24 --- Lowest Average Short Term Temperature
0x05 0x040 1 38 --- Highest Average Long Term Temperature
0x05 0x048 1 25 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 563 --- Number of Hardware Resets
0x06 0x010 4 179 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 1 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 47593 Vendor specific
Only if you attribute the checksum errors to the drive. If they are separate things, then the only errors presenting that relate specifically to the drive are the read errors. smartctl -l scterc /dev/sdf shows the expected default of 70.
I've been thinking of reorganizing the pool for a while now. Everything is already backed up in AWS Glacier Deep Archive, so restoring would not be quick, and I've been putting it off for at least 8 months (the backup was intended as a failsafe, not transitional storage). It's more data than I could reasonably obtain secondary storage for locally; honestly I may never get to it. My highest priority is expandability: when my pool gets full I want a way to add more drives and extend the existing organization, rather than add new spaces. When I set it up I stupidly thought I could add drives to a VDEV or switch to RAIDZ2/3 later, and thus that the best course was one giant VDEV.
Now, what I think would be ideal for me is doing everything in 4-drive RAIDZ1 VDEVs (a decision made before the first expansion). I think the risk of losing 2 out of the 4 drives in a given VDEV is acceptably low, and it would suit my requirement of being expandable.
The pool is at 72% capacity, so I'm gonna need another expansion soon; this discussion is actually prescient and I'm open to suggestions.
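For what it's worth, the way I understand the next expansion would work is roughly this (a sketch; the device names are placeholders, and I'd actually do it through the Add VDEV flow in the web UI rather than raw commands so the partitioning is handled for me):
zpool add fr33dan-raid raidz1 /dev/sdX /dev/sdY /dev/sdZ /dev/sdW
zpool status fr33dan-raid      # the new raidz1-2 VDEV should show up alongside the others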
The extended data tells me that you have had 17 Uncorrectable Read Errors for the lifetime of the drive.
0x04 0x008 4 17 --- Number of Reported Uncorrectable Errors
I can’t explain why these were not echoed in the ID 198 value.
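For anyone following along, those two pieces come from specific SMART logs, and you can pull just them rather than the full -x dump (standard smartctl options, assuming sdf is still the device):
smartctl -l devstat /dev/sdf    # Device Statistics (GP Log 0x04), including Reported Uncorrectable Errors
smartctl -l xerror /dev/sdf     # Extended Comprehensive SMART Error Log with the UNC entries above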
The code basically says that head #16 (always this head) is hitting an uncorrectable bit while reading many different sectors on the disk; it is not confined to a specific area. But there are not actually 16 physical heads; that is a holdover from the old days of 16 logical heads.
So what I am saying is, it does look like your drive is bad, but I still don't understand why it did not show up in the standard data section.
As for redoing your pool, you could think about backing up just the data you absolutely need on hand, then rebuilding, copying that data back to make your system operational, and finally starting the very long restoration of all the other data from the off-site location. That is the only easy option I can see when someone has a huge amount of data.
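If you go that route, the copy-out/copy-back of the must-have datasets can be done with snapshots and zfs send/receive. A minimal sketch, assuming a hypothetical dataset name and a temporary pool called temppool on whatever spare storage you can scrounge:
zfs snapshot -r fr33dan-raid/important@pre-rebuild
zfs send -R fr33dan-raid/important@pre-rebuild | zfs receive -u temppool/important
# destroy and recreate fr33dan-raid with the new layout, then reverse the direction:
zfs send -R temppool/important@pre-rebuild | zfs receive -u fr33dan-raid/important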