Got a faulted disk message. The drive shows 28 read failures in the web UI. I went ahead and ordered a new one (I should have had spares on hand all along), but in the meantime, since the pool already took the drive offline, I ran an offline extended SMART test. Unless I'm misreading this, the drive looks fine:
root@Fr33dan-NAS[~]# smartctl -a /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, [Link Removed So I Could Post]
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFZX-68AWUN0
Serial Number: WD-WX82DA1PFA11
LU WWN Device Id: 5 0014ee 214ce1f73
Firmware Version: 81.00B81
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Feb 25 12:27:01 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (43440) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 462) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 219 219 021 Pre-fail Always - 4050
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 56
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 25635
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 55
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 30
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 85
194 Temperature_Celsius 0x0022 123 105 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 25623 -
# 2 Short offline Completed without error 00% 25480 -
# 3 Short offline Completed without error 00% 25312 -
# 4 Extended offline Completed without error 00% 25282 -
# 5 Short offline Completed without error 00% 25144 -
# 6 Short offline Completed without error 00% 24976 -
# 7 Short offline Completed without error 00% 24808 -
# 8 Short offline Completed without error 00% 24640 -
# 9 Extended offline Completed without error 00% 24539 -
#10 Short offline Completed without error 00% 24473 -
#11 Short offline Completed without error 00% 24305 -
#12 Short offline Completed without error 00% 24137 -
#13 Short offline Completed without error 00% 23969 -
#14 Short offline Completed without error 00% 23801 -
#15 Extended offline Completed without error 00% 23796 -
#16 Short offline Completed without error 00% 23634 -
#17 Short offline Completed without error 00% 23466 -
#18 Short offline Completed without error 00% 23298 -
#19 Short offline Completed without error 00% 23130 -
#20 Extended offline Completed without error 00% 23076 -
#21 Short offline Completed without error 00% 22962 -
Selective Self-tests/Logging not supported
The above only provides legacy SMART information - try 'smartctl -x' for more
Mainly I want an opinion on whether I should be worried that something else is faulty and investigate further, or just replace the drive and move on.
Good tip about the formatting; I'm not too familiar with this style of forum. I really wish I could figure out how to see a preview before I post.
Yeah, I know I shared barely any info; I was just trying to see if the fridge smells bad from where y'all are standing to decide if I should clean it out. I'll extend that metaphor and interpret your response as "can't tell from here, what's in it?", which I should have seen coming.
Anywho, enough rambling. I'm running SCALE Dragonfish-24.04.2.5 with a total of 12 drives in 2 RAIDZ1 VDEVs. It was originally set up as one VDEV with 8 drives because I didn't know what I was doing, and later expanded with a second 4-drive VDEV. At the time of the expansion the drives were moved into a Dell MD1200, and it has been running in this configuration since January 2024. The faulted drive is one of the initial 8 and has been running since March 2022. This is getting lengthy and I think that's everything relevant, so I'll stop here, but if more details are needed let me know.
Your SMART test looks decent. Drive hours are getting up there a bit, and with RAIDZ1 VDEVs you can only tolerate a single drive failure per VDEV before you lose the entire pool.
Your setup has a bit more complexity with the cables and the MD1200, and you didn't mention the HBA: is it a plain HBA, flashed to IT Mode, or a RAID controller? I hope it's not RAID. Also make sure all drive models use CMR recording technology, not SMR.
Do you have regular SMART Long tests scheduled? You could also consider setting up and running the Multi-Report script that a lot of users run.
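If you ever want to kick a long test off by hand and check the result afterwards, it's just the following (a rough sketch, assuming sdf is still the right device letter):
smartctl -t long /dev/sdf       # start an extended self-test; it runs in the drive's background
smartctl -l selftest /dev/sdf   # check the self-test log once the recommended polling time has passed
smartctl -H /dev/sdf            # quick overall health verdict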
I can't find the details from when I ordered the HBA, but it's in the most do-nothing mode possible. IT Mode, I guess? I hadn't heard the term, but from googling it that sounds like what I'm using: each drive is presented directly to the OS. I'm not trying to deal with a hardware RAID headache; I'd rather let the software handle the RAID. I did learn enough before purchasing to know to get CMR drives. Long tests run every month. I'll check out that Multi-Report!
zpool status:
pool: fr33dan-raid
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: [Link Removed Again So Can Post]
scan: scrub repaired 0B in 11:26:32 with 1 errors on Sun Feb 2 11:27:06 2025
config:
NAME STATE READ WRITE CKSUM
fr33dan-raid DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
987c9bf6-56b2-4b30-9c12-55109955bb55 ONLINE 0 0 6
9f34b8c6-1841-4840-925e-3920fbe820de ONLINE 0 0 6
ea31863d-b313-4776-940e-be6f730a8116 ONLINE 0 0 6
48b88771-4980-4d67-acb2-36c98b09f69a ONLINE 0 0 6
4e3dd146-a7ac-4f6d-a7b6-036004da98b7 ONLINE 0 0 6
5bbcbc88-a406-4c6b-b391-a4202cebfc6b ONLINE 0 0 6
989c8c80-be08-4267-abc3-f5646dd06de7 FAULTED 28 0 6 too many errors
bace6a63-bc18-4657-a8bf-e7fdb05afa37 ONLINE 0 0 6
raidz1-1 ONLINE 0 0 0
59eab28c-20b1-4311-948e-4ee246067284 ONLINE 0 0 0
45a83ba4-6121-49c9-b79a-0be2f58b3082 ONLINE 0 0 0
e3f0750f-2388-47dd-811b-71e185fb1a15 ONLINE 0 0 0
513179e8-2bd2-482f-b2cc-b20f20ea3b95 ONLINE 0 0 0
Oh, right, the checksum errors. I'm pretty sure those are a different issue, but I guess I should explain if I'm going to claim that.
They have occurred since very early on, even when the drives were connected through a PCI add-in card that adds SATA ports. One of the things I store is an active copy of my entire Steam library, kept on an SMB share and managed by a Windows PC. These checksum errors sometimes show up when Steam updates a game while a scrub is running. I think it first occurred during my second regularly scheduled scrub (35-day threshold, Sundays only), and it happens about every third scrub or so. I've learned to ignore it, even though I probably shouldn't.
Here’s the output of that command anyway though:
root@Fr33dan-NAS[~]# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved
Adapter Selected is a LSI SAS: SAS2008(B2)
Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:03:00:00
SAS Address : 500605b-0-0628-c9d0
NVDATA Version (Default) : 14.01.00.07
NVDATA Version (Persistent) : 14.01.00.07
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9200-8e
BIOS Version : 07.39.02.00
UEFI BSD Version : 07.27.01.01
FCODE Version : N/A
Board Name : SAS9200-8e
Board Assembly : H3-25260-02C
Board Tracer Number : SP31519350
Finished Processing Commands Successfully.
Exiting SAS2Flash.
It certainly looks in-depth. At a glance it's very dense and intimidating, but that's just a formatting critique. The bit of it I've followed so far makes a lot of sense, and the content and actions it outlines seem useful. I'll take a deeper look later and let you know if I have more feedback.
I completely understand. I plan to break it up into a few more pages in the next version. But it should be as simple as answering the questions.
EDIT: Out of curiosity, how did you determine the drive to run the SMART test on? I try not to assume anything.
I also think you need to run a zpool scrub and, if you have no file issues, run a zpool clear, then monitor it. The flowchart should tell you that on the first chart.
Seems you're on the latest firmware. If you want, you can run a scrub and then zpool clear to clear out the errors. If they keep coming back on that drive, it might be worth reseating the connections… Issues still persist? Then yeah, I guess a swap is due.
zpool status shows only the drive UUID, but the web UI marks it as sdf.
I decided to pop the drive out to take a peek and reseat it. Then I cleared the pool and started a scrub. It pretty much immediately jumped to 58 errors and re-faulted/offlined the drive.
The replacement will be here Thursday. I got two, so even after this I'll have a cold spare on hand.
Geez, that was so easy. I forget about the easy stuff and go straight to the command line instead.
It is odd that the drive has thrown errors a second time when nothing is obviously wrong with it.
Before replacing the drive, let's prove it really is the drive. Please do not take the way I am addressing this as a sign that you are clueless. I do my best never to assume we are communicating correctly, so I try hard to minimize that kind of error. I have no idea what you know or think you know, or whether, if I say "do this", you would do it the same way I would, so it is best to spell out each step as if you do not know. I do assume you can get to a Shell window.
On the CLI (Shell Window) enter zpool status -v and then post the entire output (yes all of it) using the preformatted text style.
Next enter lsblk -o +PARTUUID,NAME,LABEL,SERIAL and post that output as well. This cross-references the drives to their partition UUIDs and includes the serial numbers (good stuff to have).
That gives us the data we need to start with. Now let's run through it:
In the zpool status -v output, does it list any files as corrupt? This is important. If it does, then you must delete those files before going any further; they are corrupt. Then run zpool scrub fr33dan-raid, let it finish, grab the zpool status output again, and make sure it does not list any corrupt files. If you still have corrupt files listed (not the same thing as Read/Write/Cksum errors), delete those files and scrub again. The goal is to have no corrupt files listed.
Once there are no corrupt files listed, then run zpool clear fr33dan-raid and then zpool status -v and ensure all the errors are gone.
Last step if we get this far… Run zpool scrub fr33dan-raid one last time and then check it once completed with zpool status -v.
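To recap those steps as plain commands you can paste one at a time (a sketch; wait for each step to look clean before moving on):
zpool status -v fr33dan-raid   # note and delete any corrupt files listed under "errors:"
zpool scrub fr33dan-raid       # let it finish, then check zpool status -v again
zpool clear fr33dan-raid       # only once no corrupt files remain
zpool scrub fr33dan-raid       # one last pass to confirm the errors stay gone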
Like I said, the values you report for the drive look very good. You could also post smartctl -x /dev/sdf (assuming sdf is still the drive identifier; it can and does change based on your hardware and which drive reports ready first, as best I understand it).
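If you're not sure sdf still points at the same disk, you can address it by serial number instead; the /dev/disk/by-id/ names are stable across reboots (a sketch, the exact name on your system may differ):
ls -l /dev/disk/by-id/ | grep WX82DA1PFA11
smartctl -x /dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WX82DA1PFA11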
As I mentioned, I already ran a zpool clear, so any errors I had are already cleared and zpool status isn't going to show much. Before I cleared it, the only reported error was a file in the active Steam installs; see post 9 in this thread. To add a bit more detail: at first I would always verify the game files in Steam with the intention of restoring any corrupted file(s), but it would report no errors. Then I'd do a zpool clear followed by a re-scrub that would come out fine. Eventually I started skipping the verify, which led to skipping the re-scrub, until I finally just stopped doing anything about it and a random zpool status would show 6 CKSUM errors.
I did run it with -v before the clear, though. There was only one actual file with an error, reported 3 times because it was the file plus its 2 snapshots (I intentionally keep a low snapshot count on this dataset because of its frequent updates, large size, and the relative unimportance of the data). It was a Rocket League file, a common culprit of this error because it gets updated all the time.
zpool status (with and without -v) shows basically the same as before, except with 58 errors now as mentioned, the scan line updated to reflect the scrub I started, and the boot pool info:
root@Fr33dan-NAS[~]# zpool status -v
pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:03:15 with 0 errors on Fri Feb 21 03:48:17 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sdm3 ONLINE 0 0 0
errors: No known data errors
pool: fr33dan-raid
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub in progress since Tue Feb 25 21:46:40 2025
7.45T / 30.3T scanned at 758M/s, 7.40T / 30.3T issued at 753M/s
1020K repaired, 24.41% done, 08:52:13 to go
config:
NAME STATE READ WRITE CKSUM
fr33dan-raid DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
987c9bf6-56b2-4b30-9c12-55109955bb55 ONLINE 0 0 0
9f34b8c6-1841-4840-925e-3920fbe820de ONLINE 0 0 0
ea31863d-b313-4776-940e-be6f730a8116 ONLINE 0 0 0
48b88771-4980-4d67-acb2-36c98b09f69a ONLINE 0 0 0
4e3dd146-a7ac-4f6d-a7b6-036004da98b7 ONLINE 0 0 0
5bbcbc88-a406-4c6b-b391-a4202cebfc6b ONLINE 0 0 0
989c8c80-be08-4267-abc3-f5646dd06de7 FAULTED 58 0 0 too many errors
bace6a63-bc18-4657-a8bf-e7fdb05afa37 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
59eab28c-20b1-4311-948e-4ee246067284 ONLINE 0 0 0
45a83ba4-6121-49c9-b79a-0be2f58b3082 ONLINE 0 0 0
e3f0750f-2388-47dd-811b-71e185fb1a15 ONLINE 0 0 0
513179e8-2bd2-482f-b2cc-b20f20ea3b95 ONLINE 0 0 0
errors: No known data errors
lsblk was a nice way to confirm which drive the UI was pointing at.
Oh! I guess when it was empty and I didn't know what that space was, I clicked the arrow to collapse it. Fantastic!
I still don't see any issues with the drive itself, so this is very strange. I try to look at it from a manufacturer's perspective: what are the drive's failing values? Replacing the drive may fix the issue, but what is the issue?
You said you have had these problems for a very long time, but I didn't see you state that it was always specifically this one drive (or I missed it). I assume that is the case, but as I said before, I try not to make assumptions. It is unlikely to be the HBA, since you had this problem before the HBA.
You really have an odd problem, and I am trying to figure out what the cause is and prove it. One reason I am digging into this is that, being the author of Multi-Report, if there is a different failure value to check, I'd like to include it in the script.
Can you post the output of smartctl -x /dev/sdf? It shows additional data about the drive. This is presenting more like an SMR drive, but you said you purchased the drive knowing about that issue, and I verified that the provided drive part number is in fact CMR, so that isn't it. Again, there are no drive-specific errors, so the -x output may identify an issue, but I'm not holding my breath.
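If you want to double-check the model and serial of every drive in the shelf in one pass, a small loop does it (a sketch; adjust the glob if your device names go past single letters):
for d in /dev/sd?; do
  echo "== $d =="
  smartctl -i "$d" | grep -E 'Device Model|Serial Number|Rotation Rate'
done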
If that doesn’t show anything obvious, you can try this:
Enter smartctl -l scterc /dev/sdf and it will tell you your current SCT values. I expect it to tell you that they are set to Read and Write values of 70 (7.0 seconds) since WD Red drives have this set by default.
If they are "Disabled", then enter smartctl -l scterc,70,70 /dev/sdf to enable the built-in SCT error recovery. Then run the first command again to ensure the drive retained the values.
If you desire, you can increase the write value to 100 (10.0 seconds), as it appears the write operation is the problem (corrupt data).
This setting is normally there to keep a drive from dropping out of the RAIDZ, so I have no idea if it will help here.
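For reference, on many drives the SCT ERC setting does not survive a power cycle, so if you do change it, re-check it after the next reboot. Applying it to every drive in one pass is just a small loop (a sketch, assuming single-letter sdX names; drives that don't support SCT ERC will simply report an error):
for d in /dev/sd?; do
  smartctl -l scterc,70,70 "$d"    # 7.0 second read/write error recovery timeouts
done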
If none of that works, I really hope replacing the drive solves the problem so you can put this all behind you.
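When the replacement arrives, the swap itself is straightforward. The SCALE web UI has a Replace option under the pool's device list that handles the partitioning for you; the CLI equivalent is roughly the following (a sketch; the new device name is a placeholder):
zpool offline fr33dan-raid 989c8c80-be08-4267-abc3-f5646dd06de7   # the faulted member
zpool replace fr33dan-raid 989c8c80-be08-4267-abc3-f5646dd06de7 /dev/sdX
zpool status fr33dan-raid                                         # watch the resilver progress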
On to a slightly different topic: once you have the problem fixed, do you have any plans to back up all your data and recreate the pool as a single RAIDZ2 or RAIDZ3, or some other layout, to adjust (I can't say fix) the pool and VDEVs? Maybe you like it this way, which is perfectly fine as well, and I would not do this until after the issue at hand is fixed.
So there are (what I think are) two separate issues that you may be mentally processing as one. The checksum errors are a longstanding issue that predates the HBA. I've always attributed them to some kind of race condition where Steam updates a file in the middle of the scrub checking that specific file.
The read errors are new and are what produced the faulted drive message.
Here is the extended SMART output. This does show some errors, but I don't know what to make of the error info it's giving me:
root@Fr33dan-NAS[~]# smartctl -x /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFZX-68AWUN0
Serial Number: WD-WX82DA1PFA11
LU WWN Device Id: 5 0014ee 214ce1f73
Firmware Version: 81.00B81
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Feb 26 10:54:19 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (43440) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 462) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 218 218 021 - 4083
4 Start_Stop_Count -O--CK 100 100 000 - 57
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 065 065 000 - 25636
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 56
192 Power-Off_Retract_Count -O--CK 200 200 000 - 31
193 Load_Cycle_Count -O--CK 200 200 000 - 88
194 Temperature_Celsius -O---K 125 105 000 - 25
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 78 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 17
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 17 [16] occurred at disk power-on lifetime: 25623 hours (1067 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 01 4d 5c 24 08 40 00 Error: UNC at LBA = 0x14d5c2408 = 5592851464
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 07 f8 00 00 00 01 4d 5c 24 08 40 00 00:12:22.959 READ FPDMA QUEUED
60 07 f8 00 00 00 01 4d 5c 1c 10 40 00 00:12:22.951 READ FPDMA QUEUED
60 07 e8 00 00 00 01 4d 5c 14 28 40 00 00:12:22.944 READ FPDMA QUEUED
60 08 00 00 30 00 01 4d 5c 0c 28 40 00 00:12:22.937 READ FPDMA QUEUED
60 00 28 00 08 00 01 c8 2b 35 98 40 00 00:12:22.937 READ FPDMA QUEUED
Error 16 [15] occurred at disk power-on lifetime: 25609 hours (1067 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 2b aa f9 80 40 00 Error: UNC at LBA = 0x2baaf980 = 732625280
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 2b aa f9 80 40 00 14d+09:17:37.749 READ FPDMA QUEUED
60 00 28 00 00 00 00 ec d8 40 40 40 00 14d+09:17:37.605 READ FPDMA QUEUED
60 00 28 00 00 00 01 49 a2 5f 98 40 00 14d+09:17:37.499 READ FPDMA QUEUED
60 00 28 00 00 00 01 5c b2 52 48 40 00 14d+09:17:37.324 READ FPDMA QUEUED
60 00 28 00 00 00 01 5c b2 52 20 40 00 14d+09:17:37.324 READ FPDMA QUEUED
Error 15 [14] occurred at disk power-on lifetime: 25608 hours (1067 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 01 c7 b1 f4 d0 40 00 Error: UNC at LBA = 0x1c7b1f4d0 = 7645295824
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 08 00 01 c7 b1 f4 d0 40 00 14d+09:10:38.909 READ FPDMA QUEUED
60 01 58 00 00 00 01 c7 b1 a8 90 40 00 14d+09:10:38.909 READ FPDMA QUEUED
60 01 30 00 10 00 01 c7 b1 a6 f0 40 00 14d+09:10:38.906 READ FPDMA QUEUED
60 01 08 00 00 00 01 c7 b1 a5 98 40 00 14d+09:10:38.906 READ FPDMA QUEUED
60 00 28 00 10 00 01 c7 b1 a6 08 40 00 14d+09:10:38.899 READ FPDMA QUEUED
Error 14 [13] occurred at disk power-on lifetime: 25607 hours (1066 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 01 69 cd e6 d8 40 00 Error: UNC at LBA = 0x169cde6d8 = 6070068952
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 08 00 01 69 cd e6 d8 40 00 14d+07:45:14.083 READ FPDMA QUEUED
60 00 20 00 00 00 01 69 cd e5 b8 40 00 14d+07:45:14.083 READ FPDMA QUEUED
60 00 28 00 00 00 01 69 cd e5 90 40 00 14d+07:45:14.083 READ FPDMA QUEUED
ea 00 00 00 00 00 00 00 00 00 00 00 00 14d+07:45:13.991 FLUSH CACHE EXT
61 00 18 00 00 00 00 af 8f 59 58 40 00 14d+07:45:13.991 WRITE FPDMA QUEUED
Error 13 [12] occurred at disk power-on lifetime: 25603 hours (1066 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 74 d0 3b 20 40 00 Error: UNC at LBA = 0x74d03b20 = 1959803680
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 28 00 00 00 00 74 d0 3b 20 40 00 14d+04:09:04.694 READ FPDMA QUEUED
60 00 28 00 00 00 01 3f a3 a5 d8 40 00 14d+04:09:04.622 READ FPDMA QUEUED
60 00 28 00 00 00 01 3f a3 54 40 40 00 14d+04:09:04.608 READ FPDMA QUEUED
60 00 28 00 00 00 01 a6 3e f9 f0 40 00 14d+04:09:04.537 READ FPDMA QUEUED
60 00 28 00 08 00 01 a6 3e f9 80 40 00 14d+04:09:04.537 READ FPDMA QUEUED
Error 12 [11] occurred at disk power-on lifetime: 25603 hours (1066 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 a3 19 c6 b0 40 00 Error: UNC at LBA = 0xa319c6b0 = 2736375472
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 a3 19 c6 b0 40 00 14d+03:30:09.407 READ FPDMA QUEUED
60 00 28 00 00 00 00 a3 19 c6 88 40 00 14d+03:30:09.249 READ FPDMA QUEUED
60 00 20 00 00 00 00 e5 ca 91 70 40 00 14d+03:30:09.170 READ FPDMA QUEUED
60 00 28 00 08 00 01 8a 7f b3 a8 40 00 14d+03:30:08.869 READ FPDMA QUEUED
60 00 20 00 00 00 01 8a 7f b5 00 40 00 14d+03:30:08.869 READ FPDMA QUEUED
Error 11 [10] occurred at disk power-on lifetime: 25601 hours (1066 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 6c c3 57 90 40 00 Error: UNC at LBA = 0x6cc35790 = 1824741264
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 6c c3 57 90 40 00 14d+02:05:24.445 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 57 40 40 00 14d+02:05:24.439 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 57 68 40 00 14d+02:05:24.433 READ FPDMA QUEUED
60 00 20 00 00 00 00 6c c3 56 f8 40 00 14d+02:05:24.420 READ FPDMA QUEUED
60 00 28 00 00 00 00 6c c3 56 d0 40 00 14d+02:05:24.413 READ FPDMA QUEUED
Error 10 [9] occurred at disk power-on lifetime: 25599 hours (1066 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 42 7e d6 08 40 00 Error: UNC at LBA = 0x427ed608 = 1115608584
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 00 00 00 42 7e d6 08 40 00 13d+23:55:21.329 READ FPDMA QUEUED
60 00 28 00 00 00 01 94 30 10 f8 40 00 13d+23:55:21.322 READ FPDMA QUEUED
60 00 28 00 00 00 00 db 3f bf 48 40 00 13d+23:55:21.313 READ FPDMA QUEUED
60 00 28 00 00 00 01 94 30 10 10 40 00 13d+23:55:21.166 READ FPDMA QUEUED
60 00 28 00 00 00 01 ae 2d ef b8 40 00 13d+23:55:21.152 READ FPDMA QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 25625 -
# 2 Extended offline Completed without error 00% 25623 -
# 3 Short offline Completed without error 00% 25480 -
# 4 Short offline Completed without error 00% 25312 -
# 5 Extended offline Completed without error 00% 25282 -
# 6 Short offline Completed without error 00% 25144 -
# 7 Short offline Completed without error 00% 24976 -
# 8 Short offline Completed without error 00% 24808 -
# 9 Short offline Completed without error 00% 24640 -
#10 Extended offline Completed without error 00% 24539 -
#11 Short offline Completed without error 00% 24473 -
#12 Short offline Completed without error 00% 24305 -
#13 Short offline Completed without error 00% 24137 -
#14 Short offline Completed without error 00% 23969 -
#15 Short offline Completed without error 00% 23801 -
#16 Extended offline Completed without error 00% 23796 -
#17 Short offline Completed without error 00% 23634 -
#18 Short offline Completed without error 00% 23466 -
Selective Self-tests/Logging not supported
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
Device State: Active (0)
Current Temperature: 25 Celsius
Power Cycle Min/Max Temperature: 24/28 Celsius
Lifetime Min/Max Temperature: 22/45 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/65 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (295)
Index Estimated Time Temperature Celsius
296 2025-02-26 02:57 24 *****
... ..(144 skipped). .. *****
441 2025-02-26 05:22 24 *****
442 2025-02-26 05:23 25 ******
... ..( 46 skipped). .. ******
11 2025-02-26 06:10 25 ******
12 2025-02-26 06:11 26 *******
... ..( 94 skipped). .. *******
107 2025-02-26 07:46 26 *******
108 2025-02-26 07:47 25 ******
... ..(181 skipped). .. ******
290 2025-02-26 10:49 25 ******
291 2025-02-26 10:50 24 *****
... ..( 3 skipped). .. *****
295 2025-02-26 10:54 24 *****
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 56 --- Lifetime Power-On Resets
0x01 0x010 4 25636 --- Power-on Hours
0x01 0x018 6 40795032900 --- Logical Sectors Written
0x01 0x020 6 1534462615 --- Number of Write Commands
0x01 0x028 6 253134394217 --- Logical Sectors Read
0x01 0x030 6 1956678072 --- Number of Read Commands
0x01 0x038 6 2095286784 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 25515 --- Spindle Motor Power-on Hours
0x03 0x010 4 25490 --- Head Flying Hours
0x03 0x018 4 120 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 31 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 17 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 26 --- Current Temperature
0x05 0x010 1 27 --- Average Short Term Temperature
0x05 0x018 1 26 --- Average Long Term Temperature
0x05 0x020 1 44 --- Highest Temperature
0x05 0x028 1 22 --- Lowest Temperature
0x05 0x030 1 43 --- Highest Average Short Term Temperature
0x05 0x038 1 24 --- Lowest Average Short Term Temperature
0x05 0x040 1 38 --- Highest Average Long Term Temperature
0x05 0x048 1 25 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 563 --- Number of Hardware Resets
0x06 0x010 4 179 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 1 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 47593 Vendor specific
Only if you attribute the checksum errors to the drive. If they are separate things, then the only errors presenting that relate specifically to the drive are the read errors. smartctl -l scterc /dev/sdf shows the expected default of 70.
I've been thinking of reorganizing the pool for a while now. Everything is already backed up in AWS Glacier Deep Archive, so restoring would not be quick, and I've been putting it off for at least 8 months (the backup was intended as a failsafe, not transitional storage). It's more data than I could reasonably obtain secondary storage for locally; honestly I may never get to it. My highest priority is expandability: when my pool gets full I want a way to add more drives and extend the existing organization, rather than add new spaces. When I set it up I stupidly thought I could add drives to a VDEV or switch to RAIDZ2/3 later, and thus that the best course was one giant VDEV.
Now, what I think would be ideal for me is doing everything in 4-drive RAIDZ1 VDEVs (a decision made before the first expansion). I think the risk of losing 2 out of the 4 drives in a given VDEV is acceptably low, and it would suit my requirement of being expandable.
The pool is at 72% capacity, so I'm gonna need another expansion soon; this discussion is actually prescient and I'm open to suggestions.
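For what it's worth, the way I understand the next expansion would work is roughly this (a sketch; the device names are placeholders, and I'd actually do it through the Add VDEV flow in the web UI rather than raw commands so the partitioning is handled for me):
zpool add fr33dan-raid raidz1 /dev/sdX /dev/sdY /dev/sdZ /dev/sdW
zpool status fr33dan-raid      # the new raidz1-2 VDEV should show up alongside the others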
The extended data tells me that you have had 17 Uncorrectable Read Errors for the lifetime of the drive.
0x04 0x008 4 17 --- Number of Reported Uncorrectable Errors
I can’t explain why these were not echoed in the ID 198 value.
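For anyone following along, those two pieces come from specific SMART logs, and you can pull just them rather than the full -x dump (standard smartctl options, assuming sdf is still the device):
smartctl -l devstat /dev/sdf    # Device Statistics (GP Log 0x04), including Reported Uncorrectable Errors
smartctl -l xerror /dev/sdf     # Extended Comprehensive SMART Error Log with the UNC entries above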
The code basically says that head #16 (always this head) is hitting an uncorrectable bit while reading many different sectors on the disk; it is not confined to a specific area. But there are not actually 16 physical heads; that is a holdover from the old days of 16 logical heads.
So what I am saying is, it does look like your drive is bad, but I still don't understand why it did not show up in the standard data section.
As for redoing your pool, you could think about backing up just the data you absolutely need on hand, then rebuilding, copying that data back to make your system operational, and finally starting the very long restoration of all the other data from the off-site location. That is the only easy option I can see when someone has a huge amount of data.
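If you go that route, the copy-out/copy-back of the must-have datasets can be done with snapshots and zfs send/receive. A minimal sketch, assuming a hypothetical dataset name and a temporary pool called temppool on whatever spare storage you can scrounge:
zfs snapshot -r fr33dan-raid/important@pre-rebuild
zfs send -R fr33dan-raid/important@pre-rebuild | zfs receive -u temppool/important
# destroy and recreate fr33dan-raid with the new layout, then reverse the direction:
zfs send -R temppool/important@pre-rebuild | zfs receive -u fr33dan-raid/important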