The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.
impact: Fault tolerance of the pool may be compromised.
eid: 2566
class: statechange
state: FAULTED
host: corellia
time: 2024-06-10 00:06:51+0300
vpath: /dev/sdi3
vguid: 0xA3827F82C74B2AA2
pool: boot-pool (0x88B45C7FF4BBA0E9)
The SSD seems to be completely unavailable. smartctl cannot open the device.
I pulled it from the NAS and connected it to my laptop via an external SATA-to-USB enclosure.
I ran SMART and badblocks tests (I know badblocks does not make sense for an SSD) and everything was fine.
I moved it back to the NAS and all went well until today, July 26th, when I received this error:
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.
impact: Fault tolerance of the pool may be compromised.
eid: 3371
class: statechange
state: FAULTED
host: corellia
time: 2024-07-26 10:23:46+0300
vpath: /dev/sdj3
vguid: 0xCF08906F09FA2103
pool: boot-pool (0x88B45C7FF4BBA0E9)
This is a Crucial SSD used for my mirrored boot pool.
Model Family: Crucial/Micron Client SSDs
Device Model: CT120BX500SSD1
That SSD often reports high temperatures, but I searched and read that this could be a firmware issue.
So, is it broken? Should I just throw it away and shop for something else, or could a bug somewhere be causing this corruption? I am leaning towards throwing it away and not buying Crucial SSDs again.
This happened during the update from 24.04.1.1 to 24.04.2 (bad timing but I have a mirrored boot pool).
Run smartctl -t short /dev/sdX where “X” is the drive letter. In the above listing you have sdi and sdj?
Wait 3 minutes, then post the output of smartctl -x /dev/sdX. We will see what the data shows.
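For anyone who wants to script those two steps, here is a minimal sketch. The function names and the injectable `runner`/`sleep` parameters are hypothetical conveniences, `/dev/sdX` is still a placeholder for your actual device, and the 180-second wait simply mirrors the 3 minutes suggested above.

```python
import subprocess
import time


def smart_commands(device):
    """Return the two smartctl invocations from the post, as argument lists."""
    start_short_test = ["smartctl", "-t", "short", device]
    full_report = ["smartctl", "-x", device]
    return start_short_test, full_report


def short_test_then_report(device, wait_seconds=180,
                           runner=subprocess.run, sleep=time.sleep):
    """Start a short self-test, wait, then return the `smartctl -x` text.

    `runner` and `sleep` are injectable only so the sketch can be exercised
    without a real drive; they are not part of any smartctl tooling.
    """
    start_short_test, full_report = smart_commands(device)
    # smartctl's exit status is a bitmask, so a non-zero code is not
    # necessarily a failure; that is why check=False is used here.
    runner(start_short_test, check=False)
    sleep(wait_seconds)
    result = runner(full_report, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `short_test_then_report("/dev/sdX")` as root would return the text to paste into the thread.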
Yes, I have personally experienced a Crucial SSD failure due to firmware but that was a very long time ago. You can check the firmware versions but hold off until we see that smart data.
The issue does not initially sound like the drive is dying. High drive temps could be the problem if those are valid.
If you have a lot of UDMA_CRC_ERRORS then odds are it is a SATA cable issue.
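A rough way to pull the CRC counter out of pasted smartctl output is sketched below. The helper names are made up for illustration, and the parser assumes the whitespace-separated attribute rows that `smartctl -x` prints (ID, name, flags, value, worst, thresh, fail, raw value). It only reports the count; how many errors count as "a lot" is left to the reader.

```python
def parse_attribute_line(line):
    """Parse one SMART attribute row into a dict, or return None."""
    fields = line.split(None, 7)  # at most 8 fields; the raw value keeps its spaces
    if len(fields) < 8 or not fields[0].isdigit():
        return None
    attr_id, name, flags, value, worst, thresh, fail, raw = fields
    try:
        return {
            "id": int(attr_id),
            "name": name,
            "flags": flags,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "fail": fail,
            "raw": raw.strip(),
        }
    except ValueError:
        # Not an attribute row after all (some other numeric log line).
        return None


def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count, or None if the row is absent."""
    for line in smartctl_output.splitlines():
        row = parse_attribute_line(line)
        if row and row["name"] == "UDMA_CRC_Error_Count":
            return int(row["raw"].split()[0])
    return None
```

A count that keeps climbing between two runs is usually more telling than the absolute number, since the counter never resets.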
SMART data shows the drive is 4 years old. Reallocated event count: 0. Percent lifetime remaining: 98%.
Total_LBAs_Written = 1807575503; if I'm not mistaken, that's in gigs. Was this drive used for something other than running TrueNAS? The read and write counts are very high. I would go looking for a new drive.
You upgraded from 24.04.1.1, so make 24.04.1.1 the active boot environment. Reboot and see if the problem persists. This can tell you whether the hardware is good or not.
Assuming the problem returns…
How is the SSD connected to your machine? If using the HBA, try the onboard SATA port.
If you are already using an onboard SATA port, my suggestion is to make a backup of your TrueNAS configuration file and put it in a safe place. Then wipe both boot-pool drives and reinstall TrueNAS 24.04.2 from the ISO image.
Those are my recommended best steps based on the information provided.
Total_LBAs_Written is in LBAs (logical sectors), of course, not gigabytes. Oh, how I wish drive manufacturers would stick with a standard. In that case, that's not a lot of data written.
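To put the raw value of 1807575503 in perspective, here is a quick sketch that converts it under two readings: as plain bytes, and as logical sectors. The 512-byte sector size is an assumption (smartctl reports the real figure on its "Sector Size" line), and the constant and function names are only illustrative.

```python
RAW_TOTAL_LBAS_WRITTEN = 1807575503  # raw value quoted in the thread
ASSUMED_SECTOR_SIZE = 512            # bytes per LBA; an assumption


def to_decimal_gb(num_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 1_000_000_000


as_bytes_gb = to_decimal_gb(RAW_TOTAL_LBAS_WRITTEN)
as_sectors_gb = to_decimal_gb(RAW_TOTAL_LBAS_WRITTEN * ASSUMED_SECTOR_SIZE)

print(f"read as bytes:   {as_bytes_gb:.1f} GB")    # 1.8 GB
print(f"read as sectors: {as_sectors_gb:.1f} GB")  # 925.5 GB
```

Either way it is a modest amount for a drive that has been running for years, whereas reading the value as gigabytes would mean about 1.8 exabytes, which is not plausible for a 120 GB SSD.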
The Intel DC series drives are very reliable, and the S3520 is after the shift to 3D NAND, so I imagine you’ll be quite pleased with the years of boring service you’ll get from it.
For those following along, the Crucial BX series has an aggressive wear-leveling feature in its firmware that also causes it to throw false positives for the “Current Pending Sectors” SMART value whenever it does a wear-leveling pass. The combination of these two makes it an overall sub-par candidate for use in your TrueNAS system, even as a boot device.
If you’ve still got the BX500 drive @MSameer I’d be interested in a full smartctl -x dump of it, if you’re willing to share either publicly or via DM.
Thank you for the reassurance. I just love boring hardware which you never hear from.
The only downside is that TN does not come with the Intel SSD tools, so I cannot “shrink” the usable space, but I don’t think that matters much for a boot drive.
Here goes (it’s connected via an ASM1051E SATA-to-USB bridge; I can connect it directly if you wish):
vader:/tmp# smartctl -x /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.6-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT120BX500SSD1
Serial Number: 1919E180FFCC
LU WWN Device Id: 0 000000 000000000
Firmware Version: M6CR013
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5610
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Oct 24 10:30:28 2024 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   050    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   050    -    45463
 12 Power_Cycle_Count       -O--CK   100   100   050    -    58
171 Program_Fail_Count      -O--CK   100   100   050    -    0
172 Erase_Fail_Count        -O--CK   100   100   050    -    0
173 Ave_Block-Erase_Count   -O--CK   100   100   050    -    62
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   050    -    31
180 Unused_Reserve_NAND_Blk -O--CK   100   100   050    -    100
183 SATA_Interfac_Downshift -O--CK   100   100   050    -    0
184 Error_Correction_Count  -O--CK   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   050    -    0
194 Temperature_Celsius     -O---K   068   031   050    Past 32 (Min/Max 29/69)
196 Reallocated_Event_Count -O--CK   100   100   050    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   050    -    0
198 Offline_Uncorrectable   ----CK   100   100   050    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    2
202 Percent_Lifetime_Remain ----CK   096   096   001    -    96
206 Write_Error_Rate        -OSR-K   100   100   050    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   050    -    0
246 Total_LBAs_Written      -O--CK   100   100   050    -    2612922986
247 Host_Program_Page_Count -O--CK   100   100   050    -    81653843
248 FTL_Program_Page_Count  -O--CK   100   100   050    -    246685256
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 1 Comprehensive SMART error log
0x03 GPL R/O 1 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x24 GPL R/O 88 Current Device Internal Status Data log
0x25 GPL R/O 32 Saved Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 2
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 [1] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
04 -- 51 00 00 00 00 00 00 00 00 40 00 Error: ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 08 00 38 00 00 09 00 10 30 00 00 00:00:00.000 WRITE FPDMA QUEUED
61 00 08 00 40 00 00 0b 00 10 30 00 00 00:00:00.000 WRITE FPDMA QUEUED
61 00 08 00 48 00 00 47 00 f9 30 00 00 00:00:00.000 WRITE FPDMA QUEUED
61 00 08 00 b8 00 00 3c 00 00 a8 00 00 00:00:00.000 WRITE FPDMA QUEUED
61 00 08 00 b8 00 00 3c 00 00 a8 00 00 00:00:00.000 WRITE FPDMA QUEUED
Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 45441 -
# 2 Short offline Completed without error 00% 45417 -
# 3 Short offline Completed without error 00% 45393 -
# 4 Short offline Completed without error 00% 45369 -
# 5 Extended offline Completed without error 00% 45346 -
# 6 Short offline Completed without error 00% 45322 -
# 7 Short offline Completed without error 00% 45298 -
# 8 Short offline Completed without error 00% 45274 -
# 9 Short offline Completed without error 00% 45250 -
#10 Short offline Completed without error 00% 45226 -
#11 Short offline Completed without error 00% 45202 -
#12 Extended offline Completed without error 00% 45178 -
#13 Short offline Completed without error 00% 45154 -
#14 Short offline Completed without error 00% 45130 -
#15 Short offline Completed without error 00% 45107 -
#16 Short offline Completed without error 00% 45083 -
#17 Short offline Completed without error 00% 45059 -
#18 Short offline Completed without error 00% 45035 -
#19 Extended offline Completed without error 00% 45011 -
Selective Self-tests/Logging not supported
SCT Commands not supported
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 58 --- Lifetime Power-On Resets
0x01 0x010 4 45463 --- Power-on Hours
0x01 0x018 6 2612922986 --- Logical Sectors Written
0x01 0x020 6 72819815 --- Number of Write Commands
0x01 0x028 6 1651517831 --- Logical Sectors Read
0x01 0x030 6 52176492 --- Number of Read Commands
0x07 ===== = = === == Solid State Device Statistics (rev 1) ==
0x07 0x008 1 4 --- Percentage Used Endurance Indicator
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x000a 4 1 Device-to-host register FISes sent due to a COMRESET
Nothing really strikes my attention there but I am not an expert.
We actually do have provisioning available through the disk_resize command (e.g. disk_resize sdX 16G), but it would be rather difficult (impossible) to do on a live disk and have the data remain intact. You’d have to do this from a different boot device and then reinstall to the S3520.
This is the part I’m interested in. The flash translation layer says it’s done ~3x the number of page programming operations that were sent to the disk, probably from the SLC caching behavior, but it hasn’t logged anything much more intense than that for the wear leveling. Total writes look like they’re in the neighborhood of 1.3 TB, so it’s a pretty lightly used device overall.
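For anyone who wants to check those two figures, here is a small sketch using the raw values from the smartctl dump above. The 512-byte sector size comes from the "Sector Size" line of the same dump, the names are only illustrative, and the page-count ratio is a crude estimate rather than a vendor-defined write amplification figure.

```python
HOST_PROGRAM_PAGE_COUNT = 81653843    # attribute 247
FTL_PROGRAM_PAGE_COUNT = 246685256    # attribute 248
LOGICAL_SECTORS_WRITTEN = 2612922986  # attribute 246 / Device Statistics
SECTOR_SIZE_BYTES = 512               # from the "Sector Size" line


def program_ratio(ftl_pages, host_pages):
    """Pages programmed by the FTL per page the host asked for."""
    return ftl_pages / host_pages


def bytes_written(sectors, sector_size):
    """Total host writes in bytes."""
    return sectors * sector_size


ratio = program_ratio(FTL_PROGRAM_PAGE_COUNT, HOST_PROGRAM_PAGE_COUNT)
total = bytes_written(LOGICAL_SECTORS_WRITTEN, SECTOR_SIZE_BYTES)

print(f"FTL/host page ratio: {ratio:.2f}")          # 3.02
print(f"host writes: {total / 1e12:.2f} TB")        # 1.34 TB
```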