Is my SSD dead? The number of I/O errors exceeded acceptable levels

Thank you @joeschmuck for responding.

I rebooted the NAS again. The SSD is back up and the pool was resilvered, but I have 3 checksum errors.

I don’t have any. Too bad, as it could have been an easy fix 🙂

TrueNAS keeps shuffling stuff around. No idea yet how to stop that from happening.

# smartctl -x /dev/sdh
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT120BX500SSD1
Serial Number:    1919E180FFCC
LU WWN Device Id: 0 000000 000000000
Firmware Version: M6CR013
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Jul 26 13:08:38 2024 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   050    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   050    -    43329
 12 Power_Cycle_Count       -O--CK   100   100   050    -    44
171 Program_Fail_Count      -O--CK   100   100   050    -    0
172 Erase_Fail_Count        -O--CK   100   100   050    -    0
173 Ave_Block-Erase_Count   -O--CK   100   100   050    -    35
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   050    -    25
180 Unused_Reserve_NAND_Blk -O--CK   100   100   050    -    100
183 SATA_Interfac_Downshift -O--CK   100   100   050    -    0
184 Error_Correction_Count  -O--CK   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   050    -    0
194 Temperature_Celsius     -O---K   059   031   050    Past 41 (Min/Max 29/69)
196 Reallocated_Event_Count -O--CK   100   100   050    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   050    -    0
198 Offline_Uncorrectable   ----CK   100   100   050    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    2
202 Percent_Lifetime_Remain ----CK   098   098   001    -    98
206 Write_Error_Rate        -OSR-K   100   100   050    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   050    -    0
246 Total_LBAs_Written      -O--CK   100   100   050    -    1807575503
247 Host_Program_Page_Count -O--CK   100   100   050    -    56486734
248 FTL_Program_Page_Count  -O--CK   100   100   050    -    97475992
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O     88  Current Device Internal Status Data log
0x25       GPL     R/O     32  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 2
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 [1] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 00 00 00 00 00 00 00 40 00  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 38 00 00 09 00 10 30 00 00     00:00:00.000  WRITE FPDMA QUEUED
  61 00 08 00 40 00 00 0b 00 10 30 00 00     00:00:00.000  WRITE FPDMA QUEUED
  61 00 08 00 48 00 00 47 00 f9 30 00 00     00:00:00.000  WRITE FPDMA QUEUED
  61 00 08 00 b8 00 00 3c 00 00 a8 00 00     00:00:00.000  WRITE FPDMA QUEUED
  61 00 08 00 b8 00 00 3c 00 00 a8 00 00     00:00:00.000  WRITE FPDMA QUEUED

Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     43329         -
# 2  Short offline       Interrupted (host reset)      90%     43329         -
# 3  Short offline       Aborted by host               00%     43329         -
# 4  Short offline       Completed without error       00%     43329         -
# 5  Short offline       Completed without error       00%     43319         -
# 6  Short offline       Completed without error       00%     43295         -
# 7  Short offline       Completed without error       00%     43271         -
# 8  Short offline       Completed without error       00%     43248         -
# 9  Short offline       Completed without error       00%     43224         -
#10  Short offline       Completed without error       00%     43200         -
#11  Extended offline    Completed without error       00%     43176         -
#12  Short offline       Completed without error       00%     43152         -
#13  Short offline       Completed without error       00%     43128         -
#14  Short offline       Completed without error       00%     43104         -
#15  Short offline       Completed without error       00%     43080         -
#16  Short offline       Completed without error       00%     43056         -
#17  Short offline       Completed without error       00%     43032         -
#18  Extended offline    Completed without error       00%     43009         -
#19  Short offline       Completed without error       00%     42985         -

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              44  ---  Lifetime Power-On Resets
0x01  0x010  4           43329  ---  Power-on Hours
0x01  0x018  6      1807575503  ---  Logical Sectors Written
0x01  0x020  6        42220435  ---  Number of Write Commands
0x01  0x028  6      1563041600  ---  Logical Sectors Read
0x01  0x030  6        51282609  ---  Number of Read Commands
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               2  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            2  Command failed due to ICRC error
0x0002  4            1  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4            3  Device-to-host register FISes sent due to a COMRESET
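
For anyone who wants to scan the attribute table above without reading every row, here is a rough Python sketch. It assumes the row layout of the `smartctl -x` output pasted in this post; the function names and the choice of "link-related" attributes are my own and not part of smartmontools. The sample rows are copied verbatim from the table.

```python
import re

# A few attribute rows copied verbatim from the smartctl -x output above.
ATTRS = """\
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
183 SATA_Interfac_Downshift -O--CK   100   100   050    -    0
194 Temperature_Celsius     -O---K   059   031   050    Past 41 (Min/Max 29/69)
199 UDMA_CRC_Error_Count    -O--CK   100   100   050    -    2
"""

# ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, first integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

# Assumption: these two attributes are the ones that point at the SATA link
# (cable, port, backplane) rather than at the flash itself.
LINK_ATTRS = ("SATA_Interfac_Downshift", "UDMA_CRC_Error_Count")


def parse_attributes(text):
    """Return {attribute name: parsed columns} for every row that matches."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        attr_id, name, _flags, value, worst, thresh, fail, raw = m.groups()
        rows[name] = {
            "id": int(attr_id),
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "fail": fail,
            "raw": int(raw),
        }
    return rows


def failed_ever(rows):
    """Attributes whose FAIL column is not '-' (for example 'Past' or 'NOW')."""
    return [name for name, r in rows.items() if r["fail"] != "-"]


def nonzero_link_counters(rows):
    """Link-related attributes with a nonzero raw count."""
    return [n for n in LINK_ATTRS if n in rows and rows[n]["raw"] != 0]


if __name__ == "__main__":
    parsed = parse_attributes(ATTRS)
    print("FAIL column set:", failed_ever(parsed))
    print("Nonzero link counters:", nonzero_link_counters(parsed))
```

On the rows above this reports Temperature_Celsius as the only attribute with the FAIL column set (its worst value 031 is below the threshold 050) and UDMA_CRC_Error_Count as the only nonzero link counter.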

I ran the SMART short self-test 3 times. The 1st was aborted by mistake. The 2nd was aborted by the host, and I found this in dmesg:

[  531.650234] ata6.00: exception Emask 0x0 SAct 0x4080 SErr 0x0 action 0x6 frozen
[  531.651254] ata6.00: failed command: WRITE FPDMA QUEUED
[  531.652284] ata6.00: cmd 61/50:38:40:b2:d0/00:00:08:00:00/40 tag 7 ncq dma 40960 out
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  531.654346] ata6.00: status: { DRDY }
[  531.655331] ata6.00: failed command: WRITE FPDMA QUEUED
[  531.656376] ata6.00: cmd 61/88:70:18:66:70/00:00:08:00:00/40 tag 14 ncq dma 69632 out
                        res 40/00:01:04:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  531.658559] ata6.00: status: { DRDY }
[  531.659629] ata6: hard resetting link
[  531.974015] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  531.994890] ata6.00: configured for UDMA/133
[  531.995192] ata6: EH complete
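
To make the dmesg excerpt above a bit easier to read, here is a small Python sketch that counts the events in it and decodes the `SAct` value. `SAct` is a bitmask of the NCQ tags that were outstanding when the port froze, so `0x4080` (bits 7 and 14) should line up with the `tag 7` and `tag 14` lines. The function names are made up for this post, and the matching is plain string and regex work on the text as pasted, nothing official from libata.

```python
import re

# The dmesg lines quoted above, copied as pasted.
DMESG = """\
[  531.650234] ata6.00: exception Emask 0x0 SAct 0x4080 SErr 0x0 action 0x6 frozen
[  531.651254] ata6.00: failed command: WRITE FPDMA QUEUED
[  531.652284] ata6.00: cmd 61/50:38:40:b2:d0/00:00:08:00:00/40 tag 7 ncq dma 40960 out
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  531.654346] ata6.00: status: { DRDY }
[  531.655331] ata6.00: failed command: WRITE FPDMA QUEUED
[  531.656376] ata6.00: cmd 61/88:70:18:66:70/00:00:08:00:00/40 tag 14 ncq dma 69632 out
                        res 40/00:01:04:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  531.658559] ata6.00: status: { DRDY }
[  531.659629] ata6: hard resetting link
[  531.974015] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  531.994890] ata6.00: configured for UDMA/133
[  531.995192] ata6: EH complete
"""


def active_tags(sact):
    """Decode an SAct bitmask into the list of NCQ tag numbers (0-31)."""
    return [bit for bit in range(32) if sact & (1 << bit)]


def summarize_ata_events(text):
    """Count failed commands, timeouts and link resets in a dmesg excerpt."""
    summary = {
        "failed_commands": text.count("failed command:"),
        "timeouts": text.count("(timeout)"),
        "link_resets": text.count("hard resetting link"),
        # Tags printed on the "cmd ..." lines of the failed commands.
        "reported_tags": [int(t) for t in re.findall(r"\btag (\d+)\b", text)],
        # Link speed negotiated after each reset, in Gbps.
        "link_speeds": [
            float(s) for s in re.findall(r"SATA link up ([\d.]+) Gbps", text)
        ],
    }
    m = re.search(r"SAct (0x[0-9a-fA-F]+)", text)
    summary["sact_tags"] = active_tags(int(m.group(1), 16)) if m else []
    return summary


if __name__ == "__main__":
    for key, value in summarize_ata_events(DMESG).items():
        print(f"{key}: {value}")
```

For this excerpt it gives 2 failed commands, 2 timeouts, 1 hard link reset, tags 7 and 14 from both the `SAct` mask and the `cmd` lines, and a link that came back at 3.0 Gbps.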