Help requested please: status: One or more devices are faulted in response to IO failures

Hello

ElectricEel-24.10.2: how do I fix a pool on an external SSD which is now shown as SUSPENDED, please?

At home this morning we had a 10-second power cut. My TrueNAS is on a UPS and was unaffected, except for an externally powered SSD in a USB caddy which holds a "temporary" pool. I didn’t think to put its PSU on the UPS circuit, so whilst TN kept running on the UPS, the SSD didn’t.

This is what I see in the TN warnings, as a result:

CRITICAL
Pool temporary state is SUSPENDED: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
Disk CT240BX500SSD1 2402E88E5F48 is FAULTED
2025-03-20 07:55:48 (Europe/London)

Next I did

# zpool status -v temporary
  pool: temporary
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub repaired 0B in 01:04:44 with 0 errors on Sun Mar  9 01:04:49 2025
config:
	NAME                                    STATE     READ WRITE CKSUM
	temporary                               UNAVAIL      0     0     0  insufficient replicas
	 d0da6a9b-477f-44ca-b984-aa5329d7bc99  UNAVAIL      3   727     0
errors: List of errors unavailable: pool I/O is currently suspended

followed by

# zpool clear temporary
cannot clear errors for temporary: I/O error

and finally

# dmesg | grep -i error
[964874.141017] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=32705273856 size=61440 flags=1074267264
[964874.142794] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=32707465216 size=61440 flags=1074267264
[964874.174595] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=230255095808 size=917504 flags=1074267264
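The numeric fields in these lines can be decoded. What follows is a minimal sketch, not an official tool: the helper name decode_zio is made up for illustration, error=5 is EIO on Linux, and the type mapping (1 = read, 2 = write) is my reading of the OpenZFS zio type numbering, so treat it as an assumption.

```shell
#!/bin/sh
# Decode the error= and type= fields of a ZFS "zio" kernel log line.
# decode_zio is a hypothetical helper name, not part of ZFS or TrueNAS.
decode_zio() {
    line="$1"
    # Pull out the decimal values that follow " error=" and " type=".
    err=$(printf '%s\n' "$line" | sed -n 's/.* error=\([0-9][0-9]*\).*/\1/p')
    typ=$(printf '%s\n' "$line" | sed -n 's/.* type=\([0-9][0-9]*\).*/\1/p')
    case "$err" in
        5) err_name="EIO (I/O error)" ;;
        6) err_name="ENXIO (no such device or address)" ;;
        *) err_name="errno $err" ;;
    esac
    # Assumed mapping from the OpenZFS zio type numbering: 1 = read, 2 = write.
    case "$typ" in
        1) typ_name="read" ;;
        2) typ_name="write" ;;
        *) typ_name="type $typ" ;;
    esac
    printf '%s during %s\n' "$err_name" "$typ_name"
}

# First line of the dmesg output above.
decode_zio "zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=32705273856 size=61440 flags=1074267264"
# prints: EIO (I/O error) during write
```

Read this way, all three lines are failed writes to the vdev, which fits a disk that lost power while the pool was in use.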

Edit: in the UI I also tried setting the SSD to OFFLINE, which failed with:

middlewared.service_exception.CallError: [EZFS_POOLUNAVAIL] cannot offline /dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99: pool I/O is currently suspended

I have unplugged and reconnected the USB cable, and I have power cycled the caddy with the SSD in it. The caddy’s PSU has tested OK, so I think the SSD itself is fine but TN no longer sees it as such.
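After a replug it may be worth checking whether the partition that ZFS expects has reappeared, and under which device node. Here is a minimal sketch, assuming the partition UUID from the zpool status output; partuuid_status is a hypothetical helper name, and whether a changed device node is actually what blocks zpool clear here is only a guess.

```shell
#!/bin/sh
# Report whether a partition UUID is currently visible to the kernel.
# partuuid_status is a hypothetical helper name used for illustration.
PARTUUID="d0da6a9b-477f-44ca-b984-aa5329d7bc99"   # from the zpool status output

partuuid_status() {
    uuid="$1"
    dir="${2:-/dev/disk/by-partuuid}"   # second argument overrides the directory
    if [ -e "$dir/$uuid" ]; then
        # readlink -f resolves the symlink to the underlying device node
        printf 'present -> %s\n' "$(readlink -f "$dir/$uuid")"
    else
        printf 'missing\n'
    fi
}

partuuid_status "$PARTUUID"
```

If this prints "missing", the kernel has not re-enumerated the partition, and no zpool command is likely to reach the disk. If it prints a device node, that node can be compared with the one in use before the power cut.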

Then I decided I didn’t know what I was doing and that I should seek help here!

This pool is not critical - it is part of a hobby/experimenting setup. I can easily format the SSD, try it in a different machine, or run some tests from the shell (zsh).

(But I would like to fix it because it is the backing store for my Frigate NVR).

One last experiment before I disappear for the day: the smartctl output.

# smartctl -a -x /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT240BX500SSD1
Serial Number:    2402E88E5F48
LU WWN Device Id: 5 00a075 1e88e5f48
Firmware Version: M6CR056
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5660
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar 20 08:34:44 2025 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			(0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	(   2) minutes.
Extended self-test routine
recommended polling time: 	(  10) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    5028
 12 Power_Cycle_Count       -O--CK   100   100   000    -    25
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   016   016   000    -    846
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    23
180 Unused_Reserve_NAND_Blk PO--CK   100   100   000    -    12
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   065   053   000    -    35 (Min/Max 23/47)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   016   016   001    -    84
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    24903814127
247 Host_Program_Page_Count -O--CK   100   100   000    -    778244191
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    13815380144
249 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    0
250 Read_Error_Retry_Rate   -O--CK   100   100   000    -    0
251 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    507734164
252 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    73
253 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    0
254 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    0
223 Unkn_CrucialMicron_Attr -O--CK   100   100   000    -    2
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O     88  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log not supported
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5023         -
# 2  Extended offline    Completed without error       00%      5002         -
# 3  Short offline       Completed without error       00%      5000         -
# 4  Short offline       Completed without error       00%      4976         -
# 5  Short offline       Completed without error       00%      4953         -
# 6  Short offline       Completed without error       00%      4929         -
# 7  Short offline       Completed without error       00%      4906         -
# 8  Short offline       Completed without error       00%      4882         -
# 9  Short offline       Completed without error       00%      4859         -
#10  Extended offline    Completed without error       00%      4837         -
#11  Short offline       Completed without error       00%      4835         -
#12  Short offline       Completed without error       00%      4812         -
#13  Short offline       Completed without error       00%      4788         -
#14  Short offline       Completed without error       00%      4765         -
#15  Short offline       Completed without error       00%      4741         -
#16  Short offline       Completed without error       00%      4718         -
#17  Short offline       Completed without error       00%      4694         -
#18  Extended offline    Completed without error       00%      4673         -
#19  Short offline       Completed without error       00%      4671         -
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5023         -
# 2  Extended offline    Completed without error       00%      5002         -
# 3  Short offline       Completed without error       00%      5000         -
# 4  Short offline       Completed without error       00%      4976         -
# 5  Short offline       Completed without error       00%      4953         -
# 6  Short offline       Completed without error       00%      4929         -
# 7  Short offline       Completed without error       00%      4906         -
# 8  Short offline       Completed without error       00%      4882         -
# 9  Short offline       Completed without error       00%      4859         -
#10  Extended offline    Completed without error       00%      4837         -
#11  Short offline       Completed without error       00%      4835         -
#12  Short offline       Completed without error       00%      4812         -
#13  Short offline       Completed without error       00%      4788         -
#14  Short offline       Completed without error       00%      4765         -
#15  Short offline       Completed without error       00%      4741         -
#16  Short offline       Completed without error       00%      4718         -
#17  Short offline       Completed without error       00%      4694         -
#18  Extended offline    Completed without error       00%      4673         -
#19  Short offline       Completed without error       00%      4671         -
#20  Short offline       Completed without error       00%      4647         -
#21  Short offline       Completed without error       00%      4624         -
Selective Self-tests/Logging not supported
SCT Commands not supported
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4            0  Device-to-host register FISes sent due to a COMRESET

My advice:

  1. Do a SMART long test. If that runs clear…
  2. Try doing sudo zpool clear temporary and see if that brings the pool back online again.
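The two steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the pool name and /dev/sdd come from this thread (the device node may differ after a replug), the 10-minute wait is the extended self-test polling time from the smartctl output, and the run wrapper and DRY_RUN variable are hypothetical conveniences. By default it only prints the commands. It does not evaluate the self-test result itself, so read the log before relying on the clear.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
POOL="${POOL:-temporary}"   # pool name from the thread
DISK="${DISK:-/dev/sdd}"    # device node from the thread; may change after a replug
DRY_RUN="${DRY_RUN:-1}"     # hypothetical switch, defaults to printing only

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recovery_steps() {
    run smartctl -t long "$DISK"       # start the extended self-test
    run sleep 600                      # wait for the recommended polling time
    run smartctl -l selftest "$DISK"   # show the self-test log for inspection
    run zpool clear "$POOL"            # ask ZFS to clear errors and resume I/O
    run zpool status -v "$POOL"        # show the resulting pool state
}

recovery_steps
```

Running it with DRY_RUN=0 as root would execute the commands for real.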

I ran the long test on sdd, but

# sudo zpool clear temporary

still reports

cannot clear errors for temporary: I/O error

Various other zpool commands from internet searches did nothing, so I rebooted the TN machine, and now the SSD is indicated as being fine.

“Repairing” via a reboot seems a bit drastic, so I hope there is a way to do the same with a shell command … any ideas, anyone?