Hello
ElectricEel-24.10.2: how do I fix a pool on an external SSD which is now shown as SUSPENDED, please?
At home this morning we suffered a 10-second power cut. My TrueNAS is on a UPS and was unaffected, except for an externally powered SSD in a USB caddy, which holds a pool named "temporary". I hadn’t thought to put the caddy’s PSU on the UPS circuit, so whilst TN carried on running on the UPS, the SSD lost power.
This is what I see in the TN warnings, as a result:
CRITICAL
Pool temporary state is SUSPENDED: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
Disk CT240BX500SSD1 2402E88E5F48 is FAULTED
2025-03-20 07:55:48 (Europe/London)
Next I ran:
# zpool status -v temporary
pool: temporary
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
scan: scrub repaired 0B in 01:04:44 with 0 errors on Sun Mar 9 01:04:49 2025
config:
        NAME                                    STATE     READ WRITE CKSUM
        temporary                               UNAVAIL      0     0     0  insufficient replicas
          d0da6a9b-477f-44ca-b984-aa5329d7bc99  UNAVAIL      3   727     0
errors: List of errors unavailable: pool I/O is currently suspended
followed by
# zpool clear temporary
cannot clear errors for temporary: I/O error
and finally
# dmesg | grep -i error
[964874.141017] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=32705273856 size=61440 flags=1074267264
[964874.142794] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=32707465216 size=61440 flags=1074267264
[964874.174595] zio pool=temporary vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 error=5 type=2 offset=230255095808 size=917504 flags=1074267264
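In case it helps anyone reading the dmesg lines above: as far as I can tell, error=5 is the Linux errno EIO and type=2 is a ZFS write, i.e. writes to the caddy failed while it had no power. Below is a small Python sketch that decodes one of these lines. The ZIO_TYPES table is my own assumption, based on my reading of the OpenZFS zio_type enum, and may differ between releases; decode_zio_line is just a name I made up for this post.

```python
import errno
import os
import re

# Assumed mapping of the numeric "type=" field to ZFS I/O types
# (my reading of the OpenZFS zio_type enum; treat as unverified).
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free",
             4: "claim", 5: "ioctl/flush", 6: "trim"}

# Matches the "zio pool=... vdev=... error=... type=..." lines from dmesg.
ZIO_LINE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)

def decode_zio_line(line):
    """Return a dict describing one zio error line, or None if it does not match."""
    m = ZIO_LINE.search(line)
    if m is None:
        return None
    err = int(m["error"])
    return {
        "pool": m["pool"],
        "vdev": m["vdev"],
        "errno": errno.errorcode.get(err, str(err)),  # symbolic errno name
        "message": os.strerror(err),                  # OS description of the errno
        "io_type": ZIO_TYPES.get(int(m["type"]), "unknown"),
        "offset": int(m["offset"]),
        "size": int(m["size"]),
    }

sample = ("[964874.141017] zio pool=temporary "
          "vdev=/dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99 "
          "error=5 type=2 offset=32705273856 size=61440 flags=1074267264")
print(decode_zio_line(sample))
```

If that reading is right, all three lines are failed writes of 60 KiB to 896 KiB, which fits the WRITE count of 727 in the zpool status output.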
Edit: in the UI I also tried setting the SSD to OFFLINE:
middlewared.service_exception.CallError: [EZFS_POOLUNAVAIL] cannot offline /dev/disk/by-partuuid/d0da6a9b-477f-44ca-b984-aa5329d7bc99: pool I/O is currently suspended
I have unplugged and reconnected the USB cable, and I have power-cycled the caddy with the SSD in it. The caddy’s PSU has tested OK, so I think the SSD itself is fine but TN no longer sees it as healthy.
At that point I decided I didn’t know what I was doing and should seek help here!
This pool is not critical: it is part of a hobby/experimenting setup. I can easily format the SSD, try it in a different machine, or run some tests from the shell (zsh).
(But I would like to fix it because it is the backing store for my Frigate NVR).
One last experiment before I disappear for the day: the smartctl output.
# smartctl -a -x /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT240BX500SSD1
Serial Number: 2402E88E5F48
LU WWN Device Id: 5 00a075 1e88e5f48
Firmware Version: M6CR056
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database 7.3/5660
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Mar 20 08:34:44 2025 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 000 - 0
5 Reallocate_NAND_Blk_Cnt -O--CK 100 100 010 - 0
9 Power_On_Hours -O--CK 100 100 000 - 5028
12 Power_Cycle_Count -O--CK 100 100 000 - 25
171 Program_Fail_Count -O--CK 100 100 000 - 0
172 Erase_Fail_Count -O--CK 100 100 000 - 0
173 Ave_Block-Erase_Count -O--CK 016 016 000 - 846
174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 23
180 Unused_Reserve_NAND_Blk PO--CK 100 100 000 - 12
183 SATA_Interfac_Downshift -O--CK 100 100 000 - 0
184 Error_Correction_Count -O--CK 100 100 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
194 Temperature_Celsius -O---K 065 053 000 - 35 (Min/Max 23/47)
196 Reallocated_Event_Count -O--CK 100 100 000 - 0
197 Current_Pending_ECC_Cnt -O--CK 100 100 000 - 0
198 Offline_Uncorrectable ----CK 100 100 000 - 0
199 UDMA_CRC_Error_Count -O--CK 100 100 000 - 0
202 Percent_Lifetime_Remain ----CK 016 016 001 - 84
206 Write_Error_Rate -OSR-- 100 100 000 - 0
210 Success_RAIN_Recov_Cnt -O--CK 100 100 000 - 0
246 Total_LBAs_Written -O--CK 100 100 000 - 24903814127
247 Host_Program_Page_Count -O--CK 100 100 000 - 778244191
248 FTL_Program_Page_Count -O--CK 100 100 000 - 13815380144
249 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 0
250 Read_Error_Retry_Rate -O--CK 100 100 000 - 0
251 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 507734164
252 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 73
253 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 0
254 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 0
223 Unkn_CrucialMicron_Attr -O--CK 100 100 000 - 2
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x24 GPL R/O 88 Current Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log not supported
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5023 -
# 2 Extended offline Completed without error 00% 5002 -
# 3 Short offline Completed without error 00% 5000 -
# 4 Short offline Completed without error 00% 4976 -
# 5 Short offline Completed without error 00% 4953 -
# 6 Short offline Completed without error 00% 4929 -
# 7 Short offline Completed without error 00% 4906 -
# 8 Short offline Completed without error 00% 4882 -
# 9 Short offline Completed without error 00% 4859 -
#10 Extended offline Completed without error 00% 4837 -
#11 Short offline Completed without error 00% 4835 -
#12 Short offline Completed without error 00% 4812 -
#13 Short offline Completed without error 00% 4788 -
#14 Short offline Completed without error 00% 4765 -
#15 Short offline Completed without error 00% 4741 -
#16 Short offline Completed without error 00% 4718 -
#17 Short offline Completed without error 00% 4694 -
#18 Extended offline Completed without error 00% 4673 -
#19 Short offline Completed without error 00% 4671 -
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5023 -
# 2 Extended offline Completed without error 00% 5002 -
# 3 Short offline Completed without error 00% 5000 -
# 4 Short offline Completed without error 00% 4976 -
# 5 Short offline Completed without error 00% 4953 -
# 6 Short offline Completed without error 00% 4929 -
# 7 Short offline Completed without error 00% 4906 -
# 8 Short offline Completed without error 00% 4882 -
# 9 Short offline Completed without error 00% 4859 -
#10 Extended offline Completed without error 00% 4837 -
#11 Short offline Completed without error 00% 4835 -
#12 Short offline Completed without error 00% 4812 -
#13 Short offline Completed without error 00% 4788 -
#14 Short offline Completed without error 00% 4765 -
#15 Short offline Completed without error 00% 4741 -
#16 Short offline Completed without error 00% 4718 -
#17 Short offline Completed without error 00% 4694 -
#18 Extended offline Completed without error 00% 4673 -
#19 Short offline Completed without error 00% 4671 -
#20 Short offline Completed without error 00% 4647 -
#21 Short offline Completed without error 00% 4624 -
Selective Self-tests/Logging not supported
SCT Commands not supported
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x000a 4 0 Device-to-host register FISes sent due to a COMRESET
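Some rough arithmetic on the SMART attributes above, as a sketch only. I am assuming that attribute 246 (Total_LBAs_Written) counts 512-byte logical sectors, matching the sector size smartctl reports, and that the commonly quoted Crucial/Micron formula (attr 247 + attr 248) / attr 247 gives the write amplification. Neither assumption is confirmed by anything in this output, and the function names are my own.

```python
LBA_BYTES = 512  # assumption: attribute 246 is counted in 512-byte logical sectors

# Raw values copied from the attribute table above.
total_lbas_written = 24903814127   # ID 246 Total_LBAs_Written
host_program_pages = 778244191     # ID 247 Host_Program_Page_Count
ftl_program_pages = 13815380144    # ID 248 FTL_Program_Page_Count

def bytes_written(lbas, lba_bytes=LBA_BYTES):
    """Host bytes written, given an LBA count and an assumed LBA size."""
    return lbas * lba_bytes

def write_amplification(host_pages, ftl_pages):
    """Assumed formula: total NAND pages programmed per host page programmed."""
    return (host_pages + ftl_pages) / host_pages

tb = bytes_written(total_lbas_written) / 1e12
waf = write_amplification(host_program_pages, ftl_program_pages)
print(f"host writes: {tb:.2f} TB, write amplification: {waf:.1f}")
```

On the numbers above this works out at roughly 12.75 TB written by the host and a write amplification of about 18.8, if the assumptions hold.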