"One or more devices has experienced an unrecoverable error"

Hello, good evening.

I am using TrueNAS Community Edition 25.04.0. I just received an alert from my TrueNAS via email.

I have a mirrored pool with 2 Kingston SSDs, model A400, 480 GB. After a periodic scrub, one of the SSDs is showing “4 errors,” consisting of 3 read errors and 1 checksum error:

When I ran a SMART test, it reported “No errors”:

The Scrutiny app shows me this:

I have an extra SSD here, of the same model. The problem is that I don’t have any available SATA ports at the moment; all 6 are in use.

How should I proceed in this case? What should I do? Should I run more tests?

Should I replace this SSD as soon as possible? If so, how should I proceed?

Should I follow this guide?

And lastly, I’ll probably only be able to open the case tomorrow. Should I turn off the NAS right now, just to be safe?

Thanks!

To help with diagnosis, please open the web shell or an SSH session and provide the output of the following commands as formatted text (</> button):
sudo zpool status
sudo smartctl -x /dev/sdc (assuming you have not rebooted)

No. It doesn’t look that critical.

Bare metal install? All drives on motherboard ports? Which motherboard/CPU/RAM, by the way?
If you have to replace the drive, you can just take it out and put the new drive in its place; optionally, you may put the old drive on a USB adapter so it’s still available to provide redundancy during resilver.

I ended up not running the suggested commands. I wasn’t too concerned since you said it wouldn’t be anything critical.

Just yesterday, I opened up the case and replaced the SSD that was having issues. The resilvering process completed successfully, and I haven’t had any problems since.

I’ve been thinking about buying an HBA card to add 4 more SATA ports to my NAS and keep a few drives as hot spares…

Anyway, thanks for the support!

Good evening!

Today, the issue was with the boot-pool:

When running a scrub on this SSD:

zpool status:

  pool: boot-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:14 with 0 errors on Sun May  4 11:18:55 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     2     0

errors: No known data errors
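
For anyone who wants to check these counters without reading the table by eye, here is a minimal Python sketch that pulls the per-device READ/WRITE/CKSUM values out of pasted `zpool status` text. The function name `devices_with_errors` and the sample string are my own for illustration, and the regex assumes plain integer counters (it would skip abbreviated values such as `1.2K`).

```python
import re

# Trimmed copy of the `zpool status` output shown above.
ZPOOL_STATUS = """\
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     2     0

errors: No known data errors
"""

# A config row looks like: <name> <STATE> <read> <write> <cksum>.
# Assumption: counters are plain integers, not abbreviated (e.g. "1.2K").
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, pool/state lines, blank lines
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


print(devices_with_errors(ZPOOL_STATUS))  # {'sda3': (0, 2, 0)}
```

On this output the only flagged entry is `sda3`, with 2 write errors and no read or checksum errors.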

smartctl -x /dev/sda:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37480G
Serial Number:    50026B778441289A
LU WWN Device Id: 5 0026b7 78441289a
Firmware Version: SAH20105
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5706
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun May  4 19:40:22 2025 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   000    -    100
  9 Power_On_Hours          -O--CK   100   100   000    -    6358
 12 Power_Cycle_Count       -O--CK   100   100   000    -    187
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    1
169 Bad_Block_Rate          ------   100   100   000    -    0
170 Bad_Blk_Ct_Lat/Erl      ------   100   100   010    -    0/0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    0
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    112
194 Temperature_Celsius     -O---K   030   039   000    -    30 (Min/Max 19/39)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
218 CRC_Error_Count         -O--CK   100   100   000    -    1
231 SSD_Life_Left           ------   099   099   000    -    99
233 Flash_Writes_GiB        -O--CK   100   100   000    -    936
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    1601
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    540
244 Average_Erase_Count     ------   100   100   000    -    11
245 Max_Erase_Count         ------   100   100   000    -    29
246 Total_Erase_Count       ------   100   100   000    -    6066
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xde       GPL     VS       8  Device vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 3
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 00 00 00 00 00 00 00 40 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d1 01 01 00 00 4f 00 c2 01 00 08     00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  2f 00 00 01 01 00 00 00 00 00 03 00 08     00:00:00.000  READ LOG EXT
  2f 00 00 01 01 00 00 00 00 00 00 00 08     00:00:00.000  READ LOG EXT
  b0 00 d5 01 01 00 00 4f 00 c2 00 00 08     00:00:00.000  SMART READ LOG
  b0 00 da 00 00 00 00 4f 00 c2 00 00 08     00:00:00.000  SMART RETURN STATUS

Error 2 [1] log entry is empty
Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       497         -

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             187  ---  Lifetime Power-On Resets
0x01  0x010  4            6358  ---  Power-on Hours
0x01  0x018  6      3359628590  ---  Logical Sectors Written
0x01  0x020  6          527969  ---  Number of Write Commands
0x01  0x028  6      1134221608  ---  Logical Sectors Read
0x01  0x030  6          160682  ---  Number of Read Commands
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               1  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            1  Command failed due to ICRC error
0x0002  4            1  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4            3  Device-to-host register FISes sent due to a COMRESET

How should I proceed?

Thank you!