Boot pool degraded after reboot

Hello there,

After a reboot, my boot pool is reporting errors. The SMART tests completed with PASSED.

TrueNAS Log:

Jun 26 02:52:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:52:22 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:52:48 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:52:48 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:53:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:53:22 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:53:48 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:53:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:54:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:54:22 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:54:49 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:54:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:55:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:55:22 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:55:49 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:55:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:56:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:56:22 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:56:49 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:56:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:57:22 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:57:23 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:57:49 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:57:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
Jun 26 02:58:23 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='79'
Jun 26 02:58:23 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='79', time_reopen='60'
Jun 26 02:58:49 njetflix syslog-ng[3248]: Error suspend timeout has elapsed, attempting to write again; fd='82'
Jun 26 02:58:49 njetflix syslog-ng[3248]: Suspending write operation because of an I/O error; fd='82', time_reopen='60'
pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:08 with 55 errors on Thu Jun 26 01:38:30 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          sdd3      DEGRADED     0     0   334  too many errors

errors: Permanent errors have been detected in the following files:

        /audit/syslog-ng-00002.rqf
        /audit/SYSTEM.db-wal
        /audit/MIDDLEWARE.db-wal
        /root/.zsh-histfile
        /var/log/samba4/log.samba-dcerpcd
        /var/log/auth.log
        /var/log/sysstat/sa24
        /var/log/audit/audit.log
        /var/log/syslog
        /var/log/kern.log
        /var/log/debug
        /var/log/samba4/log.wb-TRUENAS
        /var/log/audit/audit.log.1
        /var/log/sysstat/sa25
        /var/log/daemon.log
        /var/log/error
        /var/log/sysstat/sa23
        /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/ba5929460e6b0e402582875c5daa6bc9365206416ecc0b762b9162460faa5f4f
        /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
        /var/lib/dhcp/dhclient.leases.enp7s0
        /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/9069cad4ed2dcec942d9b889ffc4583a46c38752ccd900c5f5c71b6eddbbb07b

truenas_admin@njetflix[~]$ sudo smartctl -a -x /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSCKHW120A4
Serial Number:    CVDA51320168120Q
LU WWN Device Id: 5 5cd2e4 04bfc1691
Firmware Version: DC31
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic
Device is:        Not in smartctl database 7.3/5706
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jun 26 03:06:22 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  33) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                ( 2930) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    1371 (67 6 0)
 12 Power_Cycle_Count       -O--CK   099   099   000    -    1566
170 Unknown_Attribute       PO--CK   100   100   010    -    0
171 Unknown_Attribute       -O--CK   100   100   000    -    0
172 Unknown_Attribute       -O--CK   100   100   000    -    0
174 Unknown_Attribute       -O--CK   100   100   000    -    23
183 Runtime_Bad_Block       -O--CK   100   100   000    -    9
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   034   055   000    -    34 (Min/Max -21/55)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    23
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
225 Unknown_SSD_Attribute   -O--CK   100   100   000    -    178891
226 Unknown_SSD_Attribute   -O--CK   100   100   000    -    65535
227 Unknown_SSD_Attribute   -O--CK   100   100   000    -    50
228 Power-off_Retract_Count -O--CK   100   100   000    -    65535
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    178891
242 Total_LBAs_Read         -O--CK   100   100   000    -    184075
249 Unknown_Attribute       -O--CK   100   100   000    -    11559
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x04       GPL,SL  R/O      1  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O     16  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7       GPL,SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log Version: 0 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Interrupted (host reset)      10%      1371         -
# 2  Extended offline    Completed without error       00%      1370         -
# 3  Extended offline    Completed without error       00%      1370         -
# 4  Offline             Interrupted (host reset)      10%      1369         -
# 5  Extended offline    Completed without error       00%      1369         -
# 6  Short offline       Completed without error       00%      1364         -
# 7  Offline             Interrupted (host reset)      10%      1363         -
# 8  Offline             Interrupted (host reset)      10%      1363         -
# 9  Offline             Interrupted (host reset)      10%      1344         -
#10  Short offline       Completed without error       00%      1344         -
#11  Offline             Interrupted (host reset)      10%      1337         -
#12  Offline             Interrupted (host reset)      10%      1320         -
#13  Conveyance offline  Completed without error       00%      1320         -
#14  Offline             Interrupted (host reset)      10%      1320         -
#15  Offline             Interrupted (host reset)      10%      1288         -
#16  Offline             Interrupted (host reset)      10%      1284         -
#17  Offline             Interrupted (host reset)      10%      1230         -
#18  Offline             Interrupted (host reset)      10%      1230         -
#19  Offline             Interrupted (host reset)      10%      1185         -

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Interrupted (host reset)      10%      1371         -
# 2  Extended offline    Completed without error       00%      1370         -
# 3  Extended offline    Completed without error       00%      1370         -
# 4  Offline             Interrupted (host reset)      10%      1369         -
# 5  Extended offline    Completed without error       00%      1369         -
# 6  Short offline       Completed without error       00%      1364         -
# 7  Offline             Interrupted (host reset)      10%      1363         -
# 8  Offline             Interrupted (host reset)      10%      1363         -
# 9  Offline             Interrupted (host reset)      10%      1344         -
#10  Short offline       Completed without error       00%      1344         -
#11  Offline             Interrupted (host reset)      10%      1337         -
#12  Offline             Interrupted (host reset)      10%      1320         -
#13  Conveyance offline  Completed without error       00%      1320         -
#14  Offline             Interrupted (host reset)      10%      1320         -
#15  Offline             Interrupted (host reset)      10%      1288         -
#16  Offline             Interrupted (host reset)      10%      1284         -
#17  Offline             Interrupted (host reset)      10%      1230         -
#18  Offline             Interrupted (host reset)      10%      1230         -
#19  Offline             Interrupted (host reset)      10%      1185         -
#20  Offline             Interrupted (host reset)      10%      1165         -
#21  Offline             Interrupted (host reset)      10%      1115         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     -21/55 Celsius
Lifetime    Min/Max Temperature:     -21/66 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     0 (Unknown, should be 2)
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/ 0 Celsius
Min/Max Temperature Limit:            0/ 0 Celsius
Temperature History Size (Index):    0 (410)
Temperature History is empty

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4            1580  ---  Lifetime Power-On Resets
0x01  0x010  4            1378  ---  Power-on Hours
0x01  0x018  6     13075412317  ---  Logical Sectors Written
0x01  0x028  6     13592132470  ---  Logical Sectors Read
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4            3798  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              36  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              49  ---  Highest Temperature
0x05  0x028  1              21  ---  Lowest Temperature
0x05  0x030  1              36  ---  Highest Average Short Term Temperature
0x05  0x038  1              31  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            3798  ---  Number of Hardware Resets
0x06  0x010  4            4532  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               4  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
0x0002  2            0  R_ERR response for data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

Get a backup of your TrueNAS Configuration asap. Check your cabling.

You may need a new boot device. If so, install the TrueNAS version you are currently running on it, then reload the configuration file.


Config saved. But where are the errors coming from? The SSD is an M.2 SATA drive, so there are no cables. I reseated the SSD anyway.

Critical
Failed to check for alert APIFailedLogin: [Errno 9] Bad file descriptor
2025-06-26 01:03:56 (Europe/Berlin)

Critical
Failed to configure docker for Applications: Unable to determine default interface
2025-06-26 01:02:57 (Europe/Berlin)

Critical
Boot pool status is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected..
2025-06-26 01:06:55 (Europe/Berlin)

Critical
Failed to sync TRUENAS catalog: [EFAULT] Failed to clone 'https://github.com/truenas/apps' repository at '/var/run/middleware/ix-apps/catalogs' destination: [EFAULT] Failed to clone 'https://github.com/truenas/apps' repository at '/var/run/middleware/ix-apps/catalogs' destination: Cloning into '/var/run/middleware/ix-...
2025-06-26 01:16:42 (Europe/Berlin)

Error
Audit service failed backend setup: MIDDLEWARE. See /var/log/middlewared.log
2025-06-26 03:33:54 (Europe/Berlin)

Error
Audit service failed backend setup: SYSTEM. See /var/log/middlewared.log
2025-06-26 03:33:54 (Europe/Berlin)

Error
Failed to perform audit query: [Errno 9] Bad file descriptor
2025-06-26 01:03:56 (Europe/Berlin)

It’s late. Coming back after sleep :smiley:

You should post your full system details. Expand my Details below my posts to get an idea of what info is helpful. You can try powering off and reseating the SSD and see if it works again.
With the details provided so far, my guess is a bad or failing SSD.
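
As far as I know, the overall PASSED result only reflects the drive's own threshold checks, so it can help to look at the individual raw values as well. Below is a minimal Python sketch that parses attribute rows in the layout of the `smartctl -x` output posted above. The helper names (`parse_attributes`, `nonzero_watched`) and the watch list of IDs are my own illustrative assumptions, not part of smartmontools or vendor guidance.

```python
import re

# One row of the attribute table as printed by `smartctl -x`:
# ID# ATTRIBUTE_NAME  FLAGS  VALUE WORST THRESH FAIL RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

# Assumption: an illustrative watch list, not vendor guidance.
WATCHED_IDS = {5, 183, 187, 199}


def parse_attributes(text):
    """Map attribute ID -> (name, leading integer of RAW_VALUE)."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(8)))
    return attrs


def nonzero_watched(text, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is not zero."""
    return {
        attr_id: info
        for attr_id, info in parse_attributes(text).items()
        if attr_id in watched and info[1] != 0
    }


# Rows copied from the smartctl output posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
183 Runtime_Bad_Block       -O--CK   100   100   000    -    9
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   034   055   000    -    34 (Min/Max -21/55)
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(nonzero_watched(SAMPLE).items()):
        print(attr_id, name, raw)
```

Only the leading integer of RAW_VALUE is kept, so a value such as `34 (Min/Max -21/55)` is read as 34. What a non-zero raw value means depends on the vendor, so treat any hit as a prompt to look closer rather than as a verdict.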


Mainboard: ASUS PRIME B450M-K II, UEFI 4631 (90MB1600-M0EAY0)
CPU: AMD Ryzen 7 1700X, 8C/16T, 3.40-3.80GHz (YD170XBCAEWOF)
RAM: 2x 8GB BALLISTIX DIMM Non-ECC DDR4 2666MHz (BLS8G4D26BFSC.16FBD2)
SSD: Intel SSD 530 120GB, M.2 2280 / B-M-Key / SATA 6Gb/s (SSDSCKHW120A4)
PSU: be quiet! Pure Power 11 400W (BN292)
OS: TrueNAS SCALE 25.04.

Adding a link to your other post in case it is related. Have you checked whether the SSD is abnormally hot, either with an IR thermometer or just by touching it?
120 °F is about 49 °C, the highest temperature listed in your smartctl output.
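
For reference, the conversion behind that figure can be checked with a small Python sketch. The function name `fahrenheit_to_celsius` is just an illustrative choice.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


if __name__ == "__main__":
    # 120 F is the figure quoted in the post above.
    print(f"{fahrenheit_to_celsius(120):.1f} C")
```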


Like we say for some other important things, one is none, two is one. That’s how many of us treat our boot drives. You want a mirrored pair so that in the future, if one croaks, you can limp on the other until you can get a replacement. Boot mirrors make the pillow softer.

IMHO, an NVMe drive is overkill for a boot device, which is mostly read-only. You can even move the system dataset (which is what would keep the boot drive busy) onto another pool, so the boot drive is left alone while the system is running. But let’s get to the point.


(System > Advanced Settings > Storage)

Go order a cheap pair of SATA SSDs. I’ve seen two-packs of 256 GB or smaller drives (even 128 GB is plenty). Do your reinstall on one of them; once it boots and is running, attach the second new SSD as a mirror so you have a neat little mirrored boot setup. It’ll be fast enough and won’t run hot, and after boot the system mostly just reads little bits of it occasionally anyway. Finally, import your backup configuration and voilà, you won’t see this problem again.

Still reading? @joeschmuck has a nice little reporting script that emails you after scrub events and SMART tests, so you know your system’s health daily. It would be the cherry on top of all this. It’s well documented and not hard to edit or set up.


Thanks @SmallBarky and @afrosheen for your posts. I am already on my way to getting two new boot SSDs.

When I boot the PC into TrueNAS, it doesn’t get an IPv4 address and the apps don’t start. I assume this comes from the failing boot drive? I just want to know what’s happening, and why the drive is failing even though I don’t see any S.M.A.R.T. errors.

So I reinstalled TrueNAS on the same SSD and loaded my saved configuration, and now it’s running again with no errors. I don’t get it xD

Was it just a software thing? Or will the drive errors come back because the underlying problem is a real hardware failure?
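
One way to see whether the errors come back is to check the per-device READ/WRITE/CKSUM counters in `zpool status` after each scrub. Below is a minimal Python sketch, assuming output laid out like the listing earlier in this thread. `parse_zpool_errors` is a hypothetical helper name, not part of any TrueNAS or ZFS tool.

```python
import re

# Matches config rows such as "  sdd3  DEGRADED  0  0  334  too many errors".
# The header row is skipped because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_zpool_errors(status_text: str) -> dict[str, dict[str, int]]:
    """Return devices whose READ, WRITE or CKSUM counter is non-zero."""
    bad = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = {"read": int(read), "write": int(write), "cksum": int(cksum)}
        if any(counts.values()):
            bad[name] = counts
    return bad


# Config section copied from the zpool status output in the first post.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          sdd3      DEGRADED     0     0   334  too many errors
"""

if __name__ == "__main__":
    for device, counts in parse_zpool_errors(SAMPLE).items():
        print(device, counts)
```

Note that this sketch only understands plain integer counters. Large counts that are printed in abbreviated form (for example with a `K` suffix) would be skipped; as far as I know, `zpool status -p` prints exact numbers and avoids that.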

So I followed the chart. I still don’t get it :smiley: