Seagate Exos X14 10TB Drives Degraded on TrueNAS - Need Help

periscope3746 · March 28, 2025, 10:57am

Hello everyone,

I’m currently experiencing an issue with my TrueNAS system. I have two Seagate Exos X14 10TB drives configured as a mirror, and after booting up, the system shows a “degraded” status. Interestingly, after performing a reboot, the problem seems to disappear, and the drives operate normally.

However, I noticed another issue when moving large files — the drives tend to disappear, and the degraded status reappears. I’m unsure whether this indicates a hardware failure or a configuration issue.

Has anyone else encountered this kind of behavior with Seagate Exos or other large-capacity drives on TrueNAS? I would greatly appreciate any suggestions on troubleshooting steps or potential fixes.

zpool status

pool: nas_pool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 9.10M in 00:00:01 with 0 errors on Thu Mar 27 16:49:21 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        nas_pool                                  DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            2744419453325156401                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/b9801db4-4892-431a-98ab-c53e23e3d819
            e487408c-2ae0-43ed-9a13-d3ada4caa9e1  ONLINE       0     0     0
        logs
          mirror-5                                ONLINE       0     0     0
            nvme0n1p5                             ONLINE       0     0     0
            nvme1n1p5                             ONLINE       0     0     0
        cache
          nvme0n1p6                               ONLINE       0     0     0
          nvme1n1p6                               ONLINE       0     0     0

errors: No known data errors

Thanks in advance for your help!

periscope3746 · March 28, 2025, 11:05am

smartctl -a /dev/sda

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X14
Device Model:     ST10000NM0568
Serial Number:
LU WWN Device Id:
Firmware Version: SS02
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5671
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Mar 28 20:02:50 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1010) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       221463524
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       195
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   045    Pre-fail  Always       -       6489855
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1940
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       166
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   052   040    Old_age   Always       -       41 (Min/Max 25/41)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       193
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       866
194 Temperature_Celsius     0x0022   041   048   000    Old_age   Always       -       41 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   015   008   000    Old_age   Always       -       221463524
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1692h+25m+14.469s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4527105267
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4290358909

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      50%      1925         -
# 2  Short offline       Completed without error       00%      1833         -
# 3  Short offline       Completed without error       00%      1807         -
# 4  Extended offline    Interrupted (host reset)      80%       536         -
# 5  Short offline       Completed without error       00%        14         -
# 6  Short offline       Completed without error       00%        14         -
# 7  Short offline       Completed without error       00%        14         -
# 8  Extended offline    Aborted by host               90%        13         -
# 9  Short offline       Completed without error       00%        13         -
#10  Short offline       Completed without error       00%        12         -
#11  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Fleshmauler · March 28, 2025, 11:25am

Mind re-running with

smartctl -a -v 1,raw48:54 /dev/sda -v 7,raw48:54 -v 195,raw48:54

Seagate needs some additional parameters, otherwise the values aren’t easy to read for smartctl.
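
As a rough illustration of why those options matter: on many Seagate drives the 48-bit raw value of attributes 1, 7 and 195 is commonly understood to pack an operation count into the low 32 bits and the actual error count into the high 16 bits. That layout is an assumption taken from community documentation, not something confirmed in this thread, and the function names below are made up for the example.

```shell
# Split a Seagate-style 48-bit raw SMART value into its two fields.
# ASSUMPTION: high 16 bits = error count, low 32 bits = operation count.
seagate_raw_errors() {
    echo $(( $1 >> 32 ))
}

seagate_raw_operations() {
    # 4294967295 is 2^32 - 1, i.e. a mask for the low 32 bits.
    echo $(( $1 & 4294967295 ))
}

# Raw value of attribute 1 from the first smartctl run above.
seagate_raw_errors 221463524       # prints 0
seagate_raw_operations 221463524   # prints 221463524
```

Under that assumption the large number in the first output is just an operation counter, which matches the raw value of 0 that smartctl reports once `raw48:54` is applied.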

Otherwise, rebooting or running zpool clear will remove the errors/warnings, but it isn’t actually fixing anything. Have you tried running a scrub? Have you tried reseating the cables? Any hardware details may also help: motherboard, how the drives are connected (directly to the motherboard or through an HBA), etc. Expand my signature for an example of what would generally be helpful.

periscope3746 · March 28, 2025, 11:46am

smartctl -a -v 1,raw48:54 /dev/sda -v 7,raw48:54 -v 195,raw48:54

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X14
Device Model:     ST10000NM0568
Serial Number:
LU WWN Device Id:
Firmware Version: SS02
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5671
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Mar 28 20:45:05 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1010) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   066   064   044    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       195
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   045    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1941
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       166
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   052   040    Old_age   Always       -       45 (Min/Max 25/45)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       193
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       866
194 Temperature_Celsius     0x0022   045   048   000    Old_age   Always       -       45 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   040   008   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1693h+07m+29.158s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4553904907
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4290365237

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      50%      1925         -
# 2  Short offline       Completed without error       00%      1833         -
# 3  Short offline       Completed without error       00%      1807         -
# 4  Extended offline    Interrupted (host reset)      80%       536         -
# 5  Short offline       Completed without error       00%        14         -
# 6  Short offline       Completed without error       00%        14         -
# 7  Short offline       Completed without error       00%        14         -
# 8  Extended offline    Aborted by host               90%        13         -
# 9  Short offline       Completed without error       00%        13         -
#10  Short offline       Completed without error       00%        12         -
#11  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

periscope3746 · March 28, 2025, 11:52am

I have a QNAP TS-253D with two Seagate Exos X14 10 TB drives and a Crucial T500 500 GB SSD that is partitioned into multiple sections for different purposes: one for the OS, one for SLOG, one for L2ARC Cache, and another for containers, along with 16 GB DDR4 RAM.

I’ve already tried rebooting and running zpool clear, but that only temporarily resolves the issue. The problem comes back regularly, suggesting it’s not just a display issue.

The cables are securely connected, and there are no loose connections. A scrub is run weekly, but it doesn’t fix the problem permanently.
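
Because the fault comes and goes, it may help to record which device drops at the moment it happens, before a reboot or zpool clear hides it. The following is only a sketch: `list_unhealthy_vdevs` is a hypothetical helper name, and it simply filters `zpool status` text for rows whose state is one of the standard non-ONLINE vdev states.

```shell
# Read `zpool status` text on stdin and print every vdev whose state is not ONLINE.
# Config rows have at least five columns (NAME STATE READ WRITE CKSUM), so
# requiring NF >= 5 skips the two-column "state: DEGRADED" header line.
list_unhealthy_vdevs() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|UNAVAIL|FAULTED|OFFLINE|REMOVED)$/ { print $1, $2 }'
}
```

It could be used as `zpool status nas_pool | list_unhealthy_vdevs`, for example from a periodic job that also logs the time, so the drop can be lined up with large file transfers.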

dan · March 28, 2025, 11:58am

This isn’t your problem, but it’s a disastrously poor configuration, even leaving aside that you probably don’t have any use for SLOG and can’t effectively use L2ARC. And that SSD is completely unsuitable for SLOG.

periscope3746 · March 28, 2025, 12:25pm

I understand that the configuration may not be ideal, and I appreciate your feedback, but I’m not focused on the SSD setup at the moment since it’s not directly related to the issue I’m experiencing. My primary concern is the persistent errors that keep appearing in my ZFS pool, which zpool clear and a reboot only temporarily fix, but don’t resolve the underlying issue.

I’m aware that using an SSD for SLOG might not be the best choice and that L2ARC may not be fully optimized with my 16 GB of RAM (I realize that L2ARC typically requires more memory for optimal performance). However, for now, my goal is to address the errors and performance issues within the ZFS pool. The issues I’m facing seem to be hardware or disk-related rather than configuration-related.

I’ll look into improving the SSD setup later, but for now, any advice on fixing the ZFS errors would be greatly appreciated. Specifically, I’m looking for guidance on interpreting the SMART data, checking the disk health, and investigating if there’s any underlying hardware issue causing these warnings.

Thanks again for your input!

yorick · March 28, 2025, 12:36pm

What is that using for a controller? How is heat?

It is extremely likely this is a hardware / driver issue at its core.
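
One way to check the hardware / driver theory is to look at the kernel log right after a drive disappears. The sketch below filters log text for SATA link trouble; the helper name `filter_ata_link_events` is hypothetical, and the matched phrases are assumptions about how the Linux libata layer typically words these messages, so they may need adjusting for a given kernel.

```shell
# Read kernel log text on stdin and keep only lines from the libata layer
# (prefixed "ataN:" or "ataN.MM:") that mention resets, link loss,
# failed commands, or exceptions. Phrase list is an assumption, not exhaustive.
filter_ata_link_events() {
    grep -iE 'ata[0-9]+(\.[0-9]+)?: .*(resetting link|link down|link is slow|failed command|exception)' || true
}
```

Running `dmesg | filter_ata_link_events` shortly after a large copy triggers the problem would show whether the link itself is resetting, which points at the controller, backplane, or power rather than the disk.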

periscope3746 · March 28, 2025, 12:40pm

The CPU is running at 50°C, and the hard drive is at 45°C. The controller for the HDD is the standard one provided by QNAP.

It does seem very likely that this issue could be hardware or driver-related, as you mentioned. However, it’s worth noting that the drives ran without any issues in a mirror setup under Proxmox prior to this. If anyone has suggestions on how to further troubleshoot this, I’d appreciate it!

Protopia · March 28, 2025, 1:45pm

Not a cause of the problem but you should schedule weekly SMART short tests and monthly long tests, and implement @joeschmuck’s Multi-Report script so you get immediate warning of errors.

But I do note that you have never had a successful SMART long test run to completion.
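
That observation can be checked mechanically. A minimal sketch, assuming the self-test log layout shown in the smartctl output above; `count_completed_long_tests` is a hypothetical helper name.

```shell
# Read `smartctl -a` output on stdin and count extended (long) self-tests
# that finished with "Completed without error".
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing the count.
count_completed_long_tests() {
    grep -cE 'Extended offline +Completed without error' || true
}
```

With `smartctl -a /dev/sda | count_completed_long_tests`, the log posted above would give 0, since both extended entries were interrupted or aborted. A new long test can be started with `smartctl -t long /dev/sda`; the drive estimates 1010 minutes for it, so the system needs to stay up and undisturbed for that long.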

Fleshmauler · March 29, 2025, 5:47am

The drive itself looks fine to me, but it could benefit from fully completing a long test to confirm. No QNAP experience, so I dunno what the internals look like, or whether it uses port multipliers or other things that would be considered jank. Maybe the drive just needs a reseat?

I’d investigate this as a hardware fault personally, not as an hdd fault.

If you want to confirm beyond doubt and have a spare PC lying around, see if you can replicate the problem with the drives connected to it and a temporary TrueNAS boot. It’ll at least confirm what should be looked at.

sfstruenas · May 13, 2025, 10:23am

I have a QNAP TS-253D. QNAP/Intel have limited the J4125 CPU memory map to 8 GB of RAM. Yes, you can install more RAM, but going beyond 8 GB will lead to random errors. I went back to 8 GB and my random errors stopped. I use two Crucial 4 GB (non-ECC) sticks in the two memory slots. I have two ST16000NM001G-2KK103 Seagate 16 TB drives. Hope this helps.