Help understanding S.M.A.R.T. log

One of my pools is suddenly degraded due to a faulty drive. I checked the S.M.A.R.T. values with smartctl -x /dev/sdf but need some help. It would be great if somebody could give a short yes/no/why answer to my thoughts and tell me what state my drive is in. I know the SMART values in general, but this output gives me a headache. I know it’s a very old drive (the last one of my first set), so replacing it is okay.

root@asgard[/mnt/folkwang/USERS/odin]# zpool status -v
  pool: folkwang
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Thu Jul 17 06:24:45 2025
        29.7T / 29.7T scanned, 21.6T / 29.7T issued at 1.15G/s
        16.5M repaired, 72.68% done, 02:00:51 to go
config:
        NAME                                      STATE     READ WRITE CKSUM
        folkwang                                  DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            839e92ba-203b-4c54-b720-867236b72d72  ONLINE       0     0     0
            844f478d-1f10-4379-afa5-9357b85a0c06  ONLINE       0     0     0
            b889d54e-8f56-4489-ae05-6c0c60db7d67  ONLINE       0     0     0
            2d00b6d5-b64f-46da-8e3a-c7431d98a68c  ONLINE       0     0     0
            d8819c42-41ee-4bef-aee6-b74cc7a21bd1  FAULTED    787     0     0  too many errors
            5ce42eae-c01a-4959-95c9-b4ef6f39aee2  ONLINE       0     0     0
            28083615-7136-4551-8600-cff938193015  ONLINE       0     0     0
            6e4853d7-fec5-4ba5-9bf6-36f78f9f5494  ONLINE       0     0     0
errors: No known data errors

Faulted with 787 read errors > too many errors. Where is the threshold for TrueNAS to set the “too many” flag? And is my RAIDZ2 pool currently still safe, with one more disk of parity remaining (I guess “faulted” means the disk is out of the pool)?
Can TrueNAS repair that, or do I need a new disk and a resilver?
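
From the action line in the status output, I gather that if the cause turns out to be something other than the disk itself (a loose cable, say), recovery would be roughly this once the cause is fixed (just a sketch, using the vdev GUID from the output above):

zpool clear folkwang d8819c42-41ee-4bef-aee6-b74cc7a21bd1
zpool status -v folkwang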

Anything useful from here on besides general disk information?

root@asgard[/mnt/folkwang/USERS/odin]# smartctl -x /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX31DA5LHRFN
LU WWN Device Id: 5 0014ee 20cf518f7
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 17 13:09:40 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 1544) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 669) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

This is the interesting part, but I need some explanations.
In general: if a VALUE drops to or below its THRESH, my disk is dying.
Regarding the pool status with 787 read errors, I would assume that ID 1 (Raw_Read_Error_Rate) with a raw value of 36 is the cause. The classics #5, #196, #197, #198, #200 are all zero.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    36
  3 Spin_Up_Time            POS--K   198   198   021    -    9058
  4 Start_Stop_Count        -O--CK   099   099   000    -    1900
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   001   001   000    -    75556
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    271
192 Power-Off_Retract_Count -O--CK   200   200   000    -    48
193 Load_Cycle_Count        -O--CK   195   195   000    -    15905
194 Temperature_Celsius     -O---K   112   108   000    -    40
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
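
As a quick sanity check of that rule, this one-liner filters for any attribute whose normalized VALUE has already reached its THRESH (a sketch assuming the standard eight-column smartctl -A layout shown above; it prints nothing for this drive, so no attribute has tripped):

smartctl -A /dev/sdf | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0'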

Can I find something interesting here?

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      40  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 54 (device log contains only the most recent 24 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 54 [5] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 71 c0 40 00  Error: WP at LBA = 0x2b74171c0 = 11664454080
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 60 00 00 00 02 b7 40 f8 98 40 00  6d+12:16:05.866  WRITE FPDMA QUEUED
  60 00 10 00 18 00 02 ba 60 f6 10 40 00  6d+12:16:05.865  READ FPDMA QUEUED
  60 00 10 00 10 00 02 ba 60 f4 10 40 00  6d+12:16:05.865  READ FPDMA QUEUED
  60 00 10 00 00 00 00 00 00 0a 10 40 00  6d+12:16:05.865  READ FPDMA QUEUED
  60 00 b8 00 08 00 02 b7 41 71 b0 40 00  6d+12:16:05.865  READ FPDMA QUEUED
Error 53 [4] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 6a b8 40 00  Error: WP at LBA = 0x2b7416ab8 = 11664452280
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 60 00 08 00 02 b7 40 f5 c8 40 00  6d+12:15:58.878  WRITE FPDMA QUEUED
  60 07 e8 00 00 00 02 b7 41 69 c8 40 00  6d+12:15:58.877  READ FPDMA QUEUED
  61 00 10 00 00 00 02 ba 60 f6 10 40 00  6d+12:15:58.877  WRITE FPDMA QUEUED
  61 01 60 00 00 00 02 b7 40 f4 60 40 00  6d+12:15:58.876  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:58.875  READ LOG EXT
Error 52 [3] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 64 20 40 00  Error: WP at LBA = 0x2b7416420 = 11664450592
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 60 00 08 00 02 b7 40 f4 60 40 00  6d+12:15:51.890  WRITE FPDMA QUEUED
  60 07 e8 00 00 00 02 b7 41 61 e0 40 00  6d+12:15:51.889  READ FPDMA QUEUED
  61 00 10 00 00 00 02 ba 60 f4 10 40 00  6d+12:15:51.889  WRITE FPDMA QUEUED
  61 01 60 00 00 00 02 b7 40 f2 f8 40 00  6d+12:15:51.888  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:51.888  READ LOG EXT
Error 51 [2] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 5d 68 40 00  Error: UNC at LBA = 0x2b7415d68 = 11664448872
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e8 00 00 00 02 b7 41 59 f8 40 00  6d+12:15:44.902  READ FPDMA QUEUED
  61 00 10 00 00 00 00 00 00 0a 10 40 00  6d+12:15:44.902  WRITE FPDMA QUEUED
  61 01 60 00 00 00 02 b7 40 f1 90 40 00  6d+12:15:44.901  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:44.900  READ LOG EXT
  61 01 60 00 00 00 02 b7 40 f1 90 40 00  6d+12:15:40.845  WRITE FPDMA QUEUED
Error 50 [1] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 52 10 40 00  Error: WP at LBA = 0x2b7415210 = 11664445968
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 60 00 00 00 02 b7 40 f1 90 40 00  6d+12:15:40.845  WRITE FPDMA QUEUED
  60 07 e8 00 08 00 02 b7 41 52 10 40 00  6d+12:15:40.844  READ FPDMA QUEUED
  61 01 60 00 00 00 02 b7 40 f0 28 40 00  6d+12:15:40.844  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:40.843  READ LOG EXT
  61 01 60 00 08 00 02 b7 40 f0 28 40 00  6d+12:15:33.859  WRITE FPDMA QUEUED
Error 49 [0] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 4a 48 40 00  Error: WP at LBA = 0x2b7414a48 = 11664443976
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 60 00 08 00 02 b7 40 f0 28 40 00  6d+12:15:33.859  WRITE FPDMA QUEUED
  60 00 10 00 18 00 02 ba 60 f6 10 40 00  6d+12:15:33.858  READ FPDMA QUEUED
  60 00 10 00 10 00 02 ba 60 f4 10 40 00  6d+12:15:33.858  READ FPDMA QUEUED
  60 00 10 00 08 00 00 00 00 0a 10 40 00  6d+12:15:33.858  READ FPDMA QUEUED
  60 07 e8 00 00 00 02 b7 41 4a 28 40 00  6d+12:15:33.858  READ FPDMA QUEUED
Error 48 [23] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 43 08 40 00  Error: UNC at LBA = 0x2b7414308 = 11664442120
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e8 00 08 00 02 b7 41 42 38 40 00  6d+12:15:26.871  READ FPDMA QUEUED
  61 00 10 00 00 00 02 ba 60 f6 10 40 00  6d+12:15:26.871  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:26.870  READ LOG EXT
  61 00 10 00 08 00 02 ba 60 f6 10 40 00  6d+12:15:19.885  WRITE FPDMA QUEUED
  60 07 e8 00 00 00 02 b7 41 3a 50 40 00  6d+12:15:19.884  READ FPDMA QUEUED
Error 47 [22] occurred at disk power-on lifetime: 10018 hours (417 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 b7 41 3c d0 40 00  Error: WP at LBA = 0x2b7413cd0 = 11664440528
  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 08 00 02 ba 60 f6 10 40 00  6d+12:15:19.885  WRITE FPDMA QUEUED
  60 07 e8 00 00 00 02 b7 41 3a 50 40 00  6d+12:15:19.884  READ FPDMA QUEUED
  61 01 60 00 00 00 02 b7 40 ed 58 40 00  6d+12:15:19.883  WRITE FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  6d+12:15:19.883  READ LOG EXT
  60 07 e8 00 08 00 02 b7 41 32 68 40 00  6d+12:15:15.382  READ FPDMA QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9440         -
# 2  Short offline       Completed without error       00%      8697         -
# 3  Short offline       Completed without error       00%      7978         -
# 4  Short offline       Completed without error       00%      7235         -
# 5  Short offline       Completed without error       00%      6565         -
# 6  Short offline       Completed without error       00%      5822         -
# 7  Short offline       Completed without error       00%      5078         -
# 8  Short offline       Completed without error       00%      4359         -
# 9  Short offline       Completed without error       00%      3616         -
#10  Short offline       Completed without error       00%      2896         -
#11  Short offline       Completed without error       00%      2153         -
#12  Short offline       Completed without error       00%      1410         -
#13  Short offline       Completed without error       00%       691         -
#14  Short offline       Completed without error       00%     65483         -
#15  Short offline       Completed without error       00%     64764         -
#16  Short offline       Completed without error       00%     64022         -
#17  Short offline       Completed without error       00%     63327         -
#18  Short offline       Completed without error       00%     62584         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    40 Celsius
Power Cycle Min/Max Temperature:     37/44 Celsius
Lifetime    Min/Max Temperature:      2/44 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (338)
Index    Estimated Time   Temperature Celsius
339    2025-07-17 05:12    42  ***********************
...    ..(  3 skipped).    ..  ***********************
343    2025-07-17 05:16    42  ***********************
344    2025-07-17 05:17    41  **********************
...    ..( 28 skipped).    ..  **********************
373    2025-07-17 05:46    41  **********************
374    2025-07-17 05:47    40  *********************
...    ..( 21 skipped).    ..  *********************
396    2025-07-17 06:09    40  *********************
397    2025-07-17 06:10    41  **********************
...    ..( 12 skipped).    ..  **********************
410    2025-07-17 06:23    41  **********************
411    2025-07-17 06:24    40  *********************
...    ..( 30 skipped).    ..  *********************
442    2025-07-17 06:55    40  *********************
443    2025-07-17 06:56    38  *******************
...    ..( 57 skipped).    ..  *******************
  23    2025-07-17 07:54    38  *******************
  24    2025-07-17 07:55    39  ********************
...    ..( 21 skipped).    ..  ********************
  46    2025-07-17 08:17    39  ********************
  47    2025-07-17 08:18    40  *********************
...    ..(  5 skipped).    ..  *********************
  53    2025-07-17 08:24    40  *********************
  54    2025-07-17 08:25    41  **********************
...    ..( 11 skipped).    ..  **********************
  66    2025-07-17 08:37    41  **********************
  67    2025-07-17 08:38    42  ***********************
...    ..( 37 skipped).    ..  ***********************
105    2025-07-17 09:16    42  ***********************
106    2025-07-17 09:17    43  ************************
...    ..( 81 skipped).    ..  ************************
188    2025-07-17 10:39    43  ************************
189    2025-07-17 10:40    44  *************************
...    ..( 74 skipped).    ..  *************************
264    2025-07-17 11:55    44  *************************
265    2025-07-17 11:56    43  ************************
...    ..( 61 skipped).    ..  ************************
327    2025-07-17 12:58    43  ************************
328    2025-07-17 12:59    42  ***********************
...    ..(  9 skipped).    ..  ***********************
338    2025-07-17 13:09    42  ***********************
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2          297  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4       569931  Vendor specific

Best next steps? directly replacing the disk?

In this state, your pool can handle the failure of one more disk, but if two more fail, the pool is lost. First advice: check and refresh your backups while troubleshooting.
Then, IMHO, in your place:

  • I would shut down the server and reseat or replace the SATA cable on this disk before powering it back on.
  • I would launch a SMART long test on that disk (see the commands after this list), and check your SMART schedule, because it seems you are only performing short tests (every month?). This should confirm whether the disk is faulty or not.
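
A minimal sketch of that second step on the command line, assuming the drive still shows up as /dev/sdf after the reboot:

smartctl -t long /dev/sdf       # start the extended self-test; it runs inside the drive itself
smartctl -l selftest /dev/sdf   # check the outcome later in the self-test log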

Backups are done to external drives on schedules.
I set up the NAS a few days ago and I’m currently setting up all the tasks (long and short SMART tests, scrubs, snapshots). I copied all my data and started the scrub out of schedule, just to be sure.

The question is: is the disk definitely dying, or could it be a cable or something else?

In my limited experience I don’t see anything obvious that points to a failure, so it is worth checking the cables.

It would also be useful if you shared your complete hardware specs, especially whether this disk is connected directly to the mainboard, whether an HBA is involved, or, worst of all, whether it sits behind a cheap SATA port multiplier.

Why on earth did you break the smartctl output up into three blocks? That, to say the least, doesn’t make it easy to follow what’s going on. But one thing that stands out to me is that this drive has never had a long SMART self-test (at least not in a very long time). Those should be scheduled periodically, just like the short ones.

Here are my specs:

  • Mainboard: ASRock B850M Pro-A
  • CPU: Ryzen 5 7600
  • RAM: 4x 32 GB DDR5 5200 at 3600 non-ECC
  • HBA: Broadcom 9500-8i
  • PSU: bequiet PurePower 13M 750W
  • HDD: 8x 6TB WD Red Plus in RAIDZ2 (WD60EFRX, EFZX and EFPX)
  • Boot: 2x 250GB WD Red SN700 (mirrored)
  • Backup-Pool: 2x 10TB HGST mirrored
  • Case is a Silverstone CS382 with 2 x4-backplanes

Here are the WD Reds:

  1. nvme1n1 WD Red SN700 250GB 232.89 GiB
  2. nvme0n1 WD Red SN700 250GB 232.89 GiB
  3. sda WDC_WD60EFRX-68L0BN1 5.46 TiB
  4. sdb WDC_WD60EFZX-68B3FN0 5.46 TiB
  5. sdc WDC_WD60EFPX-68C5ZN0 5.46 TiB
  6. sdd WDC_WD60EFPX-68C5ZN0 5.46 TiB
  7. sde WDC_WD60EFRX-68L0BN1 5.46 TiB
  8. sdf WDC_WD60EFRX-68L0BN1 5.46 TiB < faulty one
  9. sdg WDC_WD60EFPX-68C5ZN0 5.46 TiB
  10. sdh WDC_WD60EFZX-68B3FN0 5.46 TiB

Look in my signature for the Drive Troubleshooting Flowcharts. They should help you quite a bit. I need to look at what you posted to see if anything is obviously wrong.

I agree with @dan: you need to run a SMART long test. I suspect your drive is failing, given the more than 75,500 power-on hours. I suspect a long test will fail.

The flowcharts would tell you the same thing.

Edit: I recommend daily short tests and weekly long tests. Your drives are relatively small, so a long test would take about 11 hours and 20 minutes given no problems. At a minimum, run monthly long tests.

Short tests are scheduled weekly, and long tests every two weeks now. But I will change that. And I will replace the disk and run a long test on it offline.

Short tests typically take 2 minutes to run, so it makes sense to run them daily. Long tests once every two weeks are fine if you plan to space them out, one per day for example.

Two questions:

  1. Can (or should) I run the long test on all drives simultaneously? Is there a performance issue for the system (shares and write/read access)?
  2. I’m searching for a new drive and want to upgrade my pool to 12 or 18 TB per disk. How does the Seagate EXOS perform in TrueNAS? I’ve been a WD Red fanboy for years, and right now I can choose between an 18 TB WD Red Pro for 460€ per disk or a Seagate EXOS for 260€ per disk. So I can get almost two EXOS for one Red Pro.

Have you ever run a long test, though? There are more than two weeks’ worth of short tests in that log (if they run once daily) and not a single long test to be seen.

I suspect there’s a small performance penalty, but I still recommend running it on all drives in the same vdev at once. You may want to avoid doing it while also running a scrub, though.
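
A minimal sketch of kicking them all off at once, assuming the eight data drives are still /dev/sda through /dev/sdh as listed above:

for d in /dev/sd{a..h}; do
  smartctl -t long "$d"   # each test runs inside the drive, so the loop returns immediately
done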

So it was one of my first drives ever. I have no clue why no long SMART test is shown, because my old system was running one monthly. I’m running one right now (80% pending).
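
I’m watching the progress like this; the percentage remaining shows up under “Self-test execution status”:

smartctl -c /dev/sdf            # shows percent remaining while a test runs
smartctl -l selftest /dev/sdf   # the finished result lands in the self-test log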

There are a few factors to consider:

  1. Do your drives remain cool when scrubbing the pool? If yes, then no temperature issue to running a SMART Long test on all at once.
  2. Do you have a very active system, reading/writing in an office/corporate environment? If yes, then I’d space them out, one or two per day. However, a SMART test always runs at the lowest priority, so if your system needs to read/write data, the SMART test pauses until there is free drive time. This could add a few minutes to your overall long test time.

With your drives taking about 11.5 hours for a long test (without errors), I would recommend one or two a day, staggered as in the example below. There is no need to run them all at once, but if you have no cooling issues, it comes down to personal preference.
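
A sketch of what that staggering could look like as plain cron entries (on TrueNAS you would normally configure this through the S.M.A.R.T. test tasks in the UI instead; the device names here are illustrative):

# two long tests per night at 02:00, spread across the week
0 2 * * 1 smartctl -t long /dev/sda; smartctl -t long /dev/sdb
0 2 * * 2 smartctl -t long /dev/sdc; smartctl -t long /dev/sdd
0 2 * * 3 smartctl -t long /dev/sde; smartctl -t long /dev/sdf
0 2 * * 4 smartctl -t long /dev/sdg; smartctl -t long /dev/sdh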

Let me point out that the one drive you provided data on was running at 40C, and its highest lifetime value was 44C. This is not bad. But you should examine all your drives. Just because a drive “can” be operated at 50C without voiding the warranty doesn’t mean it is good to run it there all the time. You will find that we prefer below 40C; my personal limit is below 45C. I’m fine if it peaks at 44C, but at 45C I start to question whether I have a cooling issue (bad fan).

That was a bit more than what you asked for.

Running long tests split over several days… that will be a heck of a task, editing the scrub and short-test schedules so that they don’t interfere with the long tests.

And the temps: yeah, my old system was a QNAP and the cooling was not so great. But since my TrueNAS tower has been running, the temps are at 32-39 C. Only one drive is currently at 42 C, but that is the lowest one, where the power supply sits behind it with not much space. My fans run at a constant 80%. Maybe I’ll set them to 100% during the next reboot.

A scrub is launched by default once a month, on Sunday morning. Schedule your long tests around that; with only a few drives it will be easy.

But if you want extra easy, you can run a script called Multi-Report (in the resources and linked in my signature), which provides a nice daily report, backs up your TrueNAS configuration file every Monday, and schedules all your testing for you; and well, that is about it. Quite a few people use it. Maybe do a little research on it, and if it looks like something you want, make sure you read the Quick Start Guide.

If you want just drive test scheduling then all you would need is the Drive-Selftest script.

Let me toss this in here as well: TrueNAS currently does not test NVMe drives. They can be scheduled, but no test actually runs. My little script takes care of that; it will run SMART tests on NVMe drives. I hope TrueNAS starts testing NVMe drives in the next major release.

Is it possible that the SMART results got cleared, or that the errors can vanish? The long test gave a Multi_Zone_Error_Rate (ID 200) error, and today it’s gone.
Either way, I’m replacing the drive.

Yes, it is a “rate” value, and these go up and down. This alone usually does not signify a failure; however, when coupled with other items, it can indicate pending doom. For example, you have ID 1, Raw_Read_Error_Rate, on a WD drive. This is normally a “0” value, unlike on a Seagate drive. Because you have both ID 1 and ID 200 indications, I too would replace the drive before it fails completely. With over 75,000 hours on it, it is probably time.
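
For the replacement itself, a rough sketch of the command-line path, assuming the new disk comes up as /dev/sdX (hypothetical name; on TrueNAS you would normally do this through the UI’s disk Replace dialog, which also handles partitioning):

zpool offline folkwang d8819c42-41ee-4bef-aee6-b74cc7a21bd1
zpool replace folkwang d8819c42-41ee-4bef-aee6-b74cc7a21bd1 /dev/sdX
zpool status -v folkwang    # watch the resilver progress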

lol… TrueNAS switched the drive names. I checked /dev/sdf and now it has become /dev/sdg, and vice versa. On top of that: sdg has the same age / running hours as sdf but not a single issue. Is that common behavior for TrueNAS?

Yes. This is not actually TrueNAS but the underlying OS, which can reshuffle drive letters/numbers at reboot. Always track drives by serial number.
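
A quick way to map the current device letters to serials:

lsblk -o NAME,MODEL,SERIAL
smartctl -i /dev/sdf | grep -i serial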
