Pool degraded after scrub, drive seems OK, can't find the cause

Hi, and thanks for reading another “pool degraded” topic. I have a 4x 4TB pool in raidz1 on the latest SCALE. One disk appears to be degraded after my scheduled scrub.

root@files[/home/admin]# zpool status -v pool4x4
  pool: pool4x4
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 04:21:03 with 0 errors on Sun Jun 23 13:34:30 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        pool4x4                                   DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            b2c1bf84-b8a0-11ed-a86b-107b44191d69  ONLINE       0     0     0
            b2cb2916-b8a0-11ed-a86b-107b44191d69  FAULTED      0    25     0  too many errors
            b2d40fb7-b8a0-11ed-a86b-107b44191d69  ONLINE       0     0     0
            b2e0084f-b8a0-11ed-a86b-107b44191d69  ONLINE       0     0     0

errors: No known data errors

As you can see, there are only write errors. I am no expert on smartctl, but it says PASSED.

SMART overall-health self-assessment test result: PASSED

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1699         -

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2          172  Command failed due to ICRC error
0x0002  2         1244  R_ERR response for data FIS
0x0003  2          177  R_ERR response for device-to-host data FIS
0x0004  2         1067  R_ERR response for host-to-device data FIS
0x0005  2         1822  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2         1822  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2         4584  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2         4606  Device-to-host register FISes sent due to a COMRESET
0x000b  2         1505  CRC errors within host-to-device FIS
0x000d  2         1384  Non-CRC errors within host-to-device FIS
0x000f  2         1025  R_ERR response for host-to-device data FIS, CRC
0x0012  2          480  R_ERR response for host-to-device non-data FIS, CRC

I assume the drive is OK. Any idea what the cause might be? HBA, cables, RAM, software? Or is it really the drive?

Could be a bad drive. You haven’t run a long test and short tests are almost useless.

And you didn’t include all the smart info…

But maybe it’s a cable issue, hence the CRC errors.

SATA cables go bad. Or sometimes they just need to be reseated.

The full SMART results may show UDMA CRC errors, which indicate a cable issue.
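
For anyone following along, here is a minimal sketch of how one might pull the CRC-related lines out of the SATA Phy Event Counters section of `smartctl -x` output. `phy_crc_counters` is a made-up helper name and `/dev/sdX` is a placeholder. It only filters text and makes no judgment about cable versus drive.

```shell
# Hypothetical helper: read `smartctl -x` output on stdin and print the
# non-zero SATA Phy Event Counters whose description mentions CRC.
phy_crc_counters() {
  awk '
    /^SATA Phy Event Counters/ { in_phy = 1; next }   # section header
    in_phy && /^$/             { in_phy = 0 }         # blank line ends section
    in_phy && $1 ~ /^0x/ && /CRC/ && $3 > 0 {
      # Description starts at the 4th field; rebuild it with single spaces.
      desc = $4
      for (i = 5; i <= NF; i++) desc = desc " " $i
      print $1, $3, desc
    }
  '
}

# Usage (placeholder device, needs root):
#   smartctl -x /dev/sdX | phy_crc_counters
```

Counters that keep growing on host-to-device CRC lines after a reseat or cable swap would be worth reporting back here.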

Run a SMART long test and post the result; then we can talk about facts. Also, please list your hardware in full.
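
A rough sketch of starting the long test and checking how far along it is, assuming the progress lines look like the "Self-test routine in progress" text that smartctl prints for this kind of drive. `selftest_remaining` is a hypothetical helper name and `/dev/sdX` is a placeholder.

```shell
# Hypothetical helper: read `smartctl -x` output on stdin and print the
# percentage of the running self-test that remains, or nothing if no
# test is in progress.
selftest_remaining() {
  awk '
    /Self-test routine in progress/ { running = 1; next }
    running && /% of test remaining/ {
      sub(/%.*/, "", $1)   # "70%" -> "70"
      print $1
      exit
    }
  '
}

# Usage (placeholder device, needs root):
#   smartctl -t long /dev/sdX                  # start the extended test
#   smartctl -x /dev/sdX | selftest_remaining  # poll progress later
```

The helper prints nothing once the test has finished, at which point the self-test log in the full output shows the result.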

OK, will do - see you in 8 hours. Thanks.

Make sure you post the entire output of smartctl -x /dev/sdx, not just what you think we should see. Often people will not post important data because they think what they posted was good enough.

You could actually do that now instead of waiting for the long test to complete. The data may point very obviously to a drive or cable issue.

Yes, I can do that of course.

root@files[/home/admin]# smartctl -x /dev/sdh  
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFZX-68AWUN0
Serial Number:    WD-WXB2DA17UAC3
LU WWN Device Id: 5 0014ee 214ee94f8
Firmware Version: 81.00B81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 23 17:03:57 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
Total time to complete Offline 
data collection:                (43440) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 462) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   222   220   021    -    3883
  4 Start_Stop_Count        -O--CK   100   100   000    -    277
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   098   098   000    -    1707
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    100
192 Power-Off_Retract_Count -O--CK   200   200   000    -    54
193 Load_Cycle_Count        -O--CK   200   200   000    -    499
194 Temperature_Celsius     -O---K   099   084   000    -    51
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    172
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      78  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1699         -

Selective Self-tests/Logging not supported

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        DST executing in background (3)
Current Temperature:                    51 Celsius
Power Cycle Min/Max Temperature:     48/53 Celsius
Lifetime    Min/Max Temperature:     19/66 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (471)

Index    Estimated Time   Temperature Celsius
 472    2024-06-23 09:06    52  *********************************
 ...    ..( 34 skipped).    ..  *********************************
  29    2024-06-23 09:41    52  *********************************
  30    2024-06-23 09:42    51  ********************************
 ...    ..( 12 skipped).    ..  ********************************
  43    2024-06-23 09:55    51  ********************************
  44    2024-06-23 09:56    35  ****************
  45    2024-06-23 09:57    36  *****************
 ...    ..(  2 skipped).    ..  *****************
  48    2024-06-23 10:00    36  *****************
  49    2024-06-23 10:01    37  ******************
 ...    ..(  2 skipped).    ..  ******************
  52    2024-06-23 10:04    37  ******************
  53    2024-06-23 10:05    36  *****************
  54    2024-06-23 10:06    36  *****************
  55    2024-06-23 10:07     ?  -
  56    2024-06-23 10:08    36  *****************
 ...    ..(  3 skipped).    ..  *****************
  60    2024-06-23 10:12    36  *****************
  61    2024-06-23 10:13    37  ******************
 ...    ..(  4 skipped).    ..  ******************
  66    2024-06-23 10:18    37  ******************
  67    2024-06-23 10:19    38  *******************
 ...    ..( 15 skipped).    ..  *******************
  83    2024-06-23 10:35    38  *******************
  84    2024-06-23 10:36    39  ********************
 ...    ..(  9 skipped).    ..  ********************
  94    2024-06-23 10:46    39  ********************
  95    2024-06-23 10:47    40  *********************
 ...    ..(  8 skipped).    ..  *********************
 104    2024-06-23 10:56    40  *********************
 105    2024-06-23 10:57    41  **********************
 ...    ..(  8 skipped).    ..  **********************
 114    2024-06-23 11:06    41  **********************
 115    2024-06-23 11:07    42  ***********************
 ...    ..( 11 skipped).    ..  ***********************
 127    2024-06-23 11:19    42  ***********************
 128    2024-06-23 11:20    43  ************************
 ...    ..( 10 skipped).    ..  ************************
 139    2024-06-23 11:31    43  ************************
 140    2024-06-23 11:32    44  *************************
 ...    ..( 10 skipped).    ..  *************************
 151    2024-06-23 11:43    44  *************************
 152    2024-06-23 11:44    45  **************************
 ...    ..(  3 skipped).    ..  **************************
 156    2024-06-23 11:48    45  **************************
 157    2024-06-23 11:49     ?  -
 158    2024-06-23 11:50    45  **************************
 159    2024-06-23 11:51    45  **************************
 160    2024-06-23 11:52    45  **************************
 161    2024-06-23 11:53    46  ***************************
 162    2024-06-23 11:54    46  ***************************
 163    2024-06-23 11:55    47  ****************************
 ...    ..(  4 skipped).    ..  ****************************
 168    2024-06-23 12:00    47  ****************************
 169    2024-06-23 12:01    48  *****************************
 ...    ..( 50 skipped).    ..  *****************************
 220    2024-06-23 12:52    48  *****************************
 221    2024-06-23 12:53     ?  -
 222    2024-06-23 12:54    48  *****************************
 223    2024-06-23 12:55    49  ******************************
 ...    ..(  2 skipped).    ..  ******************************
 226    2024-06-23 12:58    49  ******************************
 227    2024-06-23 12:59    50  *******************************
 ...    ..(  4 skipped).    ..  *******************************
 232    2024-06-23 13:04    50  *******************************
 233    2024-06-23 13:05    51  ********************************
 ...    ..( 11 skipped).    ..  ********************************
 245    2024-06-23 13:17    51  ********************************
 246    2024-06-23 13:18    50  *******************************
 ...    ..( 35 skipped).    ..  *******************************
 282    2024-06-23 13:54    50  *******************************
 283    2024-06-23 13:55    49  ******************************
 ...    ..( 38 skipped).    ..  ******************************
 322    2024-06-23 14:34    49  ******************************
 323    2024-06-23 14:35     ?  -
 324    2024-06-23 14:36    49  ******************************
 ...    ..(  2 skipped).    ..  ******************************
 327    2024-06-23 14:39    49  ******************************
 328    2024-06-23 14:40    50  *******************************
 329    2024-06-23 14:41    50  *******************************
 330    2024-06-23 14:42    51  ********************************
 ...    ..(  3 skipped).    ..  ********************************
 334    2024-06-23 14:46    51  ********************************
 335    2024-06-23 14:47    52  *********************************
 ...    ..(  7 skipped).    ..  *********************************
 343    2024-06-23 14:55    52  *********************************
 344    2024-06-23 14:56    51  ********************************
 ...    ..(  2 skipped).    ..  ********************************
 347    2024-06-23 14:59    51  ********************************
 348    2024-06-23 15:00    50  *******************************
 ...    ..(  2 skipped).    ..  *******************************
 351    2024-06-23 15:03    50  *******************************
 352    2024-06-23 15:04    49  ******************************
 353    2024-06-23 15:05    49  ******************************
 354    2024-06-23 15:06    49  ******************************
 355    2024-06-23 15:07     ?  -
 356    2024-06-23 15:08    49  ******************************
 ...    ..( 13 skipped).    ..  ******************************
 370    2024-06-23 15:22    49  ******************************
 371    2024-06-23 15:23    50  *******************************
 ...    ..( 11 skipped).    ..  *******************************
 383    2024-06-23 15:35    50  *******************************
 384    2024-06-23 15:36    49  ******************************
 ...    ..(  2 skipped).    ..  ******************************
 387    2024-06-23 15:39    49  ******************************
 388    2024-06-23 15:40     ?  -
 389    2024-06-23 15:41    49  ******************************
 390    2024-06-23 15:42    48  *****************************
 ...    ..(  4 skipped).    ..  *****************************
 395    2024-06-23 15:47    48  *****************************
 396    2024-06-23 15:48    49  ******************************
 ...    ..(  6 skipped).    ..  ******************************
 403    2024-06-23 15:55    49  ******************************
 404    2024-06-23 15:56     ?  -
 405    2024-06-23 15:57    49  ******************************
 406    2024-06-23 15:58    48  *****************************
 407    2024-06-23 15:59    49  ******************************
 408    2024-06-23 16:00    49  ******************************
 409    2024-06-23 16:01    50  *******************************
 410    2024-06-23 16:02    50  *******************************
 411    2024-06-23 16:03    50  *******************************
 412    2024-06-23 16:04    51  ********************************
 ...    ..(  2 skipped).    ..  ********************************
 415    2024-06-23 16:07    51  ********************************
 416    2024-06-23 16:08    52  *********************************
 ...    ..(  5 skipped).    ..  *********************************
 422    2024-06-23 16:14    52  *********************************
 423    2024-06-23 16:15    53  **********************************
 ...    ..( 36 skipped).    ..  **********************************
 460    2024-06-23 16:52    53  **********************************
 461    2024-06-23 16:53    52  *********************************
 ...    ..(  9 skipped).    ..  *********************************
 471    2024-06-23 17:03    52  *********************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             100  ---  Lifetime Power-On Resets
0x01  0x010  4            1707  ---  Power-on Hours
0x01  0x018  6     16020691145  ---  Logical Sectors Written
0x01  0x020  6        36263300  ---  Number of Write Commands
0x01  0x028  6     31882644424  ---  Logical Sectors Read
0x01  0x030  6        70878563  ---  Number of Read Commands
0x01  0x038  6      1850232704  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4             476  ---  Spindle Motor Power-on Hours
0x03  0x010  4             437  ---  Head Flying Hours
0x03  0x018  4             554  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              54  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               7  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              51  ---  Current Temperature
0x05  0x010  1              40  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              66  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              53  ---  Highest Average Short Term Temperature
0x05  0x038  1              28  ---  Lowest Average Short Term Temperature
0x05  0x040  1              34  ---  Highest Average Long Term Temperature
0x05  0x048  1              32  ---  Lowest Average Long Term Temperature
0x05  0x050  4              70  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            5997  ---  Number of Hardware Resets
0x06  0x010  4             188  ---  Number of ASR Events
0x06  0x018  4             172  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2          312  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2          312  R_ERR response for host-to-device data FIS
0x0005  2            5  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            5  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2          327  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2          339  Device-to-host register FISes sent due to a COMRESET
0x000b  2          306  CRC errors within host-to-device FIS
0x000d  2           11  Non-CRC errors within host-to-device FIS
0x000f  2          304  R_ERR response for host-to-device data FIS, CRC
0x0012  2            2  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4         7238  Vendor specific

As @Stux mentioned, you have CRC errors. This is most likely a data cable issue, so record the current value of 172 and, once the long test has completed, replace the data cable to the drive. Then make sure the CRC error count does not increase. If it continues to increase, the replacement cable could also be bad, but assuming the cable is good, the cause could be the drive electronics or the SATA controller. The data cable is the culprit we see most often.
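
One way to keep an eye on that counter, sketched under the assumption that attribute 199 appears in the usual attribute table with the raw value in the last column. `smart_raw` and `crc_increased` are hypothetical helper names, and `/dev/sdX` is a placeholder.

```shell
# Hypothetical helper: print the RAW_VALUE of one SMART attribute, by ID,
# from `smartctl -A` or `smartctl -x` output on stdin.
# The check on the second field skips temperature-history rows that
# happen to start with the same number.
smart_raw() {
  awk -v id="$1" '$1 == id && $2 ~ /^[A-Za-z]/ { print $NF; exit }'
}

# Hypothetical helper: succeed (exit 0) only if the CRC count read from
# stdin is higher than the recorded baseline passed as the first argument.
crc_increased() {
  current=$(smart_raw 199)
  [ -n "$current" ] && [ "$current" -gt "$1" ]
}

# Usage (placeholder device, baseline 172 as recorded above):
#   smartctl -A /dev/sdX | crc_increased 172 && echo "CRC count went up"
```

The raw count never resets, so only the change from the recorded baseline matters.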

The only other thing I see that is likely to be a problem is the drive temperature. You let it get to 66C, which is pretty damn hot for a hard drive.

Change the data cable, improve cooling performance.

Once you have changed out that data cable, run another scrub. If it comes back with no file errors, run zpool clear pool4x4 and that should fix you up for now.
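
The sequence above could look roughly like this. `pool_data_clean` is a hypothetical helper that only checks the `errors:` line of `zpool status` output; the zpool commands are left as comments because they need the real pool.

```shell
# Hypothetical helper: read `zpool status` output on stdin and succeed only
# if it reports no known data errors.
pool_data_clean() {
  grep -q '^errors: No known data errors'
}

# Sequence, run as root after the cable swap:
#   zpool scrub pool4x4          # start the scrub
#   zpool status -v pool4x4      # repeat until the scan line shows it finished
#   zpool status -v pool4x4 | pool_data_clean && zpool clear pool4x4
```

Clearing only resets the error counters and the FAULTED state; it is worth checking `zpool status` again afterwards to confirm the write errors do not return.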

As far as temps go, I actually know when this happened: the main fan died while the server was in a scheduled scrub. But this was quite some time ago. As you can see from the current temp logs, I’m around the 50 degree mark, most of the time well below that. I am actually much more concerned about the HBA (LSI 9201-16i), since I don’t know how to check its temperature.

Will wait for the test to complete and then change the cables. It’s one of those SFF-8087 to 4x SATA things, so I will be replacing cables for the entire pool. Thanks everyone.

I fret when my drives go over 36°C.

That’s the neat thing, you don’t! There is no probe there.

HBAs are designed for servers with airflow. In a “normal” PC or a lot of home servers they tend to overheat, which is bad. Initially that leads to weird errors with disks, and this is followed (eventually) by HBA failure.
{This process is not considered as being data-safe}
:grinning:

Put an extra fan on or over the HBA. Given the disk temps, it’s almost certain that the HBA is too hot.

Your drives are (as others have said) too hot. 50 degrees is 10+ degrees above the point at which I would turn the server off and try to improve the cooling.

Also your warranty is (technically, maybe) no longer valid. Your drive has reached 66 degrees, the operating temp of those drives is 0 to 65. It gives WD an excuse to refuse warranty anyway.
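
To check a drive against that limit, a small sketch that reads the lifetime maximum from the SCT status section of `smartctl -x` output. `lifetime_max_temp` and `over_rated_max` are hypothetical helper names, `/dev/sdX` is a placeholder, and the 65 in the usage line is the rated maximum quoted above.

```shell
# Hypothetical helper: print the lifetime maximum temperature from the
# "Lifetime Min/Max Temperature" line of `smartctl -x` output on stdin.
lifetime_max_temp() {
  awk '/^Lifetime +Min\/Max Temperature:/ {
    split($4, range, "/")   # e.g. "19/66" -> range[1]=19, range[2]=66
    print range[2]
    exit
  }'
}

# Hypothetical helper: succeed (exit 0) if the lifetime maximum is above
# the rated maximum given in Celsius as the first argument.
over_rated_max() {
  max=$(lifetime_max_temp)
  [ -n "$max" ] && [ "$max" -gt "$1" ]
}

# Usage (placeholder device):
#   smartctl -x /dev/sdX | over_rated_max 65 && echo "drive has exceeded 65C"
```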

Ever since I nearly burned my fingers on my first HBA, I have only run these things with an extra fan. Still, it would be nice to have a temperature probe on the card.

My HDDs are in a Startech 4 drive hotswap bay. Pro: easy access. Con: it’s a hotbox. If you can recommend a better hotswap cage, let me know. Temps can always be better.

Those should have fans on the rear.

They do. Not very effective, it seems. When I open the server to change cables, I will have a look at the fan and maybe swap it.

Key word here is “Hot”.

But seriously, as said, these come with built-in fans, and you have four hard drives spinning at 5400 RPM (assuming they are all the same model), and as said above, they are too hot. Mine run 39C to 43C depending on how warm the upstairs gets, and they are 7200 RPM. I turn off the air conditioning when no one occupies that space (visitors), so it gets a bit warm.

Airflow: take a look at your case fans. You should be sucking air in through the hard drive bay and out the rear of the case.

Since you have an HBA that generates a lot of heat, you should have a designed airflow path for that too, again, out the rear of the case.

Your case fans should be moving air in the correct direction and not competing against each other.

If you have a case with a bunch of holes (I have a perforated case for one NAS, looks nice) then you really need to design this correctly. Perforated cases are horrible, by the way, and normally I would not have one, but my new NAS has no spinning rust. Heat generation is significantly lower, thankfully. I still needed to add a case fan. I could have left the case unmodified, but that would have made the fan noisy and inefficient.

If all your case fans are trying to push air into the case, the fans on the hotswap cage will not work to move air and hot air would only go out the power supply (normal setup).

I didn’t mention airflow across any other heat-generating items; keep those in mind as well.

I’m one of those people who will modify a case if I need to in order to improve airflow. Cutting holes, no problem. If you need advice, send some photos of the case (or just the website for the case) and a photo of the inside all cabled up. If you do this kind of thing, you must remove everything from the case first; never cut metal with the electronics still inside.

The long SMART check finally finished, also PASSED.

root@files[/home/admin]# smartctl -x /dev/sdh
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFZX-68AWUN0
Serial Number:    WD-WXB2DA17UAC3
LU WWN Device Id: 5 0014ee 214ee94f8
Firmware Version: 81.00B81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jun 23 23:10:14 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (43440) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 462) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   222   220   021    -    3883
  4 Start_Stop_Count        -O--CK   100   100   000    -    277
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   098   098   000    -    1713
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    100
192 Power-Off_Retract_Count -O--CK   200   200   000    -    54
193 Load_Cycle_Count        -O--CK   200   200   000    -    501
194 Temperature_Celsius     -O---K   109   084   000    -    41
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    172
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      78  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1712         -
# 2  Short offline       Completed without error       00%      1699         -

Selective Self-tests/Logging not supported

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Stand-by (1)
Current Temperature:                    41 Celsius
Power Cycle Min/Max Temperature:     41/53 Celsius
Lifetime    Min/Max Temperature:     19/66 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (359)

Index    Estimated Time   Temperature Celsius
 360    2024-06-23 15:13    47  ****************************
 ...    ..( 14 skipped).    ..  ****************************
 375    2024-06-23 15:28    47  ****************************
 376    2024-06-23 15:29    48  *****************************
 377    2024-06-23 15:30    48  *****************************
 378    2024-06-23 15:31    48  *****************************
 379    2024-06-23 15:32    49  ******************************
 380    2024-06-23 15:33    49  ******************************
 381    2024-06-23 15:34    48  *****************************
 ...    ..(  2 skipped).    ..  *****************************
 384    2024-06-23 15:37    48  *****************************
 385    2024-06-23 15:38    47  ****************************
 ...    ..(  2 skipped).    ..  ****************************
 388    2024-06-23 15:41    47  ****************************
 389    2024-06-23 15:42    46  ***************************
 390    2024-06-23 15:43    46  ***************************
 391    2024-06-23 15:44    46  ***************************
 392    2024-06-23 15:45    45  **************************
 393    2024-06-23 15:46    45  **************************
 394    2024-06-23 15:47    45  **************************
 395    2024-06-23 15:48    44  *************************
 396    2024-06-23 15:49    44  *************************
 397    2024-06-23 15:50    44  *************************
 398    2024-06-23 15:51    43  ************************
 399    2024-06-23 15:52    43  ************************
 400    2024-06-23 15:53    43  ************************
 401    2024-06-23 15:54    42  ***********************
 ...    ..(  2 skipped).    ..  ***********************
 404    2024-06-23 15:57    42  ***********************
 405    2024-06-23 15:58    41  **********************
 ...    ..(  3 skipped).    ..  **********************
 409    2024-06-23 16:02    41  **********************
 410    2024-06-23 16:03    50  *******************************
 411    2024-06-23 16:04    50  *******************************
 412    2024-06-23 16:05    51  ********************************
 ...    ..(  2 skipped).    ..  ********************************
 415    2024-06-23 16:08    51  ********************************
 416    2024-06-23 16:09    52  *********************************
 ...    ..(  5 skipped).    ..  *********************************
 422    2024-06-23 16:15    52  *********************************
 423    2024-06-23 16:16    53  **********************************
 ...    ..( 36 skipped).    ..  **********************************
 460    2024-06-23 16:53    53  **********************************
 461    2024-06-23 16:54    52  *********************************
 ...    ..( 45 skipped).    ..  *********************************
  29    2024-06-23 17:40    52  *********************************
  30    2024-06-23 17:41    51  ********************************
 ...    ..( 28 skipped).    ..  ********************************
  59    2024-06-23 18:10    51  ********************************
  60    2024-06-23 18:11    50  *******************************
 ...    ..( 13 skipped).    ..  *******************************
  74    2024-06-23 18:25    50  *******************************
  75    2024-06-23 18:26    51  ********************************
 ...    ..( 11 skipped).    ..  ********************************
  87    2024-06-23 18:38    51  ********************************
  88    2024-06-23 18:39    50  *******************************
 ...    ..( 38 skipped).    ..  *******************************
 127    2024-06-23 19:18    50  *******************************
 128    2024-06-23 19:19    49  ******************************
 ...    ..( 13 skipped).    ..  ******************************
 142    2024-06-23 19:33    49  ******************************
 143    2024-06-23 19:34    50  *******************************
 ...    ..( 45 skipped).    ..  *******************************
 189    2024-06-23 20:20    50  *******************************
 190    2024-06-23 20:21    49  ******************************
 ...    ..( 40 skipped).    ..  ******************************
 231    2024-06-23 21:02    49  ******************************
 232    2024-06-23 21:03    48  *****************************
 ...    ..( 49 skipped).    ..  *****************************
 282    2024-06-23 21:53    48  *****************************
 283    2024-06-23 21:54    47  ****************************
 ...    ..( 75 skipped).    ..  ****************************
 359    2024-06-23 23:10    47  ****************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             100  ---  Lifetime Power-On Resets
0x01  0x010  4            1713  ---  Power-on Hours
0x01  0x018  6     16020691161  ---  Logical Sectors Written
0x01  0x020  6        36263300  ---  Number of Write Commands
0x01  0x028  6     31882649038  ---  Logical Sectors Read
0x01  0x030  6        70878583  ---  Number of Read Commands
0x01  0x038  6      1871832704  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4             482  ---  Spindle Motor Power-on Hours
0x03  0x010  4             443  ---  Head Flying Hours
0x03  0x018  4             556  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              54  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               7  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              40  ---  Current Temperature
0x05  0x010  1              44  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              66  ---  Highest Temperature
0x05  0x028  1              23  ---  Lowest Temperature
0x05  0x030  1              53  ---  Highest Average Short Term Temperature
0x05  0x038  1              28  ---  Lowest Average Short Term Temperature
0x05  0x040  1              34  ---  Highest Average Long Term Temperature
0x05  0x048  1              32  ---  Lowest Average Long Term Temperature
0x05  0x050  4              70  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            5997  ---  Number of Hardware Resets
0x06  0x010  4             188  ---  Number of ASR Events
0x06  0x018  4             172  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2          312  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2          312  R_ERR response for host-to-device data FIS
0x0005  2            5  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            5  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2          327  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2          339  Device-to-host register FISes sent due to a COMRESET
0x000b  2          306  CRC errors within host-to-device FIS
0x000d  2           11  Non-CRC errors within host-to-device FIS
0x000f  2          304  R_ERR response for host-to-device data FIS, CRC
0x0012  2            2  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        29197  Vendor specific

It’s very late here, will be changing cables tomorrow and looking at fans. If in the meantime anything pops out from the log, please let me know.

Make sure the fans are still working on the StarTech. And if it has any fan speed control switches, make sure they’re set to full.

As others have said, if you don’t have a fan aimed at the HBA’s heatsink, it’s probably not cooled properly.

I think you can determine its temperature with an LSI util.

But it does seem that the issue exists somewhere between the HBA and the hard drive’s controller board. Possibly the cable, or possibly the HBA just needs better cooling.
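
A rough sketch of how to pull only the non-zero link counters out of the Phy Event Counters table, so the CRC and R_ERR lines stand out. The function name `nonzero_phy_counters` is made up for this example, and the sample lines are copied from the log above. On a live system you could feed it the output of `smartctl -l sataphy /dev/sdX`, where `/dev/sdX` is a placeholder for the affected disk.

```shell
#!/bin/sh
# Print only Phy Event Counter rows whose value (3rd column) is non-zero.
# Matching on a 4-hex-digit ID keeps other smartctl tables out of the result.
nonzero_phy_counters() {
    awk '$1 ~ /^0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f]$/ && $3 + 0 > 0'
}

# Sample rows taken from the log posted in this thread.
nonzero_phy_counters <<'EOF'
0x0001  2            0  Command failed due to ICRC error
0x0002  2          312  R_ERR response for data FIS
0x0009  2          327  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2          306  CRC errors within host-to-device FIS
EOF
```

Counters that keep climbing after a cable swap would point away from the cable and towards the backplane or the HBA.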

Also, I’m concerned about the number of load cycles/power-off retracts in the drive’s short life.

Once you change cables you can zpool clear and zpool scrub again. Are you perhaps spinning down your drives? What is your ambient temperature?
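
A minimal sketch of that sequence, using the pool name `pool4x4` from the first post. The `DRY_RUN` switch and the `run`/`recheck_pool` helpers are made up for this example: by default it only prints the commands, and you would set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Sketch only: POOL is the pool name from the zpool status output in this thread.
# DRY_RUN is a hypothetical safety switch; 1 = print commands, 0 = run them.
POOL="${POOL:-pool4x4}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recheck_pool() {
    run zpool clear "$1"       # reset the error counters and un-fault the device
    run zpool scrub "$1"       # start a fresh scrub
    run zpool status -v "$1"   # check the READ/WRITE/CKSUM columns again
}

recheck_pool "$POOL"
```

If the WRITE column stays at 0 after the scrub finishes, the new cable most likely did the trick.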

A hot (swap bay), not a (hot swap) bay :smile:

  • switched the cable
  • changed the hotswap fan and plugged it directly into the motherboard instead of the header on the Startech
  • reseated all the drives

No more errors. The alerts say that the drive is currently being resilvered. If anything else pops up, I’ll let you know, but looks good so far. Thanks to everyone.

My room temps range from 20C in the winter to 35C in the summer. I am currently sitting at 26C. The new fan made the temperature drop by … a full 3 degrees. I added some photos below, so you can see (from top to bottom) the perforated front, the swap bay fan, and the HBA fan. My CPU temps are below 40C btw. I really think there is only so much you can do with this “hot hot” swap bay. BTW I noticed that the temps reported by smartctl are around 15 degrees higher than what I see on the webUI reportsdashboard/disk (~32C vs 48C); maybe someone should have a look at that.

To explain the power cycles: This pool with the 4 mechanical drives is used only 2-3 times per week: to make a backup of the SSD pool, for a scrub + snapshot, and to back up its contents to another server. For the rest of the week, I am spinning all those disks down. I know most of you would not recommend this for longevity reasons, but I have to do this for the noise or I will go insane. Besides, twice a week means 520 cycles over the expected 5-year lifespan, at which point I will hopefully be able to go full solid state.
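
For anyone checking the math, the estimate is just sessions per week × 52 weeks × years. A quick sketch, assuming one spin-up per session (the `projected_cycles` helper is made up for this example):

```shell
#!/bin/sh
# $1 = spin-up sessions per week, $2 = years of service
projected_cycles() {
    echo $(( $1 * 52 * $2 ))
}

projected_cycles 2 5   # prints 520
projected_cycles 3 5   # prints 780
```

So the 2-3 sessions per week work out to somewhere between 520 and 780 cycles over 5 years.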


