Strange smartctl long test fail and ZFS mirror correlation

I have a system (full details in my sig) with 41 SAS SSDs in it, all of the same vintage, reclaimed from a NetApp system.

I run daily smartctl short tests and weekly long tests.

All the disks always pass their short tests, but four disks that are in mirrors (two mirrors, one for boot-pool and one for apps) always fail the long test.

Another two disks in a mirror, used basically as scratch, always pass the long test as well.

It seems unlikely to me that I picked four bad disks out of 41 to put in mirrors. Is there some interaction between boot and apps use, or the fact that they’re mirrors, that is making the long test fail?

The long test always fails “in segment 8” with no further information.

Any ideas?

Post the output of smartctl -x /dev/xxx for one of the drives.
Maybe you are interpreting something wrong.

However, if it is failing the SMART long self-test, then the drive has failed. SMART is all internal to the drive and has no correlation to anything outside of its little box of hardware and spinning disks.

Here’s the output from one of the drives. FYI, I’m running multireport daily to process the results of the smartctl tests:

output from 'smartctl -x /dev/sdc'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               NETAPP
Product:              X439_S16331T6AMD
Revision:             NA04
Compliance:           SPC-4
User Capacity:        1,600,321,314,816 bytes [1.60 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538a75801d80
Serial number:        S20JNWAG800472
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Jul  2 08:26:47 2024 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     21 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2015
Accumulated start-stop cycles:  256
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     278049.709           0
write:         0        0         0         0          0     128444.062           0
verify:        0        0         0         0          0     504197.952           0

Non-medium error count:        4

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     315                 - [-   -    -]
# 2  Background long   Failed in segment -->       8     291                 - [-   -    -]
# 3  Background short  Completed                   -     267                 - [-   -    -]
# 4  Background short  Completed                   -     243                 - [-   -    -]
# 5  Background short  Completed                   -     219                 - [-   -    -]
# 6  Background short  Completed                   -     195                 - [-   -    -]
# 7  Background short  Completed                   -     171                 - [-   -    -]
# 8  Background short  Completed                   -     147                 - [-   -    -]
# 9  Background short  Completed                   -     123                 - [-   -    -]
#10  Background short  Completed                   -      76                 - [-   -    -]
#11  Background short  Completed                   -      52                 - [-   -    -]
#12  Background short  Completed                   -      28                 - [-   -    -]
#13  Background short  Completed                   -       4                 - [-   -    -]
#14  Background short  Completed                   -   65516                 - [-   -    -]
#15  Background long   Failed in segment -->       8   65497                 - [-   -    -]
#16  Background long   Failed in segment -->       8   65497                 - [-   -    -]
#17  Background long   Failed in segment -->       8   65494                 - [-   -    -]
#18  Background short  Completed                   -   65468                 - [-   -    -]
#19  Background short  Completed                   -   65444                 - [-   -    -]
#20  Background short  Completed                   -   65420                 - [-   -    -]

Long (extended) Self-test duration: 3600 seconds [60.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 65854:00 [3951240 minutes]
    Number of background scans performed: 81,  scan progress: 39.04%
    Number of background medium scans performed: 81
Device does not support General statistics and performance logging

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 7
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5002538a75801d81
    attached SAS address = 0x5001438007bdbba6
    attached phy identifier = 0
    Invalid DWORD count = 58128
    Running disparity error count = 58177
    Loss of DWORD synchronization count = 5
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 58368
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 195866
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 7
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5002538a75801d82
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 0
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 0
     Received SSP frame error count: 0
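
For anyone who wants to check this pattern across many drives without reading each log by hand, here is a minimal Python sketch that parses the self-test log table from saved smartctl -x output and lists the failed long tests. The function names (parse_selftest_log, failed_long_tests, unwrap_hours) and the regular expression are made up for illustration and written against the table layout shown above; they are not part of smartmontools or multireport. The unwrap_hours helper rests on an assumption: that the LifeTime column is a 16-bit counter that wraps at 65536 hours. The log above is consistent with that (entry #14 at 65516 hours is followed by #13 at 4 hours, and the background scan log reports 65854 accumulated hours, which is 65536 + 318), but treat it as unverified for this firmware.

```python
import re

# One row of the "SMART Self-test log" table as printed for SAS devices, e.g.
# "# 2  Background long   Failed in segment -->       8     291    - [-   -    -]"
# Layout is inferred from the output above, not from a documented format.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>Background (?:short|long))\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<segment>-|\d+)\s+"
    r"(?P<hours>\d+)\s+"
    r"\S+\s+\["
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in smartctl text output."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # not a self-test row
        entries.append({
            "num": int(m["num"]),
            "test": m["test"],
            "status": m["status"],
            "segment": None if m["segment"] == "-" else int(m["segment"]),
            "hours": int(m["hours"]),
        })
    return entries


def failed_long_tests(entries):
    """Keep only long self-tests whose status starts with 'Failed'."""
    return [
        e for e in entries
        if e["test"] == "Background long" and e["status"].startswith("Failed")
    ]


def unwrap_hours(logged_hours, power_on_hours, modulus=65536):
    """Map a logged LifeTime value to absolute power-on hours.

    ASSUMPTION: the logged value wraps at `modulus` (16-bit counter), and the
    test ran at or before `power_on_hours`.
    """
    wraps = (power_on_hours - logged_hours) // modulus
    return logged_hours + wraps * modulus


# A few rows copied from the log above.
SAMPLE = """\
# 1  Background short  Completed                   -     315                 - [-   -    -]
# 2  Background long   Failed in segment -->       8     291                 - [-   -    -]
#13  Background short  Completed                   -       4                 - [-   -    -]
#14  Background short  Completed                   -   65516                 - [-   -    -]
#15  Background long   Failed in segment -->       8   65497                 - [-   -    -]
"""

if __name__ == "__main__":
    power_on_hours = 65854  # from "Accumulated power on time" above
    for e in failed_long_tests(parse_selftest_log(SAMPLE)):
        print(e["num"], e["segment"], unwrap_hours(e["hours"], power_on_hours))
```

Run against the full log above, this should report the four failed long tests (#2, #15, #16 and #17), all in segment 8. It only summarizes what smartctl already printed; it does not explain why segment 8 fails.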