I have a system (full details in my sig) with 41 SAS SSDs in it, all of the same vintage, reclaimed from a NetApp system.
I run daily smartctl short tests and weekly long tests.
All the disks always pass their short tests, but four disks in my system that are in mirrors (two mirrors, one each for boot-pool and apps use) always fail the long test.
Another two disks are in a mirror, basically used as scratch, and those always pass the long test too.
It seems unlikely to me that I picked four bad disks out of 41 to put in mirrors. Is there some interaction between boot and apps use, or the fact that they’re mirrors, that is making the long test fail?
The long test always fails “in segment 8” with no further information.
Any ideas?
Post the output of smartctl -x /dev/xxx for one of the drives.
Maybe you are interpreting something wrong.
However, if it is failing the SMART long self-test, then the drive has failed. SMART is all internal to the drive and has no correlation to anything outside of its little box of hardware and spinning disks.
Here’s the output from one of the drives. FYI, I’m running multireport daily to process the results of the smartctl tests:
output from 'smartctl -x /dev/sdc'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: NETAPP
Product: X439_S16331T6AMD
Revision: NA04
Compliance: SPC-4
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5002538a75801d80
Serial number: S20JNWAG800472
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Tue Jul 2 08:26:47 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current Drive Temperature: 21 C
Drive Trip Temperature: 60 C
Manufactured in week 31 of year 2015
Accumulated start-stop cycles: 256
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     278049.709           0
write:         0        0         0         0          0     128444.062           0
verify:        0        0         0         0          0     504197.952           0
Non-medium error count: 4
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     315                 - [-   -    -]
# 2  Background long   Failed in segment -->       8     291                 - [-   -    -]
# 3  Background short  Completed                   -     267                 - [-   -    -]
# 4  Background short  Completed                   -     243                 - [-   -    -]
# 5  Background short  Completed                   -     219                 - [-   -    -]
# 6  Background short  Completed                   -     195                 - [-   -    -]
# 7  Background short  Completed                   -     171                 - [-   -    -]
# 8  Background short  Completed                   -     147                 - [-   -    -]
# 9  Background short  Completed                   -     123                 - [-   -    -]
#10  Background short  Completed                   -      76                 - [-   -    -]
#11  Background short  Completed                   -      52                 - [-   -    -]
#12  Background short  Completed                   -      28                 - [-   -    -]
#13  Background short  Completed                   -       4                 - [-   -    -]
#14  Background short  Completed                   -   65516                 - [-   -    -]
#15  Background long   Failed in segment -->       8   65497                 - [-   -    -]
#16  Background long   Failed in segment -->       8   65497                 - [-   -    -]
#17  Background long   Failed in segment -->       8   65494                 - [-   -    -]
#18  Background short  Completed                   -   65468                 - [-   -    -]
#19  Background short  Completed                   -   65444                 - [-   -    -]
#20  Background short  Completed                   -   65420                 - [-   -    -]
Long (extended) Self-test duration: 3600 seconds [60.0 minutes]
Background scan results log
Status: waiting until BMS interval timer expires
Accumulated power on time, hours:minutes 65854:00 [3951240 minutes]
Number of background scans performed: 81, scan progress: 39.04%
Number of background medium scans performed: 81
Device does not support General statistics and performance logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 7
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: power on
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=0 stp=0 smp=1
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538a75801d81
attached SAS address = 0x5001438007bdbba6
attached phy identifier = 0
Invalid DWORD count = 58128
Running disparity error count = 58177
Loss of DWORD synchronization count = 5
Phy reset problem count = 0
Phy event descriptors:
Received ERROR count: 58368
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 195866
Received SSP frame error count: 0
relative target port id = 2
generation code = 7
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: power on
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5002538a75801d82
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
Phy event descriptors:
Received ERROR count: 0
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 0
Received SSP frame error count: 0
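A note on reading the self-test log in that output: the LifeTime column drops from 65516 to 4 between entries #14 and #13, while the drive reports 65854 accumulated power-on hours. That pattern would be consistent with the log's hour field wrapping at 65536, but that is an assumption drawn from the numbers, not something smartctl states. The Python sketch below parses a few rows from the log and estimates absolute power-on hours under that assumption. The names parse_rows and unwrap_hours are made up for illustration and are not part of smartmontools or multireport.

```python
import re

# A few rows taken from the self-test log above, newest first.
SAMPLE_LOG = """\
# 1 Background short Completed - 315 - [- - -]
# 2 Background long Failed in segment --> 8 291 - [- - -]
#13 Background short Completed - 4 - [- - -]
#14 Background short Completed - 65516 - [- - -]
#15 Background long Failed in segment --> 8 65497 - [- - -]
"""

# Matches one log row; \s+ tolerates either single or aligned spacing.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+Background\s+(?P<kind>short|long)\s+"
    r"(?P<status>Completed|Failed in segment -->)\s+"
    r"(?P<segment>-|\d+)\s+(?P<hours>\d+)\s"
)

# Assumption: the LifeTime field wraps at 2**16 hours on this drive.
WRAP = 1 << 16


def parse_rows(text):
    """Return one dict per recognised self-test row, in the order given."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "num": int(m["num"]),
                "kind": m["kind"],
                "failed": m["status"].startswith("Failed"),
                "segment": None if m["segment"] == "-" else int(m["segment"]),
                "hours": int(m["hours"]),
            })
    return rows


def unwrap_hours(rows, power_on_hours):
    """Estimate absolute hours for rows listed newest first.

    Each entry must not be newer than the one before it, so whenever the
    raw value would jump forward in time we step back one wrap period.
    """
    base = (power_on_hours // WRAP) * WRAP
    previous = power_on_hours
    result = []
    for row in rows:
        absolute = base + row["hours"]
        while absolute > previous:
            base -= WRAP
            absolute -= WRAP
        result.append({**row, "abs_hours": absolute})
        previous = absolute
    return result


if __name__ == "__main__":
    # 65854 is the "Accumulated power on time" reported in the output above.
    for r in unwrap_hours(parse_rows(SAMPLE_LOG), power_on_hours=65854):
        if r["failed"]:
            print(r["num"], r["kind"], "segment", r["segment"],
                  "at about hour", r["abs_hours"])
```

If the wrap assumption holds, the failed long test logged at 291 hours is recent (roughly a day before the output was captured) rather than from early in the drive's life.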