Long Running smartctl Test, Can't Cancel

WiteWulf · June 17, 2024, 8:16am

I started using the excellent multi-report recently to keep an eye on my SAS disks.

I have 41x 1.6TB SAS SSDs connected via a Dell PERC H310 and an HP SAS expander in an HP DL380 G6. The disks were gifted from a NetApp, with ~65k hours of use on each, and reformatted to 512-byte sectors.

I scrub monthly, do long tests weekly and short tests daily.

One disk in the system is “stuck” doing a long test that started about a week ago. I noticed today after it failed the multi-report run, as there was no successful long test for today.

I’ve tried issuing ‘smartctl -X /dev/sde’ to abort the test, but get ‘Abort self test failed [unsupported field in scsi command]’.

Disk info gathered just now:

root@eurybia[...l1/plex_transcodes/Transcode/Sessions]# smartctl -a /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               NETAPP
Product:              X439_S16331T6AMD
Revision:             NA04
Compliance:           SPC-4
User Capacity:        1,600,321,314,816 bytes [1.60 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538a75801c60
Serial number:        S20JNWAG800454
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Mon Jun 17 09:13:52 2024 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     30 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 64795:52
Manufactured in week 31 of year 2015
Accumulated start-stop cycles:  262
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      23313.519           0
write:         0        0         0         0          0      28652.495           0
verify:        0        0         0         0          0     247014.160           0

Non-medium error count:       60

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64769                 - [-   -    -]
# 2  Background short  Completed                   -   64745                 - [-   -    -]
# 3  Background short  Completed                   -   64721                 - [-   -    -]
# 4  Background short  Completed                   -   64697                 - [-   -    -]
# 5  Background short  Completed                   -   64673                 - [-   -    -]
# 6  Background short  Completed                   -   64651                 - [-   -    -]
# 7  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 8  Background short  Completed                   -   64601                 - [-   -    -]
# 9  Background short  Completed                   -   64577                 - [-   -    -]
#10  Background short  Completed                   -   64553                 - [-   -    -]

Long (extended) Self-test duration: 3600 seconds [60.0 minutes]

The disk is part of my boot-pool mirror. I considered swapping the disk out for a spare, then in again, to see if that would clear the ongoing test. Any other ideas?
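For reference, the self-test log above brackets when the stuck long test must have started: it sits between entry #8 (64601 hours) and entry #6 (64651 hours). Here is a rough sketch of that arithmetic. It assumes the LifeTime column is the power-on hour at which each test was logged, and that the disk has been powered on continuously since. The constant and function names are just for illustration.

```python
# Figures copied from the smartctl -a output in this post.
POWER_ON_HOURS = 64795    # "Accumulated power on time" 64795:52, minutes dropped
NEWER_SHORT_HOURS = 64651 # entry #6, first test logged after the long test
OLDER_SHORT_HOURS = 64601 # entry #8, last test logged before the long test
EXPECTED_SECONDS = 3600   # "Long (extended) Self-test duration"


def elapsed_bounds(now, newer, older):
    """Return (min, max) hours since a test logged between two other entries."""
    return now - newer, now - older


low, high = elapsed_bounds(POWER_ON_HOURS, NEWER_SHORT_HOURS, OLDER_SHORT_HOURS)
expected_hours = EXPECTED_SECONDS / 3600

print(f"long test started between {low} and {high} power-on hours ago")
print(f"drive's own estimate for a long test: {expected_hours:.1f} hour(s)")
```

That works out to somewhere between 144 and 194 hours (roughly 6 to 8 days) for a test the drive itself expects to take about an hour.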

Protopia · June 17, 2024, 8:37am

I doubt that there is anything to cancel. A disk cannot AFAIK do a long test and a short test simultaneously, and since you have more recent successful short tests, the long test you kicked off 6 days ago is probably no longer running.
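To check a log for this pattern (an "in progress" entry with newer completed tests listed above it), something like the following might do. It is only a sketch: it assumes the table layout shown in the smartctl output earlier in the thread, and that a lower entry number means a more recent test. The regex and function names are mine, not part of smartctl.

```python
import re

# Matches rows of the SCSI "SMART Self-test log" table as printed above.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<desc>(?:Background|Foreground) \w+)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<segment>-|\d+)\s+"
    r"(?P<hours>NOW|\d+)\s+"
)

SAMPLE = """\
# 6  Background short  Completed                   -   64651                 - [-   -    -]
# 7  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 8  Background short  Completed                   -   64601                 - [-   -    -]
"""


def parse_selftest_log(text):
    """Return one dict per log row that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            entries.append({
                "num": int(m["num"]),
                "desc": m["desc"],
                "status": m["status"],
                "hours": m["hours"],
            })
    return entries


def stale_in_progress(entries):
    """Return in-progress entries that have a newer (lower-numbered) completed entry."""
    stale = []
    for entry in entries:
        if "in progress" not in entry["status"].lower():
            continue
        newer_done = [
            other for other in entries
            if other["num"] < entry["num"] and other["status"].startswith("Completed")
        ]
        if newer_done:
            stale.append(entry)
    return stale


print([e["num"] for e in stale_in_progress(parse_selftest_log(SAMPLE))])
```

A flagged entry only means the log looks inconsistent. It does not prove whether the drive is still running the test.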

I am not sure what happens on a drive if a long test takes more than 24 hours and a short test is requested whilst it is still running, but I would suggest that you temporarily turn off the scheduled tests on this drive and manually run a long test to see what happens.

Also, please do a zpool status -vx to check whether you are getting any ZFS pool errors.

WiteWulf · June 17, 2024, 8:52am

root@eurybia[/mnt/Pool1/homes/garyp]# zpool status -vx          
all pools are healthy

I’ll give your suggestion re. scheduled and manual test a go, thanks.

Edit: I tried to start a long test:

root@eurybia[/mnt/Pool1/homes/garyp]# smartctl -t long /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.29-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Can't start self-test without aborting current test (44% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.

So I tried cancelling it again, and this time it worked. I’ve started a new long test, which smartctl estimates will take an hour to run.
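In case anyone wants to script the same steps, here is a minimal sketch. It pulls the percentage out of the "(N% remaining)" message shown above, then builds the abort and start commands used in this thread. The function names are made up for illustration, and run_all is defined but not called, since it needs root and a real device.

```python
import re
import subprocess


def remaining_percent(output):
    """Extract N from smartctl's "(N% remaining)" message, or None if absent."""
    m = re.search(r"\((\d+)% remaining\)", output)
    return int(m.group(1)) if m else None


def recovery_commands(device):
    """Commands from this thread, in order: abort the test, then start a long one."""
    return [
        ["smartctl", "-X", device],
        ["smartctl", "-t", "long", device],
    ]


def run_all(device):
    """Run each command in turn and print its exit code and output."""
    for cmd in recovery_commands(device):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(" ".join(cmd), "-> exit", result.returncode)
        print(result.stdout)


message = "Can't start self-test without aborting current test (44% remaining),"
print(remaining_percent(message))
```

As the thread shows, the abort can fail on one attempt and succeed on a later one, so it is worth checking the output of each step before moving on.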

WiteWulf · June 17, 2024, 9:57am

Okay, so that long test completed successfully, but the “historical” one is still listed as in progress. I guess that’s stuck there now.

Ah well, no worries, I guess. At least I’m a little more confident that my disks are okay now.

joeschmuck · June 17, 2024, 11:31am

Your disk is okay. I don’t think I’ve ever seen an error like yours, but that historical Long test was superseded by the next test completion, which was a Short test. The drive can only run one test at a time: Short, Long, Conveyance, whatever.

Glad you do not have a problem. Keep plugging along.