Alerts on SMART Failure but Storage Pool still shows green check

Just as the title says, I don’t know whether this is a false positive or a true positive. The drive in question is /dev/sdd in the attached screenshot. I have a spare drive in the vdev.


Please post the results of sudo smartctl -x /dev/sdd (using the preformatted text button Ctrl-e).

It looks like the CLI spat out an error too. I have a spare drive (/dev/sda) in the vdev. However, I don’t get the option to choose it when I click Replace on /dev/sdd.

root@truenas[~]# smartctl -x /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               IBM-ESXS
Product:              ST14000NM0288 E
Revision:             ECH8
Compliance:           SPC-5
User Capacity:        13,902,809,137,152 bytes [13.9 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500a7a0556f
Serial number:        ZHZ1G1JZ0000C914QUDG
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Sep 12 07:32:00 2024 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: Warning - physical element status change [asc=b, ascq=14]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 353
Power on minutes since format <not available>
Current Drive Temperature:     45 C
Drive Trip Temperature:        65 C

Elements in grown defect list: 353

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      752         0       752        840     601307.949          98
write:         0        0         0         0          0     489757.776           0
verify:        0      167         0       167        167      26283.089           0

Non-medium error count:        0

  Pending defect count:2 Pending Defects: index, LBA and accumulated_power_on_hours follow
     1:  0x1ed16           ,  38271
     2:  0x101c5348        ,  38266
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   40468                 - [-   -    -]
# 2  Background long   Completed                   -   40053                 - [-   -    -]
# 3  Background long   Completed                   -   39925                 - [-   -    -]
# 4  Background short  Aborted (device reset ?)    -   39896                 - [-   -    -]
# 5  Background long   Completed                   -   39835                 - [-   -    -]
# 6  Background long   Completed                   -   39665                 - [-   -    -]
# 7  Background long   Completed                   -   39497                 - [-   -    -]
# 8  Background long   Completed                   -   39328                 - [-   -    -]
# 9  Background long   Completed                   -   39160                 - [-   -    -]
#10  Background long   Aborted (device reset ?)    -   38976                 - [-   -    -]
#11  Background long   Completed                   -   38823                 - [-   -    -]
#12  Background long   Completed                   -   38655                 - [-   -    -]
#13  Background long   Completed                   -   38487                 - [-   -    -]
#14  Background long   Completed                   -   38374                 - [-   -    -]
#15  Background long   Failed in segment -->       -   38271            126230 [0x3 0x11 0x0]
#16  Background long   Failed in segment -->       -   38266         270291784 [0x3 0x11 0x0]
#17  Background long   Completed                   -   27086                 - [-   -    -]
#18  Background short  Aborted (by user command)   -       5                 - [-   -    -]

Long (extended) Self-test duration: 80400 seconds [22.3 hours]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 40561:02 [2433662 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 2325:39  00000000bf2a4e39  [1,17,1]   Recovered via rewrite in-place
   2 3420:33  00000000c6aa9196  [1,18,4]   Recovered via rewrite in-place
   3 27067:24  000000005a867224  [1,17,1]   Recovered via rewrite in-place
   4 36647:21  00000000c3481b2e  [1,18,4]   Recovered via rewrite in-place
   5 37841:35  00000000c0c805df  [1,18,4]   Recovered via rewrite in-place
   6 38275:18  00000000037a10c8  [1,18,4]   Recovered via rewrite in-place
   7 39185:27  000000000c766b27  [1,18,4]   Recovered via rewrite in-place
   8 39685:21  00000000242df499  [1,17,3]   Recovered via rewrite in-place
   9 39897:00  0000000036c7ffc3  [1,17,3]   Recovered via rewrite in-place
  10 40010:31  0000000036c7fbe8  [1,17,3]   Recovered via rewrite in-place
 49152 39787:24  00000000c148d9dc  [1,18,8]   Recovered via rewrite in-place
 49153 39787:24  00000000c148dc15  [1,18,8]   Recovered via rewrite in-place
 49154 39787:24  00000000c148dc16  [1,18,8]   Recovered via rewrite in-place
 49155 39787:24  00000000c148dc17  [1,18,8]   Recovered via rewrite in-place
 49156 39787:25  00000000c14b0493  [1,18,8]   Recovered via rewrite in-place
 49157 39787:25  00000000c14b0499  [1,18,8]   Recovered via rewrite in-place
 49158 39787:25  00000000c14de90a  [1,18,8]   Recovered via rewrite in-place
 49159 39787:27  00000000c14ff816  [1,18,8]   Recovered via rewrite in-place
 49160 39787:33  00000000c6a07251  [1,18,8]   Recovered via rewrite in-place

General statistics and performance log page:
  General access statistics and performance:
    Number of read commands: 2916071
    Number of write commands: 37750837
    number of logical blocks received: 652939878
    number of logical blocks transmitted: 57481799
    read command processing intervals: 4648
      in seconds: 278880.000
      in hours: 77.466
    write command processing intervals: 32763
      in seconds: 1965780.000
      in hours: 546.050
    weighted number of read commands plus write commands: 0
    weighted read command processing plus write command processing: 0
  Idle time:
    Idle time intervals: 268
      in seconds: 16080.000
      in hours: 4.466

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: hard reset
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500a7a0556d
    attached SAS address = 0x50000d110927da00
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 6
    Phy reset problem count = 2
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500a7a0556e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0

As a general matter: there’s no reason to expect any particular relationship between SMART data and ZFS pool status; they’re two completely different things.

SAS drives have some very different SMART output, so the formatting isn’t what we’re generally used to. But your drive is running hot (45 C), which could cause other problems, and it looks like it has 350+ bad/reallocated sectors (353 in the grown defect list). That’s kind of a problem.
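Since the SAS layout is unfamiliar, here is a minimal Python sketch for pulling out the values mentioned above. The excerpt is copied from the output posted earlier; the `summarize` function and its regular expressions are my own illustration and assume the text layout shown in this thread, not any official smartctl parsing interface.

```python
import re

# Excerpt copied from the smartctl -x output posted above.
SMARTCTL_EXCERPT = """\
SMART Health Status: Warning - physical element status change [asc=b, ascq=14]
Current Drive Temperature:     45 C
Drive Trip Temperature:        65 C
Elements in grown defect list: 353
read:          0      752         0       752        840     601307.949          98
write:         0        0         0         0          0     489757.776           0
verify:        0      167         0       167        167      26283.089           0
"""


def summarize(text):
    """Extract a few headline values from SAS-style smartctl text output.

    Hypothetical helper: the patterns only match the layout seen in this thread.
    """
    summary = {}
    patterns = {
        "health": r"^SMART Health Status:\s*(.+)$",
        "temp_c": r"^Current Drive Temperature:\s*(\d+) C",
        "trip_c": r"^Drive Trip Temperature:\s*(\d+) C",
        "grown_defects": r"^Elements in grown defect list:\s*(\d+)",
    }
    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            value = match.group(1).strip()
            summary[key] = int(value) if value.isdigit() else value
    # The last column of each error counter row is "Total uncorrected errors".
    for op in ("read", "write", "verify"):
        match = re.search(rf"^{op}:\s+(.+)$", text, re.MULTILINE)
        if match:
            summary[f"{op}_uncorrected"] = int(match.group(1).split()[-1])
    return summary


summary = summarize(SMARTCTL_EXCERPT)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this drive the numbers that stand out are the 353 grown defects and the 98 uncorrected read errors, alongside the health status that already reads "Warning".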


45C is a little warm, but well below the drive’s 65C trip temperature.

I would agree with that. I am also worried about the 2 pending defects from about 3 months ago, and (apparently) 49,160 errors in the background scan results log, a large number of which were logged less than 5 weeks ago. I am not sure I can believe that count, partly because the power-on hours are out of sequence, but this is what the log says.
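For anyone checking the "3 months" and "5 weeks" estimates, this is a rough sketch of the arithmetic in Python. The power-on hours are copied from the output above. The `age` helper is just for illustration, and it assumes the drive has been powered on continuously since each event, which may not be exactly true.

```python
# "Accumulated power on time" from the output above (40561:02), minutes dropped.
HOURS_NOW = 40561

# Power-on hours copied from the pending defect list, the self-test log
# and the background scan results log.
events = {
    "pending defect 1 / failed long test #15": 38271,
    "pending defect 2 / failed long test #16": 38266,
    "background scan burst (rows 49152-49160)": 39787,
}


def age(hours_at_event, hours_now=HOURS_NOW):
    """Return (hours, days, weeks) elapsed since an event.

    Assumes the drive was powered on the whole time, so power-on hours
    track wall-clock time.
    """
    hours = hours_now - hours_at_event
    return hours, hours / 24, hours / (24 * 7)


for label, poh in events.items():
    hours, days, weeks = age(poh)
    print(f"{label}: {hours} h = {days:.1f} days = {weeks:.1f} weeks")
```

Under that assumption the pending defects are about 95 days old (roughly 3 months) and the burst at hour 39787 is about 32 days old (under 5 weeks). Note that the listing itself has only 19 rows (1 to 10, then 49152 to 49160), so 49,160 is the index of the last row rather than a count of the rows shown.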

I may be wrong, but this drive looks close to failing to me.

P.S. You should think about implementing @joeschmuck’s Multi-Report script so you get a daily check and an error email when hard drive problems start, and a weekly email with a backup of your configuration file enclosed (which you will find very useful if you ever lose your boot drive).

Below the maximum, yes, but 40C is about the limit we like to see for long life.
