Just as the title says, I don’t know whether I have a false positive or a true positive. The drive in question is /dev/sdd in the attached screenshot. I have a spare drive in the vdev.
Please post the results of sudo smartctl -x /dev/sdd (using the preformatted text button, Ctrl-e).
It looks like the CLI reported errors too. I have a spare drive (/dev/sda) in the vdev. However, I don’t get the option to choose it when I click Replace on /dev/sdd.
root@truenas[~]# smartctl -x /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: IBM-ESXS
Product: ST14000NM0288 E
Revision: ECH8
Compliance: SPC-5
User Capacity: 13,902,809,137,152 bytes [13.9 TB]
Logical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500a7a0556f
Serial number: ZHZ1G1JZ0000C914QUDG
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Sep 12 07:32:00 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: Warning - physical element status change [asc=b, ascq=14]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 353
Power on minutes since format <not available>
Current Drive Temperature: 45 C
Drive Trip Temperature: 65 C
Elements in grown defect list: 353
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      752         0       752        840     601307.949          98
write:         0        0         0         0          0     489757.776           0
verify:        0      167         0       167        167      26283.089           0
Non-medium error count: 0
Pending defect count:2 Pending Defects: index, LBA and accumulated_power_on_hours follow
1: 0x1ed16 , 38271
2: 0x101c5348 , 38266
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Completed - 40468 - [- - -]
# 2 Background long Completed - 40053 - [- - -]
# 3 Background long Completed - 39925 - [- - -]
# 4 Background short Aborted (device reset ?) - 39896 - [- - -]
# 5 Background long Completed - 39835 - [- - -]
# 6 Background long Completed - 39665 - [- - -]
# 7 Background long Completed - 39497 - [- - -]
# 8 Background long Completed - 39328 - [- - -]
# 9 Background long Completed - 39160 - [- - -]
#10 Background long Aborted (device reset ?) - 38976 - [- - -]
#11 Background long Completed - 38823 - [- - -]
#12 Background long Completed - 38655 - [- - -]
#13 Background long Completed - 38487 - [- - -]
#14 Background long Completed - 38374 - [- - -]
#15 Background long Failed in segment --> - 38271 126230 [0x3 0x11 0x0]
#16 Background long Failed in segment --> - 38266 270291784 [0x3 0x11 0x0]
#17 Background long Completed - 27086 - [- - -]
#18 Background short Aborted (by user command) - 5 - [- - -]
Long (extended) Self-test duration: 80400 seconds [22.3 hours]
Background scan results log
Status: no scans active
Accumulated power on time, hours:minutes 40561:02 [2433662 minutes]
Number of background scans performed: 0, scan progress: 0.00%
Number of background medium scans performed: 0
# when lba(hex) [sk,asc,ascq] reassign_status
1 2325:39 00000000bf2a4e39 [1,17,1] Recovered via rewrite in-place
2 3420:33 00000000c6aa9196 [1,18,4] Recovered via rewrite in-place
3 27067:24 000000005a867224 [1,17,1] Recovered via rewrite in-place
4 36647:21 00000000c3481b2e [1,18,4] Recovered via rewrite in-place
5 37841:35 00000000c0c805df [1,18,4] Recovered via rewrite in-place
6 38275:18 00000000037a10c8 [1,18,4] Recovered via rewrite in-place
7 39185:27 000000000c766b27 [1,18,4] Recovered via rewrite in-place
8 39685:21 00000000242df499 [1,17,3] Recovered via rewrite in-place
9 39897:00 0000000036c7ffc3 [1,17,3] Recovered via rewrite in-place
10 40010:31 0000000036c7fbe8 [1,17,3] Recovered via rewrite in-place
49152 39787:24 00000000c148d9dc [1,18,8] Recovered via rewrite in-place
49153 39787:24 00000000c148dc15 [1,18,8] Recovered via rewrite in-place
49154 39787:24 00000000c148dc16 [1,18,8] Recovered via rewrite in-place
49155 39787:24 00000000c148dc17 [1,18,8] Recovered via rewrite in-place
49156 39787:25 00000000c14b0493 [1,18,8] Recovered via rewrite in-place
49157 39787:25 00000000c14b0499 [1,18,8] Recovered via rewrite in-place
49158 39787:25 00000000c14de90a [1,18,8] Recovered via rewrite in-place
49159 39787:27 00000000c14ff816 [1,18,8] Recovered via rewrite in-place
49160 39787:33 00000000c6a07251 [1,18,8] Recovered via rewrite in-place
General statistics and performance log page:
General access statistics and performance:
Number of read commands: 2916071
Number of write commands: 37750837
number of logical blocks received: 652939878
number of logical blocks transmitted: 57481799
read command processing intervals: 4648
in seconds: 278880.000
in hours: 77.466
write command processing intervals: 32763
in seconds: 1965780.000
in hours: 546.050
weighted number of read commands plus write commands: 0
weighted read command processing plus write command processing: 0
Idle time:
Idle time intervals: 268
in seconds: 16080.000
in hours: 4.466
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 0
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: unknown
reason: hard reset
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c500a7a0556d
attached SAS address = 0x50000d110927da00
attached phy identifier = 4
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 6
Phy reset problem count = 2
relative target port id = 2
generation code = 0
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c500a7a0556e
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
As a general matter, there’s no reason to expect any particular relationship between SMART data and ZFS pool status; they’re two completely different things.
SAS drives produce quite different SMART output, so the format isn’t what we’re generally used to. But your drive is running hot (45 C), which could cause other problems, and it shows 350+ bad/reallocated sectors (353 elements in the grown defect list). That’s kind of a problem.
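Since the SAS layout is unfamiliar, here is a minimal sketch of how the headline numbers could be pulled out of the text. It assumes the line wording shown in the output posted above; the `summarize` helper and its key names are made up for illustration and are not part of smartmontools.

```python
import re

# Excerpt copied from the smartctl -x output posted in this thread.
SMART_TEXT = """\
SMART Health Status: Warning - physical element status change [asc=b, ascq=14]
Total new blocks reassigned = 353
Elements in grown defect list: 353
read: 0 752 0 752 840 601307.949 98
write: 0 0 0 0 0 489757.776 0
verify: 0 167 0 167 167 26283.089 0
Non-medium error count: 0
Pending defect count:2 Pending Defects: index, LBA and accumulated_power_on_hours follow
"""


def summarize(text):
    """Hypothetical helper: collect a few counters from SAS-style smartctl text."""
    summary = {}

    m = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if m:
        summary["grown_defects"] = int(m.group(1))

    m = re.search(r"Pending defect count:\s*(\d+)", text)
    if m:
        summary["pending_defects"] = int(m.group(1))

    # In the error counter log, the last column is "Total uncorrected errors".
    for op in ("read", "write", "verify"):
        m = re.search(rf"^{op}:\s+(.+)$", text, re.MULTILINE)
        if m:
            summary[f"{op}_uncorrected"] = int(m.group(1).split()[-1])

    return summary


print(summarize(SMART_TEXT))
```

For this drive the numbers that stand out are the 353 grown defects, the 2 pending defects, and the 98 uncorrected read errors.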
45C is a little warm, but well below the 65C maximum.
I would agree with that. I am also worried about the 2 pending defects from about 3 months ago, and (apparently) 49,160 errors, a large number of which were logged less than 5 weeks ago. I’m not sure I can believe that count, partly because the power-on hours are out of sequence, but that is what the output says.
I may be wrong, but this drive looks close to failing to me.
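The "3 months" and "5 weeks" estimates can be checked with a little arithmetic on the power-on-hour stamps. This is a rough sketch that assumes the drive has been powered on continuously, so power-on hours track wall-clock time. It also counts the background scan rows actually listed above. That count suggests the first column may be an entry index rather than an error total, but that is my reading of the output and is not confirmed anywhere in this thread.

```python
# Values copied from the smartctl output posted in this thread.
POWER_ON_HOURS_NOW = 40561             # "Accumulated power on time, hours:minutes 40561:02"
PENDING_DEFECT_HOURS = [38271, 38266]  # power-on hours of the 2 pending defects
RECENT_SCAN_HOURS = 39787              # stamp shared by the rows numbered 49152-49160


def days_ago(now_hours, event_hours):
    """Convert a power-on-hour stamp into days of power-on time before 'now'."""
    return (now_hours - event_hours) / 24


for hours in PENDING_DEFECT_HOURS:
    print(f"pending defect at {hours} h: about {days_ago(POWER_ON_HOURS_NOW, hours):.0f} days ago")

print(f"rows 49152-49160: about {days_ago(POWER_ON_HOURS_NOW, RECENT_SCAN_HOURS):.0f} days ago")

# First-column values of the background scan rows exactly as listed above.
SCAN_ROW_NUMBERS = list(range(1, 11)) + list(range(49152, 49161))
print(f"background scan rows listed: {len(SCAN_ROW_NUMBERS)}")
```

Under that assumption the pending defects are a little over three months old and the cluster of scan entries is a bit over a month old, which matches the estimates above.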
P.S. You should think about implementing @joeschmuck’s Multi-Report script, so that you get a daily check, an error email when hard drive problems start, and a weekly email with a backup of your configuration file attached (which you will find very useful if you ever lose your boot drive).
Below the maximum, yes, but 40C is about the limit we like to see for long life.
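To put the two temperature views side by side, here is a small sketch that compares the reported temperature with both the drive's trip temperature and the 40C rule of thumb. The 40C figure is the preference quoted in this thread, not a vendor specification, and the function name is made up for illustration.

```python
CURRENT_TEMP_C = 45    # "Current Drive Temperature: 45 C"
TRIP_TEMP_C = 65       # "Drive Trip Temperature: 65 C"
PREFERRED_MAX_C = 40   # rule of thumb from this thread, not a vendor limit


def classify_temperature(current, preferred_max, trip):
    """Hypothetical helper: place a reading relative to the two thresholds."""
    if current >= trip:
        return "at or above trip temperature"
    if current > preferred_max:
        return "below trip temperature, but above the preferred limit"
    return "within the preferred range"


print(classify_temperature(CURRENT_TEMP_C, PREFERRED_MAX_C, TRIP_TEMP_C))
print(f"margin to trip temperature: {TRIP_TEMP_C - CURRENT_TEMP_C} C")
print(f"above preferred limit by: {CURRENT_TEMP_C - PREFERRED_MAX_C} C")
```

Both replies are consistent with this: the drive has a 20 C margin before the trip point, but sits 5 C above the preferred limit.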