Good day all, I am quite new to TrueNAS, less than 6 months in. I got this error emailed to me this morning by one of my TrueNAS servers and I am unsure what it means, as I checked the storage devices and they all say no errors. Has anyone encountered this before, or can you point me in the right direction? Thank you.
This is a drive having a hardware issue, but ZFS should ensure data integrity. To check, open an SSH session (preferably) or the GUI terminal and type:
sudo zpool status -v
sudo smartctl -a /dev/sdd
(if you have rebooted since, it could now be another device)
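If the letters have shuffled, you can match the disk from the alert email by its serial number; something like this should do it (just a quick sketch — adjust the /dev/sd? glob to your system):
lsblk -o NAME,MODEL,SERIAL
# or query each disk directly via SMART
for d in /dev/sd?; do echo "$d: $(sudo smartctl -i "$d" | grep 'Serial Number')"; done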
You may copy the output here for us to have a look. Please use "Preformatted Text" </> for readability.
Thank you for the reply! This is the output of those commands:
@truenas:~$ sudo zpool status -v
pool: Rust
state: ONLINE
scan: scrub repaired 0B in 04:33:38 with 0 errors on Wed Jan 15 04:33:39 2025
config:
NAME                                        STATE     READ WRITE CKSUM
Rust                                        ONLINE       0     0     0
  mirror-0                                  ONLINE       0     0     0
    08d7015b-2980-4ac1-b149-914a3707298f    ONLINE       0     0     0
    6b829630-86a3-4dfe-8f73-560f1891491f    ONLINE       0     0     0
  mirror-1                                  ONLINE       0     0     0
    b1faca14-49e1-4bd2-b588-522bc94bd97b    ONLINE       0     0     0
    719e31e5-cd5d-42b2-b6d9-b7b22f40b0f5    ONLINE       0     0     0
logs
  c014e78e-91ea-4e28-a512-30b131c18647      ONLINE       0     0     0
errors: No known data errors
pool: SSD
state: ONLINE
scan: scrub repaired 0B in 00:04:52 with 0 errors on Wed Jan 15 00:04:53 2025
config:
NAME                                        STATE     READ WRITE CKSUM
SSD                                         ONLINE       0     0     0
  mirror-0                                  ONLINE       0     0     0
    22510995-fefa-4834-ab75-d27c72d72798    ONLINE       0     0     0
    392ab7df-9281-42d0-9032-ade713a50b7d    ONLINE       0     0     0
  mirror-1                                  ONLINE       0     0     0
    c4d88999-f2c8-4390-a02c-ebbcb964bf1c    ONLINE       0     0     0
    d157f119-6dda-45da-bd58-5333b270454e    ONLINE       0     0     0
logs
  6b632731-18a8-4132-815f-0cb0c2c532d3      ONLINE       0     0     0
errors: No known data errors
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:53 with 0 errors on Mon Jan 20 03:45:54 2025
config:
NAME            STATE     READ WRITE CKSUM
boot-pool       ONLINE       0     0     0
  mirror-0      ONLINE       0     0     0
    sdb3        ONLINE       0     0     0
    sda3        ONLINE       0     0     0
errors: No known data errors
@truenas:~$ sudo smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST16000NM000E-3NV101
Serial Number: ZX212M6R
LU WWN Device Id: 5 000c50 0e8737c6d
Firmware Version: ZZF1
User Capacity: 16,000,900,661,248 bytes [16.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 21 11:50:45 2025 AST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: (1459) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 064 044 Pre-fail Always - 119757920
3 Spin_Up_Time 0x0003 092 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 62
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 081 060 045 Pre-fail Always - 117409195
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 3192
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 62
18 Unknown_Attribute 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 048 000 Old_age Always - 31 (Min/Max 28/40)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 33
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1927
194 Temperature_Celsius 0x0022 031 052 000 Old_age Always - 31 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 5
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 5
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 100 000 Old_age Offline - 3084 (207 124 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 13835802906
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 59215040072
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3083         -
# 2  Short offline       Completed without error       00%      2868         -
# 3  Extended offline    Completed without error       00%      2747         -
# 4  Short offline       Completed without error       00%      2460         -
# 5  Extended offline    Completed without error       00%      2339         -
# 6  Short offline       Completed without error       00%      2124         -
# 7  Extended offline    Completed without error       00%      2003         -
# 8  Short offline       Completed without error       00%      1740         -
# 9  Extended offline    Completed without error       00%      1619         -
#10  Short offline       Completed without error       00%      1404         -
#11  Extended offline    Completed without error       00%      1283         -
#12  Short offline       Completed without error       00%       996         -
#13  Short offline       Completed without error       00%         0         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try "smartctl -x" for more.
Good thing that you're running regular SMART tests.
The drive is growing dubious sectors (the next step for the drive would be to confirm that these sectors cannot reliably be written to, and "reallocate" them under ID#5).
The count stands at 5, up from 1 in the alert message. I'd say it's time to get a (preferably validated) cold spare, replace the drive, and initiate an RMA with Seagate. In that order, because resilvering without redundancy gets a bit risky with 16 TB drives…
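For reference, the swap itself is best done from the pool's device management screen in the GUI, which handles partitioning the new disk for you; the rough CLI equivalent (only a sketch — the pool name assumes the 16 TB Seagate sits in "Rust", and the IDs are placeholders you would take from your own zpool status output) looks like:
POOL=Rust                                  # assumption: pool that holds the failing disk
OLD=OLD-MEMBER-GPTID                       # placeholder: the failing member's ID from zpool status
NEW=/dev/disk/by-partuuid/NEW-PARTUUID     # placeholder: partition on the replacement disk
sudo zpool replace "$POOL" "$OLD" "$NEW"
sudo zpool status -v "$POOL"               # watch the resilver progress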
Not related to your issue, but I wonder why you have a SLOG on both pools, and whether these are proper SLOG devices with PLP.
I have a drive with 4 uncorrectables, but it's been that way for years. It's on a backup server with regular scrubs, and the count hasn't grown. If your uncorrectable sectors are growing, that is a problem and that disk needs to be replaced.
It would be nice if there were a way to postpone a SMART alert unless a value actually increases. The way it currently works, you can disable SMART on that drive, dismiss the alert every time it appears, or try to ignore it…
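Until something like that exists, a small root cron job can fake it by only mailing when the raw value actually goes up. Rough, untested sketch — the state directory, the device list and the mail command are all assumptions to adapt:
#!/bin/sh
# Sketch: alert only when Current_Pending_Sector increases (not a TrueNAS feature).
STATE_DIR=/root/.smart_pending           # assumed location for last-seen values
mkdir -p "$STATE_DIR"
for dev in /dev/sdd; do                  # list the drives you care about here
    name=$(basename "$dev")
    new=$(smartctl -A "$dev" | awk '$2 == "Current_Pending_Sector" {print $10}')
    old=$(cat "$STATE_DIR/$name" 2>/dev/null || echo 0)
    if [ "${new:-0}" -gt "${old:-0}" ]; then
        # assumes a working "mail" command on the box
        echo "$dev pending sectors went from $old to $new" | mail -s "SMART pending up on $dev" root
    fi
    echo "${new:-0}" > "$STATE_DIR/$name"
done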
I will reach out to ServerPartDeals, as it was purchased from them a few months ago. We'll see how that process goes.
The pools have a SLOG because one is a VM datastore that many VMs run off of. The other is mostly for testing; I wanted to see if I would get better performance when the weekly VM backups run.
No PLP, as the server is on a UPS and there is an identical server in my rack that gets two periodic snapshots a day sent to it, and then another goes offsite.
For async backup, no.
A UPS is NOT a valid substitute for PLP. See here:
Plus, without PLP you're not getting the full benefit of a SLOG, since the logging has to actually proceed to flash instead of being acknowledged as soon as it lands in the (PLP-enabled) SLOG device.
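On a related note, a SLOG only ever comes into play for sync writes, so it's worth checking which datasets or zvols actually issue them. Assuming the VM datastore lives on the Rust pool (just my guess from your output), something like:
sudo zfs get -r -t filesystem,volume sync Rust
will show sync=standard/always/disabled per dataset; anywhere it says disabled, the SLOG is doing nothing for that data.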
Interesting. I will look into getting a Samsung PM963 M.2 on eBay, I think, rather than the Intel Optanes I am presently using. Thanks for the info.
Already hitting a wall with ServerPartDeals, for anyone reading this down the road: they want the drive back first for their own testing before replacing it.
Optane has PLP by design, even if that's only acknowledged in the spec sheet for DC Optane and not for "consumer" 900p/905p.
If your SLOGs are Optane, keep them as they are.
Interesting. Mine are small 16 GB MEMPEK1W016GAXT drives.
On the Intel website it says they do not have "Enhanced Power Loss Data Protection". I will look on eBay for the Samsung drives; I'd prefer to avoid any headaches that I can, although I don't mind a hiccup to learn more.
Ah, Optane M10…
I use these as boot drives. For SLOG they may still lose out somewhat in endurance (due to the small size) and performance (PCIe x2, and not enough channels to these 3D XPoint modules).
Optanes are an exception. The write speed is so fast that they don't need a write cache, and since they don't need a write cache, they don't need to power-loss-protect one.
Interesting. Thank you. Appreciate both of your inputs.
I think the Samsung PM963 is actually an enterprise drive with PLP.
I have the SATA SM863 and it's definitely an enterprise drive with great fsync performance. I'd imagine the PM963 should also have it, because it's a higher model number.
Honestly, I'm not a big fan of "unofficial support" when it comes to my NAS's integrity (i.e. Ryzen ECC, non-DC Optane), especially after the whole WD Red SMR debacle, and it's not like Intel doesn't have a track record of being shady (i.e. the 13th/14th-gen Core bugs). But I'm an extremely low risk taker, so I guess to each their own.
xx3 is indeed a DC drive.
PM 9xx is NVMe rather than SATA (SM 8xx), and the 960 is quite old (2016). I doubt that the generation digit (here the "6") is kept in sync between interfaces, so let's not read too much into a "higher number".
The same thing happened to me yesterday.
When I removed the disk in order to RMA it, I put it in a Windows machine to wipe it.
After wiping it, I ran the HDD manufacturer's tools to take some screenshots of the failed SMART tests, but every test I ran completed successfully with 0 uncorrectable sectors.
I decided to put the disk back in the NAS and it also reported no errors in the SMART tests.
Could it be that wiping fixed that single bad sector?
Maybe you should try that too.
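If someone wants to try the same thing without pulling the disk into a Windows box, a destructive full-surface write plus a SMART re-check from the TrueNAS shell would look roughly like this (only on a drive that is already out of the pool — it wipes everything, and it takes a long time on a 16 TB disk; /dev/sdX is a placeholder):
sudo badblocks -wsv /dev/sdX       # destructive write/read pass over the whole surface
sudo smartctl -t long /dev/sdX     # then run a long self-test
sudo smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'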