Good day all, I am quite new to TrueNAS, less than 6 months in. I got this error emailed to me this morning by one of my TrueNAS servers and I am unsure what it means, as I checked the storage devices and they all say no errors. Has anyone encountered this before, or can you point me in the right direction? Thank you.
This is a drive having a hardware issue, but ZFS should ensure data integrity. To check, open an SSH session (preferably) or the GUI terminal and type:
sudo zpool status -v
sudo smartctl -a /dev/sdd
(if you have rebooted since, it could now be another device)
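If the letters have shuffled, you can match the disk from the alert email by its serial number; something like this should do it (just a quick sketch — adjust the /dev/sd? glob to your system):
lsblk -o NAME,MODEL,SERIAL
# or query each disk directly via SMART
for d in /dev/sd?; do echo "$d: $(sudo smartctl -i "$d" | grep 'Serial Number')"; done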
You may copy the output here for us to have a look. Please use "Preformatted Text" </> for readability.
Thank you for the reply! This is the output of those commands:
@truenas:~$ sudo zpool status -v
pool: Rust
state: ONLINE
scan: scrub repaired 0B in 04:33:38 with 0 errors on Wed Jan 15 04:33:39 2025
config:
NAME                                        STATE     READ WRITE CKSUM
Rust                                        ONLINE       0     0     0
  mirror-0                                  ONLINE       0     0     0
    08d7015b-2980-4ac1-b149-914a3707298f    ONLINE       0     0     0
    6b829630-86a3-4dfe-8f73-560f1891491f    ONLINE       0     0     0
  mirror-1                                  ONLINE       0     0     0
    b1faca14-49e1-4bd2-b588-522bc94bd97b    ONLINE       0     0     0
    719e31e5-cd5d-42b2-b6d9-b7b22f40b0f5    ONLINE       0     0     0
logs
  c014e78e-91ea-4e28-a512-30b131c18647      ONLINE       0     0     0
errors: No known data errors
pool: SSD
state: ONLINE
scan: scrub repaired 0B in 00:04:52 with 0 errors on Wed Jan 15 00:04:53 2025
config:
NAME                                        STATE     READ WRITE CKSUM
SSD                                         ONLINE       0     0     0
  mirror-0                                  ONLINE       0     0     0
    22510995-fefa-4834-ab75-d27c72d72798    ONLINE       0     0     0
    392ab7df-9281-42d0-9032-ade713a50b7d    ONLINE       0     0     0
  mirror-1                                  ONLINE       0     0     0
    c4d88999-f2c8-4390-a02c-ebbcb964bf1c    ONLINE       0     0     0
    d157f119-6dda-45da-bd58-5333b270454e    ONLINE       0     0     0
logs
  6b632731-18a8-4132-815f-0cb0c2c532d3      ONLINE       0     0     0
errors: No known data errors
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:53 with 0 errors on Mon Jan 20 03:45:54 2025
config:
NAME            STATE     READ WRITE CKSUM
boot-pool       ONLINE       0     0     0
  mirror-0      ONLINE       0     0     0
    sdb3        ONLINE       0     0     0
    sda3        ONLINE       0     0     0
errors: No known data errors
@truenas:~$ sudo smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST16000NM000E-3NV101
Serial Number: ZX212M6R
LU WWN Device Id: 5 000c50 0e8737c6d
Firmware Version: ZZF1
User Capacity: 16,000,900,661,248 bytes [16.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 21 11:50:45 2025 AST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: (1459) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 064 044 Pre-fail Always - 119757920
3 Spin_Up_Time 0x0003 092 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 62
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 081 060 045 Pre-fail Always - 117409195
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 3192
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 62
18 Unknown_Attribute 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 048 000 Old_age Always - 31 (Min/Max 28/40)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 33
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1927
194 Temperature_Celsius 0x0022 031 052 000 Old_age Always - 31 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 5
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 5
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 100 000 Old_age Offline - 3084 (207 124 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 13835802906
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 59215040072
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3083         -
# 2  Short offline       Completed without error       00%      2868         -
# 3  Extended offline    Completed without error       00%      2747         -
# 4  Short offline       Completed without error       00%      2460         -
# 5  Extended offline    Completed without error       00%      2339         -
# 6  Short offline       Completed without error       00%      2124         -
# 7  Extended offline    Completed without error       00%      2003         -
# 8  Short offline       Completed without error       00%      1740         -
# 9  Extended offline    Completed without error       00%      1619         -
#10  Short offline       Completed without error       00%      1404         -
#11  Extended offline    Completed without error       00%      1283         -
#12  Short offline       Completed without error       00%       996         -
#13  Short offline       Completed without error       00%         0         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try "smartctl -x" for more.
Good thing that you're running regular SMART tests.
The drive is growing dubious sectors (the next step for the drive would be to confirm that these sectors cannot reliably be written to, and "reallocate" them under ID#5).
The count stands at 5, up from 1 in the alert message. I'd say it's time to get a (preferably validated) cold spare, replace the drive, and initiate an RMA with Seagate. In that order, because resilvering without redundancy gets a bit risky with 16 TB drives…
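For reference, the swap itself is best done from the pool's device management screen in the GUI, which handles partitioning the new disk for you; the rough CLI equivalent (only a sketch — the pool name assumes the 16 TB Seagate sits in "Rust", and the IDs are placeholders you would take from your own zpool status output) looks like:
POOL=Rust                                  # assumption: pool that holds the failing disk
OLD=OLD-MEMBER-GPTID                       # placeholder: the failing member's ID from zpool status
NEW=/dev/disk/by-partuuid/NEW-PARTUUID     # placeholder: partition on the replacement disk
sudo zpool replace "$POOL" "$OLD" "$NEW"
sudo zpool status -v "$POOL"               # watch the resilver progress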
Not related to your issue, but I wonder why you have a SLOG on both pools, and whether these are proper SLOG devices with PLP.
I have a drive with 4 uncorrectables, but it's been that way for years. It's on a backup server with regular scrubs, and the count hasn't grown. If your uncorrectable sectors are growing, that is a problem and that disk needs to be replaced.
It would be nice if there were a way to postpone a SMART alert unless a value actually increases. The way it currently works, you can disable SMART on that drive, dismiss the alert every time it appears, or try to ignore it…
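Until something like that exists, a small root cron job can fake it by only mailing when the raw value actually goes up. Rough, untested sketch — the state directory, the device list and the mail command are all assumptions to adapt:
#!/bin/sh
# Sketch: alert only when Current_Pending_Sector increases (not a TrueNAS feature).
STATE_DIR=/root/.smart_pending           # assumed location for last-seen values
mkdir -p "$STATE_DIR"
for dev in /dev/sdd; do                  # list the drives you care about here
    name=$(basename "$dev")
    new=$(smartctl -A "$dev" | awk '$2 == "Current_Pending_Sector" {print $10}')
    old=$(cat "$STATE_DIR/$name" 2>/dev/null || echo 0)
    if [ "${new:-0}" -gt "${old:-0}" ]; then
        # assumes a working "mail" command on the box
        echo "$dev pending sectors went from $old to $new" | mail -s "SMART pending up on $dev" root
    fi
    echo "${new:-0}" > "$STATE_DIR/$name"
done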
I will reach out to ServerPartDeals, as it was purchased from them a few months ago. We'll see how that process goes.
The pools have a SLOG because one is a VM datastore that many VMs run off of. The other is mostly for testing; I wanted to see if I would get better performance when the weekly VM backups run.
No PLP, as the server is on a UPS and there is an identical server in my rack that gets two periodic snapshots a day sent to it, and then another goes offsite.
For async backup, no.
A UPS is NOT a valid substitute for PLP. See here:
Plus, without PLP you're not getting the full benefit of a SLOG, since the logging has to actually proceed to flash instead of being acknowledged as soon as it lands in the (PLP-enabled) SLOG device.
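On a related note, a SLOG only ever comes into play for sync writes, so it's worth checking which datasets or zvols actually issue them. Assuming the VM datastore lives on the Rust pool (just my guess from your output), something like:
sudo zfs get -r -t filesystem,volume sync Rust
will show sync=standard/always/disabled per dataset; anywhere it says disabled, the SLOG is doing nothing for that data.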
Interesting. I will look into getting a Samsung PM963 M.2 on eBay, I think, rather than the Intel Optanes I am presently using. Thanks for the info.
Already hitting a wall with ServerPartDeals, for anyone reading this down the road: they want the drive back first for their own testing before replacing it.
Optane has PLP by design, even if that's only acknowledged in the spec sheet for DC Optane and not for "consumer" 900p/905p.
If your SLOGs are Optane, keep them as they are.
Interesting. Mine are small 16 GB MEMPEK1W016GAXT drives.
On the Intel website it says they do not have "Enhanced Power Loss Data Protection". I will look on eBay for the Samsung drives; I'd prefer to avoid any headaches that I can, although I don't mind a hiccup to learn more.
Ah, Optane M10…
I use these as boot drives. For SLOG they may still lose out somewhat in endurance (due to the small size) and performance (PCIe x2, and not enough channels to these 3D XPoint modules).
Optanes are an exception. The write speed is so fast that they don't need a write cache, and since they don't need a write cache, they don't need to power-loss-protect one.
Interesting. Thank you. Appreciate both of your inputs.
I think the Samsung PM963 is actually an enterprise drive with PLP.
I have the SATA SM863 and it's definitely an enterprise drive with great fsync performance. I'd imagine the PM963 should also have it, because it's a higher model number.
Honestly, I'm not a big fan of "unofficial support" when it comes to my NAS's integrity (i.e. Ryzen ECC, non-DC Optane), especially after the whole WD Red SMR debacle, and it's not like Intel doesn't have a track record of being shady (i.e. the 13th/14th-gen Core bugs). But I'm an extremely low risk taker, so I guess to each their own.
xx3 is indeed a DC drive.
PM 9xx is NVMe rather than SATA (SM 8xx), and the 960 is quite old (2016). I doubt that the generation digit (here the "6") is kept in sync between interfaces, so let's not read too much into a "higher number".
The same thing happened to me yesterday.
When I removed the disk in order to RMA it, I put it in a Windows machine to wipe it.
After wiping it, I ran the HDD manufacturer's tools to take some screenshots of the failed SMART tests, but every test I ran completed successfully with 0 uncorrectable sectors.
I decided to put the disk back in the NAS and it also reported no errors in the SMART tests.
Could it be that wiping fixed that single bad sector?
Maybe you should try that too.
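If someone wants to try the same thing without pulling the disk into a Windows box, a destructive full-surface write plus a SMART re-check from the TrueNAS shell would look roughly like this (only on a drive that is already out of the pool — it wipes everything, and it takes a long time on a 16 TB disk; /dev/sdX is a placeholder):
sudo badblocks -wsv /dev/sdX       # destructive write/read pass over the whole surface
sudo smartctl -t long /dev/sdX     # then run a long self-test
sudo smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'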