Good afternoon.
I have TrueNAS-SCALE-24.10.0.2.
I have two pools. One is a 16+16 mirror and the other is a 4+4 mirror.
I decided to swap the 4 TB disks for 22 TB disks.
I disconnected sda from the 4 TB pool and replaced it with a 22 TB disk. The resilver started.
The next morning I checked the status: it reports ZFS checksum errors on both disks.
I started a scrub, though I don’t know whether that was really necessary.
Bl00dWolf-NAS% sudo zpool status -v Secondary_Pool_1
  pool: Secondary_Pool_1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Nov 11 10:06:38 2024
        1.73T / 2.66T scanned at 954M/s, 222G / 2.66T issued at 119M/s
        0B repaired, 8.15% done, 05:57:35 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        Secondary_Pool_1                          ONLINE       0     0     0
          mirror-0                                ONLINE       8     0     0
            3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b  ONLINE       0     0    11
            91b61e28-82d5-4ed5-944a-8c8ead650c97  ONLINE       9     0     3

errors: No known data errors
But it doesn’t say that any files are affected. What should I do in this situation?
IMO you should let the scrub finish and then look at the status again (but use sudo zpool status -v to get additional error information).
Also please copy and paste the output of the following commands:
lsblk -bo NAME,MODEL,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
lspci
sas2flash -list
sas3flash -list
sudo smartctl --all /dev/XXX
replacing XXX with the device name for each of the two disks.
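The per-disk smartctl step can also be looped instead of typed once per device. The following is only a sketch, assuming a POSIX shell and that every device lsblk reports with TYPE `disk` should be queried; the function names `disk_paths` and `smart_all` are made up for illustration.

```shell
# Turn `lsblk -dno NAME,TYPE` output (read from stdin) into /dev paths,
# keeping only rows whose TYPE is "disk" (ROM drives, loop devices etc. are skipped).
disk_paths() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Print the full SMART report for every whole disk in the system.
# Hypothetical helper; it assumes the calling user is allowed to use sudo.
smart_all() {
    lsblk -dno NAME,TYPE | disk_paths | while read -r dev; do
        echo "=== $dev ==="
        # stdin is redirected so the command cannot swallow the device list
        sudo smartctl --all "$dev" </dev/null
    done
}
```

Calling `smart_all > smart-report.txt` would then collect everything into one file that can be pasted into the thread.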
Okay.
After the scrub, everything was fine.
I replaced the remaining 4 TB disk with a 22 TB one, and the resilver went through with no problems.
Thanks for the help, hopefully everything will be ok now =)
Please run the commands and post the output anyway so that we can check that there isn’t an obvious root cause for the problem or some other issue lurking to get you.
Also, implement @joeschmuck’s Multi-Report script.
sfatula, November 12, 2024, 7:07pm, #5
It doesn’t say files were impacted because ZFS corrected the errors using the redundant copy on the mirror. I would not assume you are problem-free; you are not. Those errors came up for a reason, and that reason still needs to be determined.
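The per-device READ, WRITE and CKSUM counters are where such corrected errors show up, even while the pool reports no known data errors. As a rough sketch (the function name `zpool_error_rows` is made up, and it assumes the usual column layout of `zpool status`), this filter prints only the rows that have a non-zero counter:

```shell
# Read `zpool status` output on stdin and print each pool/vdev/device row
# whose READ, WRITE or CKSUM counter is not zero.
zpool_error_rows() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # header of the config table
        $1 == "errors:"               { in_config = 0 }        # the table ends here
        in_config && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

For example, `sudo zpool status Secondary_Pool_1 | zpool_error_rows` prints nothing at all when every counter is zero, which makes it easy to spot when errors come back.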
OK, thanks, here is all the information:
Bl00dWolf-NAS% sudo lsblk -bo NAME,MODEL,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME   MODEL                     PTTYPE TYPE    START           SIZE PARTTYPENAME             PARTUUID
sda    ST18000NM000J-2TV103      gpt    disk          18000207937536
├─sda1                           gpt    part      128     2147418624 Linux swap               aea82bb5-23a1-4d9c-b5ea-d8a06e046275
└─sda2                           gpt    part  4194432 17998060371456 Solaris /usr & Apple ZFS 1eda4e28-aa78-4e8d-b005-f062186a6991
sdb    WDC WUH722222ALE6L4       gpt    disk          22000969973760
└─sdb1                           gpt    part     2048 22000967876608 Solaris /usr & Apple ZFS 26e4d6b9-f0cd-4f2b-85f5-234d78c9e1aa
sdc    WDC WUH722222ALE6L4       gpt    disk          22000969973760
└─sdc1                           gpt    part     2048 22000967876608 Solaris /usr & Apple ZFS 3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b
sdd    Samsung SSD 870 EVO 250GB gpt    disk            250059350016
├─sdd1                           gpt    part     4096        1048576 BIOS boot                fbbdbdad-903e-45dc-85c1-9bf774071d92
├─sdd2                           gpt    part     6144      536870912 EFI System               73986219-3b00-445d-992a-42606e751374
├─sdd3                           gpt    part 34609152   232339447296 Solaris /usr & Apple ZFS 84125976-78b2-4262-bd1f-a3f8afbea40c
└─sdd4                           gpt    part  1054720    17179869184 Linux swap               948e3532-8c57-4a80-8ecc-081bfb27b8b9
sde    ST18000NM000J-2TV103      gpt    disk          18000207937536
├─sde1                           gpt    part      128     2147418624 Linux swap               ec046520-81e1-4a7f-9db5-504cba6658b9
└─sde2                           gpt    part  4194432 17998060371456 Solaris /usr & Apple ZFS de06f47c-23a6-493a-b209-ce6b485b758f
Bl00dWolf-NAS% lspci
00:00.0 Host bridge: Intel Corporation Gemini Lake Host Bridge (rev 06)
00:00.1 Signal processing controller: Intel Corporation Celeron/Pentium Silver Processor Dynamic Platform and Thermal Framework Processor Participant (rev 06)
00:02.0 VGA compatible controller: Intel Corporation GeminiLake [UHD Graphics 600] (rev 06)
00:0e.0 Audio device: Intel Corporation Celeron/Pentium Silver Processor High Definition Audio (rev 06)
00:0f.0 Communication controller: Intel Corporation Celeron/Pentium Silver Processor Trusted Execution Engine Interface (rev 06)
00:12.0 SATA controller: Intel Corporation Celeron/Pentium Silver Processor SATA Controller (rev 06)
00:13.0 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.1 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.2 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.3 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:15.0 USB controller: Intel Corporation Celeron/Pentium Silver Processor USB 3.0 xHCI Controller (rev 06)
00:1f.0 ISA bridge: Intel Corporation Celeron/Pentium Silver Processor LPC Controller (rev 06)
00:1f.1 SMBus: Intel Corporation Celeron/Pentium Silver Processor Gaussian Mixture Model (rev 06)
02:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
Bl00dWolf-NAS% sas2flash -list
zsh: command not found: sas2flash
Bl00dWolf-NAS% sudo sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved
No LSI SAS adapters found! Limited Command Set Available!
ERROR: Command Not allowed without an adapter!
ERROR: Couldn't Create Command -list
Exiting Program.
Bl00dWolf-NAS% sudo sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
No Avago SAS adapters found! Limited Command Set Available!
ERROR: Command Not allowed without an adapter!
ERROR: Couldn't Create Command -list
Exiting Program.
Bl00dWolf-NAS% sudo smartctl --all /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Ultrastar DC HC570
Device Model: WDC WUH722222ALE6L4
Serial Number: 1PG02WBC
LU WWN Device Id: 5 000cca 408c00ad2
Firmware Version: LNGNW730
User Capacity: 22,000,969,973,760 bytes [22.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-5 (minor revision not indicated)
SATA Version is: SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Nov 13 01:46:59 2024 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 246) Self-test routine in progress...
60% of test remaining.
Total time to complete Offline
data collection: ( 101) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: (2772) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 083 083 001 Pre-fail Always - 365 (Average 365)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 6
5 Reallocated_Sector_Ct 0x0033 100 100 001 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 29
10 Spin_Retry_Count 0x0013 100 100 001 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 6553700
71 Milli_Micro_Actuator 0x0001 100 100 001 Pre-fail Offline - 0 0 0
90 NAND_Master 0x0031 100 100 001 Pre-fail Offline - 0xffff00000000
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 53
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 53
194 Temperature_Celsius 0x0002 046 046 000 Old_age Always - 47 (Min/Max 20/49)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
Bl00dWolf-NAS% sudo smartctl --all /dev/sdc
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Ultrastar DC HC570
Device Model: WDC WUH722222ALE6L4
Serial Number: 1NG2UWXW
LU WWN Device Id: 5 000cca 2fec149f8
Firmware Version: LNGNW730
User Capacity: 22,000,969,973,760 bytes [22.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-5 (minor revision not indicated)
SATA Version is: SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Nov 13 01:47:01 2024 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 246) Self-test routine in progress...
60% of test remaining.
Total time to complete Offline
data collection: ( 101) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: (2640) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 083 083 001 Pre-fail Always - 360 (Average 360)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 9
5 Reallocated_Sector_Ct 0x0033 100 100 001 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 001 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 54
10 Spin_Retry_Count 0x0013 100 100 001 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 6553700
71 Milli_Micro_Actuator 0x0001 100 100 001 Pre-fail Offline - 0 0 0
90 NAND_Master 0x0031 100 100 001 Pre-fail Offline - 0xffff00000000
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 74
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 74
194 Temperature_Celsius 0x0002 041 041 000 Old_age Always - 51 (Min/Max 20/52)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 33 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
Bl00dWolf-NAS% sudo zpool status -v Secondary_Pool_1
  pool: Secondary_Pool_1
 state: ONLINE
  scan: resilvered 2.66T in 03:53:09 with 0 errors on Mon Nov 11 23:55:07 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Secondary_Pool_1                          ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b  ONLINE       0     0     0
            26e4d6b9-f0cd-4f2b-85f5-234d78c9e1aa  ONLINE       0     0     0

errors: No known data errors
I’m running a long SMART test on the two new 22 TB HDDs now.
It’s late and I am tired, but I don’t see any issues with this configuration or these drives at all.
If you get errors again, check whether the SATA cables need reseating or replacing with better-quality ones, or whether you might have a PSU issue.
sfatula, November 13, 2024, 1:10am, #8
Are you using the on-board SATA ports?
In addition to the above (cabling, etc.), maybe check the temps (the 50s are a little higher than I’d want). Is the BIOS current? Was the memory tested?
OK, thank you for the help.
The SATA cables should be okay, but I’ll check =)
PSU: BeQuiet SFX POWER 3 300W, so it should be OK too.
On-board SATA ports? Mostly =(
I only have four SATA ports on my motherboard, and five disks: a SATA boot SSD and four HDDs.
So all four on-board SATA ports are occupied, and I bought an M.2-to-SATA expansion board.
Regarding memory: 32 GB of laptop RAM (2 x 16 GB) rated at 3200, which runs at 2400 without XMP.
It was checked with memtest.
About temperatures: yes, they are high, but there’s not much I can do about it; the fan is already at full speed.
Right now that is most likely due to the long SMART test. After that they should settle down to 38-45.
Ah - the m.2 → SATA expansion card could well be the issue:
Possible overheating under stress - adding a heatsink might help.
Possible technical issues leading to failed reads which would show up in the ZFS statistics but not in the SMART statistics.
If you search the new and old forums, you should find a detailed explanation about why such expansion boards can cause problems. (Sorry, I don’t have the link to hand.)
Honestly, I don’t think so.
I’ve had this board for 2 years and the first time I saw something like this was when changing disks.
But I’ll keep it in mind, thanks.
The long SMART test completed on both 22 TB disks with no errors.
I bought them new for $300 each. Victoria and the SMART tests showed no problems.
Temperatures dropped after the tests: 47 and 45 degrees. Better than 50, I guess.
Stux, November 13, 2024, 9:54pm, #13
Check the drives for issues, clear the errors, keep an eye on it and move on.
If you hot-swapped the drives without offlining them first, then you could expect to see errors.
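For reference, the order of operations for a planned swap looks roughly like this. It is only a sketch: `swap_plan` is a made-up helper that prints the underlying ZFS commands rather than running them, the argument names are placeholders, and on TrueNAS the web UI's offline and replace actions are the supported way to do the same thing.

```shell
# Print (do not run) the ZFS commands for a planned disk swap.
# Arguments: pool name, member being removed, replacement device.
swap_plan() {
    printf 'zpool offline %s %s\n' "$1" "$2"
    printf '# now swap the physical disk, then:\n'
    printf 'zpool replace %s %s %s\n' "$1" "$2" "$3"
    printf 'zpool status -v %s\n' "$1"
}
```

Running `swap_plan Secondary_Pool_1 OLD_MEMBER NEW_DEVICE` lists the steps in order, with the old mirror member taken offline before the disk is physically pulled.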
Once the scrub completes, if no corruption is found it means all errors have been healed.
Run full SMART tests on both drives and check the results. Look specifically at the UDMA CRC error count (attribute 199), which indicates an issue external to the drives, such as cabling. Then clear the errors if nothing stands out.
zpool clear <poolname>
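To pull just the UDMA CRC error field mentioned above out of the smartctl output, a small filter is enough. This is a sketch that assumes the ten-column legacy attribute table shown earlier in the thread; `smart_raw` is a made-up helper name.

```shell
# Print the raw value (10th column) of one SMART attribute, selected by
# name, from `smartctl -A` or `smartctl --all` output read on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}
```

For example, `sudo smartctl -A /dev/sdb | smart_raw UDMA_CRC_Error_Count` prints the current count. A value that keeps rising between checks points at the cable, connector or controller rather than the disk. The same helper works for other rows, such as `Temperature_Celsius`.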