ZFS checksum error

Bl00dWolf · November 11, 2024, 12:48pm

Good afternoon.
I have TrueNAS-SCALE-24.10.0.2.
I have two pools. One is a 16+16 mirror and the other is a 4+4 mirror.
I decided to swap the 4 TB disks for 22 TB disks.

I disconnected sda from the 4 TB pool and replaced it with a 22 TB disk. The resilver started.
The next morning I checked the status: it says that both disks have ZFS checksum errors.

I started a scrub, though I don’t know if that was really necessary.

Bl00dWolf-NAS% sudo zpool status -v Secondary_Pool_1
  pool: Secondary_Pool_1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Nov 11 10:06:38 2024
        1.73T / 2.66T scanned at 954M/s, 222G / 2.66T issued at 119M/s
        0B repaired, 8.15% done, 05:57:35 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        Secondary_Pool_1                          ONLINE       0     0     0
          mirror-0                                ONLINE       8     0     0
            3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b  ONLINE       0     0    11
            91b61e28-82d5-4ed5-944a-8c8ead650c97  ONLINE       9     0     3

errors: No known data errors

But it doesn’t say that any files are affected. What to do in this situation?

Protopia · November 11, 2024, 1:06pm

IMO you should let the scrub finish and then look at the status again (but use sudo zpool status -v to get additional error information).
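
When re-checking the status after the scrub, it can help to filter the output down to the rows that actually carry errors. The following is only a sketch: the function name nonzero_errors is made up for illustration, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout of zpool status.

```shell
# Print every pool/vdev/disk row of `zpool status` whose READ, WRITE or CKSUM
# counter is non-zero. Reads the status text on stdin.
# (nonzero_errors is a hypothetical helper name, not part of ZFS.)
nonzero_errors() {
  awk '
    # The header row marks the start of the config table.
    $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
    # The "errors:" line comes after the table.
    /^errors:/ { in_config = 0 }
    # Counters are compared as strings so abbreviated values (e.g. 1.2K) also match.
    in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
      printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
    }
  '
}

# Usage on a live system (pool name taken from this thread):
#   sudo zpool status -v Secondary_Pool_1 | nonzero_errors
```

An empty result means every counter in the config table is zero.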

Also please copy and paste the output of the following commands:

lsblk -bo NAME,MODEL,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
lspci
sas2flash -list
sas3flash -list
sudo smartctl --all /dev/XXX replacing XXX with the device name for each of the two disks.
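
Because zpool status identifies pool members by partition UUID while smartctl wants a device name, a small lookup can connect the two. This is a rough sketch, assuming lsblk is run with -r (raw), -n (no headings) and -o NAME,PARTUUID; the function name partuuid_to_dev is invented for this example.

```shell
# Given a partition UUID as $1 and `lsblk -rno NAME,PARTUUID` output on stdin,
# print the partition device that carries that UUID.
# (partuuid_to_dev is a hypothetical helper name.)
partuuid_to_dev() {
  awk -v uuid="$1" '$2 == uuid { print "/dev/" $1 }'
}

# Usage on a live system:
#   lsblk -rno NAME,PARTUUID | partuuid_to_dev 3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b
```

The result is a partition such as /dev/sdX1; smartctl is then run against the parent disk. A UUID that returns nothing belongs to a disk that is no longer attached.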

Bl00dWolf · November 12, 2024, 12:44pm

Okay.
After the scrub, everything was fine.
Replaced 4 tb with 22 tb, went through the resilver, all ok.

Thanks for the help, hopefully everything will be ok now =)

Protopia · November 12, 2024, 5:32pm

Please run the commands and post the output anyway so that we can check that there isn’t an obvious root cause for the problem or some other issue lurking to get you.

Also, implement @joeschmuck’s Multi-Report script.

sfatula · November 12, 2024, 7:07pm

It doesn’t say files were impacted because ZFS corrected the errors using the mirror. I would not assume you are problem-free; you are not. Those errors came up for a reason, and that reason still needs to be determined.

Bl00dWolf · November 12, 2024, 10:47pm

OK, thanks, here is all the information:

Bl00dWolf-NAS% sudo lsblk -bo NAME,MODEL,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME   MODEL                     PTTYPE TYPE    START           SIZE PARTTYPENAME             PARTUUID
sda    ST18000NM000J-2TV103      gpt    disk          18000207937536                          
├─sda1                           gpt    part      128     2147418624 Linux swap               aea82bb5-23a1-4d9c-b5ea-d8a06e046275
└─sda2                           gpt    part  4194432 17998060371456 Solaris /usr & Apple ZFS 1eda4e28-aa78-4e8d-b005-f062186a6991
sdb    WDC WUH722222ALE6L4       gpt    disk          22000969973760                          
└─sdb1                           gpt    part     2048 22000967876608 Solaris /usr & Apple ZFS 26e4d6b9-f0cd-4f2b-85f5-234d78c9e1aa
sdc    WDC WUH722222ALE6L4       gpt    disk          22000969973760                          
└─sdc1                           gpt    part     2048 22000967876608 Solaris /usr & Apple ZFS 3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b
sdd    Samsung SSD 870 EVO 250GB gpt    disk            250059350016                          
├─sdd1                           gpt    part     4096        1048576 BIOS boot                fbbdbdad-903e-45dc-85c1-9bf774071d92
├─sdd2                           gpt    part     6144      536870912 EFI System               73986219-3b00-445d-992a-42606e751374
├─sdd3                           gpt    part 34609152   232339447296 Solaris /usr & Apple ZFS 84125976-78b2-4262-bd1f-a3f8afbea40c
└─sdd4                           gpt    part  1054720    17179869184 Linux swap               948e3532-8c57-4a80-8ecc-081bfb27b8b9
sde    ST18000NM000J-2TV103      gpt    disk          18000207937536                          
├─sde1                           gpt    part      128     2147418624 Linux swap               ec046520-81e1-4a7f-9db5-504cba6658b9
└─sde2                           gpt    part  4194432 17998060371456 Solaris /usr & Apple ZFS de06f47c-23a6-493a-b209-ce6b485b758f
Bl00dWolf-NAS% lspci
00:00.0 Host bridge: Intel Corporation Gemini Lake Host Bridge (rev 06)
00:00.1 Signal processing controller: Intel Corporation Celeron/Pentium Silver Processor Dynamic Platform and Thermal Framework Processor Participant (rev 06)
00:02.0 VGA compatible controller: Intel Corporation GeminiLake [UHD Graphics 600] (rev 06)
00:0e.0 Audio device: Intel Corporation Celeron/Pentium Silver Processor High Definition Audio (rev 06)
00:0f.0 Communication controller: Intel Corporation Celeron/Pentium Silver Processor Trusted Execution Engine Interface (rev 06)
00:12.0 SATA controller: Intel Corporation Celeron/Pentium Silver Processor SATA Controller (rev 06)
00:13.0 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.1 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.2 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.3 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:15.0 USB controller: Intel Corporation Celeron/Pentium Silver Processor USB 3.0 xHCI Controller (rev 06)
00:1f.0 ISA bridge: Intel Corporation Celeron/Pentium Silver Processor LPC Controller (rev 06)
00:1f.1 SMBus: Intel Corporation Celeron/Pentium Silver Processor Gaussian Mixture Model (rev 06)
02:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
Bl00dWolf-NAS% sas2flash -list
zsh: command not found: sas2flash
Bl00dWolf-NAS% sudo sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        No LSI SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
Bl00dWolf-NAS% sudo sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
Bl00dWolf-NAS% sudo smartctl --all /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Ultrastar DC HC570
Device Model:     WDC  WUH722222ALE6L4
Serial Number:    1PG02WBC
LU WWN Device Id: 5 000cca 408c00ad2
Firmware Version: LNGNW730
User Capacity:    22,000,969,973,760 bytes [22.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-5 (minor revision not indicated)
SATA Version is:  SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 13 01:46:59 2024 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (2772) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       365 (Average 365)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       29
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       6553700
 71 Milli_Micro_Actuator    0x0001   100   100   001    Pre-fail  Offline      -       0 0 0
 90 NAND_Master             0x0031   100   100   001    Pre-fail  Offline      -       0xffff00000000
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       53
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0002   046   046   000    Old_age   Always       -       47 (Min/Max 20/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Bl00dWolf-NAS% sudo smartctl --all /dev/sdc
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Ultrastar DC HC570
Device Model:     WDC  WUH722222ALE6L4
Serial Number:    1NG2UWXW
LU WWN Device Id: 5 000cca 2fec149f8
Firmware Version: LNGNW730
User Capacity:    22,000,969,973,760 bytes [22.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-5 (minor revision not indicated)
SATA Version is:  SATA 3.5, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 13 01:47:01 2024 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (2640) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       360 (Average 360)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       54
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       6553700
 71 Milli_Micro_Actuator    0x0001   100   100   001    Pre-fail  Offline      -       0 0 0
 90 NAND_Master             0x0031   100   100   001    Pre-fail  Offline      -       0xffff00000000
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       74
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       74
194 Temperature_Celsius     0x0002   041   041   000    Old_age   Always       -       51 (Min/Max 20/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        33         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Bl00dWolf-NAS% sudo zpool status -v Secondary_Pool_1
  pool: Secondary_Pool_1
 state: ONLINE
  scan: resilvered 2.66T in 03:53:09 with 0 errors on Mon Nov 11 23:55:07 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Secondary_Pool_1                          ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            3c28a0bd-5803-4bdf-8ec2-3343ca3ea73b  ONLINE       0     0     0
            26e4d6b9-f0cd-4f2b-85f5-234d78c9e1aa  ONLINE       0     0     0

errors: No known data errors

I’m running a long SMART test on the two new 22 TB HDDs now.

Protopia · November 12, 2024, 11:29pm

It’s late and I am tired, but I don’t see any issues with this configuration or drives at all.

If you get errors again, then think about whether the SATA cables need reseating or whether you might have a PSU issue or whether you might need better quality SATA cables.

sfatula · November 13, 2024, 1:10am

You are using on board SATA ports?

In addition to the above (cabling, etc), maybe temps (50’s getting a little higher than I’d want), current BIOS? Memory was tested?
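
For keeping an eye on temperatures, one option is to read attribute 194 from smartctl -A. This is a sketch only: it assumes the attribute table layout shown in this thread (raw value in the tenth column), and the 50 degree default limit is an arbitrary illustration, not a vendor figure. Both function names are made up.

```shell
# Read `smartctl -A` output on stdin and print the current temperature,
# i.e. the first number of the raw value of attribute 194.
# Assumes the classic ATA attribute table layout.
smart_temp() {
  awk '$1 == "194" { print $10; exit }'
}

# Print a warning when the temperature is at or above a limit.
# $1 = temperature in Celsius, $2 = optional limit (default 50, an assumption).
check_temp() {
  temp=$1
  limit=${2:-50}
  if [ "$temp" -ge "$limit" ]; then
    echo "WARN ${temp}C >= ${limit}C"
  else
    echo "OK ${temp}C"
  fi
}

# Usage on a live system:
#   check_temp "$(sudo smartctl -A /dev/sdc | smart_temp)"
```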

Bl00dWolf · November 13, 2024, 7:20am

OK, thank you for the help.
SATA should be okay, but I’ll check =)
PSU - BeQuiet SFX POWER 3 300W it should be ok too.

Bl00dWolf · November 13, 2024, 7:23am

Mostly =(
I only have four sata ports on my motherboard.
And 5 disks - boot ssd sata and 4 HDD.
So all 4 standard SATA ports are occupied, and I bought an M.2-to-SATA expansion board.

Regarding memory: 32 GB of laptop RAM (2 x 16 GB, 3200) that runs at 2400 without XMP.
Checked by memtest.

About temperatures - yes, it’s a lot, but there’s not much I can do about it, the fan is at full speed.

Right now the high temperature is most likely due to the long SMART test. After that it will settle down to 38-45.

Protopia · November 13, 2024, 11:26am

Ah - the m.2 → SATA expansion card could well be the issue:

Possible overheating under stress - adding a heatsink might help.
Possible technical issues leading to failed reads which would show up in the ZFS statistics but not in the SMART statistics.

If you search the new and old forums, you should find a detailed explanation about why such expansion boards can cause problems. (Sorry, I don’t have the link to hand.)

Bl00dWolf · November 13, 2024, 8:27pm

Honestly, I don’t think so.
I’ve had this board for 2 years and the first time I saw something like this was when changing disks.
But I’ll keep it in mind, thanks.

Long smart completed on both 22tb disks with no errors.
I bought the 22 TB disks new for $300 each. Victoria and the SMART test showed no problems.

Temperatures dropped after the tests.
47 and 45 degrees. Better than 50, I guess.

Stux · November 13, 2024, 9:54pm

Check the drives for issues, clear the errors, keep an eye on it and move on.

If you hot swapped the drives without offlining then you could expect to see errors.
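
For a planned swap, the command-line order would look roughly like the sketch below. It only prints the commands rather than running them; plan_swap and its three arguments are placeholders invented for this example, and on TrueNAS the web UI normally drives these steps, so treat this as an illustration of the sequence rather than a recommended procedure.

```shell
# Print (do not run) the command sequence for a planned disk swap.
# $1 = pool name, $2 = old pool member, $3 = new device (all placeholders).
# (plan_swap is a hypothetical helper name.)
plan_swap() {
  pool=$1; old=$2; new=$3
  echo "zpool offline $pool $old"        # take the old member out of service first
  echo "# physically swap the disk here"
  echo "zpool replace $pool $old $new"   # starts the resilver onto the new disk
  echo "zpool status -v $pool"           # watch resilver progress and counters
}

# Usage:
#   plan_swap Secondary_Pool_1 <old-member> <new-device>
```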

Once the scrub completes, if no corruption is found it means all errors have been healed.

Run full smart tests on both drives, check the smart results, specifically look at the UDMA CRC error field which indicates an issue external to the drives, then clear the errors if nothing stands out.

zpool clear <poolname>
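
The "check UDMA CRC, then clear" step can be sketched as below. The helper names crc_errors and suggest_clear are invented for illustration, and the parsing assumes the attribute table layout shown earlier in this thread, with the raw value in the tenth column.

```shell
# Read `smartctl -A` output on stdin and print the raw UDMA_CRC_Error_Count
# (attribute 199). Assumes the classic ATA attribute table layout.
crc_errors() {
  awk '$1 == "199" { print $10; exit }'
}

# Print the clear command only when every CRC count passed in is 0.
# $1 = pool name; remaining arguments = one CRC count per disk.
# An empty count (attribute not found) is treated as "not clean".
suggest_clear() {
  pool=$1; shift
  for n in "$@"; do
    if [ "$n" != "0" ]; then
      echo "CRC errors present ($n): check cabling before clearing"
      return 1
    fi
  done
  echo "zpool clear $pool"
}

# Usage on a live system (device names as used in this thread):
#   suggest_clear Secondary_Pool_1 \
#     "$(sudo smartctl -A /dev/sdb | crc_errors)" \
#     "$(sudo smartctl -A /dev/sdc | crc_errors)"
```

The function only prints the clear command, so the final decision to run it stays with you.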