Checksum Error but SMART test is good?

Hello,

I’m pretty new to this stuff so I just want to know if I should expedite a replacement HDD or if I’m “ok”.
My dataset is showing that one of my drives had a checksum error, but it wasn’t file corruption, and I can’t pinpoint what it was either: as far as I can tell, SMART isn’t saying the drive is dying yet.

Any help would be greatly appreciated. I’ve also attached a few smartctl outputs I was looking at (I don’t know how to keep it from being a whole wall of text).

My specs are:
Ryzen 5 5600G
32 GB DDR4 (4x8 GB)
Asus ROG Strix X470-F Gaming
3x HGST_HUH721010ALE604 10TB (SATA on motherboard)
1x HUH721010AL5201 10TB (SAS on LSI card)
LSI 9207-8i (used only for the 5201 drive)
Corsair Force 3 120 GB SSD for OS (will be replacing soon)

admin@truenas[~]$ sudo zpool status Plex -v 
  pool: Plex
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 09:07:41 with 0 errors on Sun Apr  6 09:07:42 2025
expand: expanded raidz1-0 copied 21.2T in 2 days 00:02:27, on Tue Feb 25 23:05:34 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        Plex                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            d425542e-954e-4742-940e-8ac436533345  ONLINE       0     0     0
            24560ce7-09eb-4a8b-9115-c3ddfde7adf8  ONLINE       0     0     0
            693d52a9-7bbe-4bda-a155-9787e1e51166  ONLINE       0     0     1
            c1f17d67-b3ef-414f-b44e-215796e53ffc  ONLINE       0     0     0

errors: No known data errors
admin@truenas[~]$ smartctl -a dev/sdd
zsh: command not found: smartctl
admin@truenas[~]$ sudo smartctl -a dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

dev/sdd: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary



smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar He10
Device Model:     HGST HUH721010ALE604
Serial Number:    2TJ94RND
LU WWN Device Id: 5 000cca 26ae0581b
Firmware Version: LHACW38Q
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5671
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr 22 11:07:26 2025 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1054) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   163   163   024    Pre-fail  Always       -       358 (Average 450)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   092   092   000    Old_age   Always       -       56799
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       3808
193 Load_Cycle_Count        0x0012   097   097   000    Old_age   Always       -       3808
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     52241         -
# 2  Short offline       Completed without error       00%     45309         -
# 3  Short offline       Completed without error       00%     45291         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

Please add the output of sudo sas2flash -list, posted as formatted text (</> button) rather than as a text attachment.
The drive you’ve posted a SMART report for looks healthy. We have to trust that you picked the right drive.
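
For anyone reading a report like the one above, "looks healthy" mostly comes down to a handful of raw values being zero. Here is a minimal sketch of that check, assuming the `smartctl -a` attribute table layout shown in this thread. The function name and the choice of watched attributes are illustrative rules of thumb, not anything defined by smartmontools.

```python
# Attributes whose non-zero RAW_VALUE commonly points at media or cabling trouble.
# This selection is a rule of thumb (an assumption), not a vendor specification.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def suspicious_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Rows copied from the report posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

print(suspicious_attributes(SAMPLE))  # {}
```

An empty result only says these four counters are clean. It does not rule out RAM, cabling, or controller problems.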

I’ve cross-referenced the dataset error with the drive so it is the correct one.
I probably should have mentioned that the three ALE604 drives are SATA and connected directly to the motherboard. I’ve made the changes to the original post, and thank you for telling me how to add the text.

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        Adapter Selected is a LSI SAS: SAS2308_1(D1) 

        Controller Number              : 0
        Controller                     : SAS2308_1(D1) 
        PCI Address                    : 00:09:00:00
        SAS Address                    : 56c92bf-0-0017-80b8
        NVDATA Version (Default)       : 14.01.00.06
        NVDATA Version (Persistent)    : 14.01.00.06
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 20.00.06.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9207-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9207-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

No ECC RAM???

My CPU doesn’t support it.

Latest is 20.00.07.00 so you’re almost good. The HBA suspect is released from custody by this investigator.

If ECC RAM would help, I might be able to swap in a regular 5600, which should allow me to use ECC RAM.

But then your gaming MB may not boot without a GPU…
Keep an eye on it, run regular SMART tests and scrubs. But if nothing comes out, it may have been a fluke (or a RAM error).

Have you run another scrub to verify there are no new errors or corrupt files?

If that turns out okay, then run a zpool clear to reset the CKSUM counter.
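
When comparing scrub results before and after a clear, it can help to pull out only the devices with non-zero counters. Below is a rough sketch that parses pasted `zpool status` text. It assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown earlier in this thread and plain integer counters. The function name is hypothetical, and it does not run any ZFS command itself.

```python
def devices_with_errors(status_text):
    """Return {device_name: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM
        if len(fields) != 5:
            continue
        name, _state, *counters = fields
        if not all(c.isdigit() for c in counters):
            continue  # header row, prose, or abbreviated counters this sketch ignores
        read, write, cksum = map(int, counters)
        if read or write or cksum:
            flagged[name] = (read, write, cksum)
    return flagged


# Config table copied from the zpool status output in the original post.
STATUS = """\
        NAME                                      STATE     READ WRITE CKSUM
        Plex                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            d425542e-954e-4742-940e-8ac436533345  ONLINE       0     0     0
            24560ce7-09eb-4a8b-9115-c3ddfde7adf8  ONLINE       0     0     0
            693d52a9-7bbe-4bda-a155-9787e1e51166  ONLINE       0     0     1
            c1f17d67-b3ef-414f-b44e-215796e53ffc  ONLINE       0     0     0

errors: No known data errors
"""

print(devices_with_errors(STATUS))
# {'693d52a9-7bbe-4bda-a155-9787e1e51166': (0, 0, 1)}
```

If a later scrub flags a different device than the first one did, that pattern is more suggestive of something shared (RAM, controller, power) than of a single failing disk.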

doing one now

Without ECC, it’s possible for a bit flip to be stored on a drive, and hence there are checksum errors. Changing out the RAM afterwards doesn’t fix what was already written.
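
To make the mechanism concrete, here is a small illustration of why a single flipped bit shows up as a checksum mismatch. It is a simplified sketch: ZFS defaults to the fletcher4 checksum, and SHA-256 (one of the algorithms ZFS can be set to use) stands in here because it ships with Python. The block contents, the bit position, and the timing of the flip are made up for the example.

```python
import hashlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


block = b"movie data " * 1000

# Checksum computed while the block in memory is still intact.
checksum_at_write = hashlib.sha256(block).hexdigest()

# A bit flips in RAM before the block reaches the disk (simplified timing).
stored = flip_bit(block, 12345)

# On a later read or scrub, the stored data no longer matches its checksum.
checksum_at_read = hashlib.sha256(stored).hexdigest()

print(checksum_at_write == checksum_at_read)  # False
```

The disk faithfully stored what it was given, which is why SMART can stay clean while the pool reports a checksum error.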

My observation is that more of these mystery pool issues occur on systems without ECC. However, it’s difficult to connect the dots. It’s like global warming and large storms.

I just ran a scrub, and now it’s saying I have an error on a different drive… not the same one.
RAM is looking like the issue. I’ll turn the system off and boot Memtest86 tomorrow to see if it catches anything. If that’s the problem, I can do the CPU swap and get some new RAM.

So the long SMART tests passed on both drives. I’ll be running Memtest later tonight.

I think we have found our culprit. Thank you everyone for all the help. I was about to poop myself if I needed to buy 2 more drives (prices in Canada are stupid), but RAM is no problem.

Another advertisement for ECC RAM.
It’s like seatbelts for your Corvette… you hope you never need them.

You should have been on the iXsystems video about ECC RAM then! The conclusion (the way I viewed it) was that ECC is not very important in iX’s view.

I didn’t say that these issues happen very often. But when there are hundreds of thousands of users, some users will experience issues.

No one denies that ECC solves some issues.

I agree with you, but as I recall, Kris said something like “I don’t use it because I have never had any memory errors in my life; it’s OK if someone wants extra security.” Kind of a pooh-pooh of the issue. I agree it’s not “required”, but I do feel it’s very important. Memory errors obviously happen, memory ages, etc.

Put another way:

As illustrated by the OP, memory errors occur, and memory sticks go bad.
If it happened to you, how long would it take to track down the issue? How much do you value this time?
What is the extra cost of building a NAS with ECC memory on a fully ECC-qualified platform, which would clearly and immediately report memory errors?

Comparing the two costs is left as an exercise for the gentle reader…

I mean, the price difference between ECC and non-ECC RAM for my board isn’t massive: I can get 64 GB of ECC for about $170 and non-ECC for about $145, so to me it’s no biggie. $30 for extra peace of mind is 100% worth it, especially since I’m going to be getting another pool going in a bit. I ran a scrub with the bad stick pulled out and it didn’t come up with any errors, so we’re 100% good.
