Endless checksum errors across multiple drives

I have a 3-drive setup, and I usually keep it powered off because I don’t access it often.

Recently, HexOS forwarded a warning that one of my drives was REMOVED… and nope, going to the web UI, it says everything is normal. So it just somehow got disconnected for a moment.

This drive also gave me tons of warnings two months ago, but scans found nothing wrong with it; now it looks like the problem is outside the drive itself.

I moved the drives out of the case and put them in an external shelf, changed to new SATA cables (a 4-to-4 set) and switched the power to a 1-to-4 splitter, then powered on…

There were LOADS of problems.

Some of them (Docker can’t start, drives sounding like they’re constantly and consistently shutting down, a scrub estimated to take months) never came back once I connected the SATA power back to the PSU’s own cable.

Some others did not go away. Namely: reporting doesn’t work, and checksum errors keep piling up endlessly. Not sure if they’re related.

“Reporting stops working” is just that: the report page is empty. The home page also doesn’t show any status.

I’m not sure exactly WHEN this became a thing since I don’t check it often, but I’m sure it used to work.

The checksum errors are interesting… because the count is identical between 2 drives.

THE 2 DRIVES THAT NEVER HAD AN ISSUE. The troublemaker drive now gives no errors. What.

Basically, the longer the system runs, the more errors it raises: something like 400 of them within 10 minutes.

SMB is still up, and randomly choosing a file to open usually works, but there are loads of checksum errors.

0 read errors, 0 write errors, 400 checksum errors; the same counts on 2 drives, with a clean 3rd drive.

Tried switching ports and switching SATA cables; it did nothing.

Ran a 6-hour scrub; it’s still spitting errors.

truenas_admin@HexOS[~]$ sudo zpool status -v
  pool: HDDs
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 18.6M in 06:49:35 with 108 errors on Mon Nov  3 08:35:52 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDDs                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            00f9e94d-2c50-4903-b984-501286530311  ONLINE       0     0   364
            974f1184-6c2d-49a4-b19f-f675387f7bbb  ONLINE       0     0     0
            98f8ae81-f736-414d-8afa-ada57c274e28  ONLINE       0     0   364

errors: Permanent errors have been detected in the following files:

        /var/db/system/netdata/context-meta.db-wal
        /var/db/system/netdata/netdata-meta.db
        /var/db/system/netdata/netdata-meta.db-wal
        /var/db/system/netdata/dbengine/journalfile-1-0000000057.njf
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.69260.1762103223000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.71272.1762103613000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.74868.1762104259000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.67888.1762102978000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.69885.1762103368000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.71963.1762103742000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.64957.1762102445000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.73343.1762104014000000.zst

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Mon Nov  3 03:45:23 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdb3      ONLINE       0     0     0

errors: No known data errors

So… the same error count across 2 drives suggests it’s not a drive problem. But then what is it?

The SATA cables are not a factor, the SATA ports are not a factor; my only idea left at this point is to just get a new rig.

But then, are checksum errors something that necessarily has to be dealt with? Is my data dying bit by bit, or is it just freaking out over an insignificant mismatched value?

Thanks for reading.

Yes and yes. That’s what checksums are about. They guarantee the integrity of your data. If they fail, it’s because your data is damaged.

The cause for this can vary, of course.

What exactly is that shelf you are using? How is it connected to your TrueNAS? What mainboard, what type of memory? Also, possibly, power supply issues… there are a ton of different potential causes.

It looks like this, but for 4 drives and with a single fan. Now that I think about it, I should probably call it a “cage”.

My question is: why are 2 out of 3 drives reporting the same issue?

If it’s a drive problem, the counts shouldn’t be in sync; if the whole pool or the whole system has a problem, there shouldn’t be one drive reporting no errors both before and after the cable switches… right?

I’ve thought about the CPU or RAM, but if they had a problem, shouldn’t the system fail to boot at all…?

Hopefully it’s an OS issue, but then I have no idea how to troubleshoot that.

2 questions:

1: What kind of drives are you using?

If they aren’t designed for NAS use, they may not be tolerant of each other’s vibrations.

2: How long of a run do you have between the drives and your motherboard/controller?

I think 1 meter (about 3 feet) is the limit for SATA; any longer and data integrity can’t be guaranteed.

Bad news: in the vast majority of cases it is not. It is almost always a hardware issue.

As @Lylat1an asked: type of drives, length of SATA cables, please.

Also: what is on the other end? Mainboard, SATA chipset, type of CPU and memory, cooling, power supply… everything.

Some super cheap Seagate Exos 12 TB drives, so cheap that they’re probably flashed ex-mining drives.

Vibration… never thought about that. I’ll try to put them on a towel and do a test run.

50 cm.

CPU: Intel(R) Core™ i3-6100 CPU @ 3.70GHz
Board: D820MT_D820SF_BM3CE
Memory: 8 GiB
PSU: ORUNBO 750W

I unplugged the drives and opened the system to get the hardware info, and then system reporting suddenly worked again… What?
System reporting was being stopped by… hard drives? Why?

I think we can rule out the cable lengths being a problem.

And the Exos series appears to be datacenter-oriented, so I imagine it has vibration considerations.

As for your latest post, it could be that the deal you got was “too good to be true”.

Are you able to show us the SMART data from your drives? That might give a clue to someone who knows more than I do, but I’m suspecting they’ve exceeded their MTBF.

Yeah, no: same thing.
And reporting is down again. It seems like this pool being online stops the CPU temperature report… just how?

Tried connecting 1 and 2 drives… yes, the report only stops when the pool is valid (at least 2 drives connected).

Okay, if no one has any idea what to do, I’ll just run the scrub again later.

I’d shut down the NAS and run a memtest before everything else.

Also, 8 GB is really the absolute (not recommended) bare minimum for the software to start at all. Maybe you’re running into problems caused by memory pressure.

That’s an idea…I’ll give it a shot, thanks.

Big news: the problem is on the drives.
I should’ve thought of this sooner: just connect the drives, including the OS drive, to my main rig.
It fails to connect to the net, but going into the shell and running sudo zpool status, there it is: the same 2 drives piling up checksum errors.
Literally everything except the drives was swapped, and the issue persists. Turns out the first thing that got ruled out was the answer.

So, what gives?
If I have to guess, the initial insufficient power probably caused something.
One thing I noticed this time: the errors are increasing by 4 about every 2 seconds, and this goes on as long as the system is running. I ran a scrub and woke up to 50k errors.
That probably means the same thing is causing errors over and over again, not something spreading. What could it be?
Looking back at the zpool status dump:

errors: Permanent errors have been detected in the following files:

        /var/db/system/netdata/context-meta.db-wal
        /var/db/system/netdata/netdata-meta.db
        /var/db/system/netdata/netdata-meta.db-wal
        /var/db/system/netdata/dbengine/journalfile-1-0000000057.njf
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.69260.1762103223000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.71272.1762103613000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.74868.1762104259000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.67888.1762102978000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.69885.1762103368000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.71963.1762103742000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.64957.1762102445000000.zst
        /var/db/system/cores/core.netdata.999.71c07f17d3fd46deb40c2629586f2aa5.73343.1762104014000000.zst

Nope, no idea what it means, but judging from the directories, my data is probably fine.
So… this has been in the post for a day at this point, and no one pointed it out… I’ll just throw it at Gemini.

The files listed are primarily related to Netdata, which is the system monitoring and performance analytics tool used by TrueNAS, along with several core dump files (crash reports) for the Netdata process.
That… makes a ton of sense; this perfectly explains why system monitoring breaks whenever the pool is loaded. Every time it tries to update the system status, errors are thrown, hence the consistent pile-up.
Gemini then suggested service netdata stop, which does stop the errors from piling up.
It then claimed the data is non-critical and safe to remove, and that I can run:

# Remove all netdata files (this will clear monitoring history)
rm -rf /var/db/system/netdata
# Remove the corrupted core dumps
rm -rf /var/db/system/cores

…I’m not trusting that just yet.

Is this safe to run? How likely is this to solve my problem? What’s the risk here?
Thanks for reading.

Because your NIC probably has a different name. You can try to fix it with Configure network adapters or Configure network settings (I don’t remember which is which). If you succeed, you will have the TrueNAS GUI up and running.


I’m not entirely sure whether these are viable solutions, so if you decide to follow one of them, do it at your own risk :warning:.

It seems like you have problems with your boot drive, so in my opinion the best solution would be:

  1. Restore web gui access.
  2. Go to the System → General Settings → Manage Configuration → Download file. Save your configuration.
  3. Turn off TrueNAS.
  4. Swap the boot drive for the new one.
  5. Install the same TrueNAS version on the new boot drive.
  6. Restore configuration with System → General Settings → Manage Configuration → Upload file.

If you want to try to go with the same drive and just remove problematic files, then:

  1. Save your configuration (just in case).
  2. Go to System → Boot. Pick the currently active boot environment and make a clone of it. Give it a meaningful name like <version>-remove-netdata-test.
  3. “Activate” this environment.
  4. Reboot. During boot there will be a GRUB menu with the list of available boot environments. If you activated your cloned environment, it will be selected by default.
  5. After the boot, remove the files as you planned. Reboot. Do some testing; if it helps, you are OK.
  6. If the removal somehow breaks the system (or even makes it unbootable), select your previous boot environment in the GRUB menu during boot.
  7. In System → Boot, remove your clone and activate the original one.
  8. You are back to square one now (with the errors in the netdata files).

Thanks. Yeah, if it’s an OS problem, the worst case is just a fresh install; should’ve thought about that.
I did the boot-environment backup and ran
sudo rm /var/db/system/cores/*
sudo rm -r /var/db/system/netdata/*
And now… system reporting is functioning again, and the checksum errors are growing super slowly now.

errors: Permanent errors have been detected in the following files:

        HDDs/.system/netdata-ae32c386e13840b2bf9c0083275e7941:<0xc>
        HDDs/.system/cores:<0x12d>

That’s… progress.
I’m running a scrub again just in case.

Nope, the scrub doesn’t fix this… I’ll figure it out tomorrow.

A scrub is not going to fix a hardware problem, and growing checksum errors are almost certainly a hardware problem.

…That is super unhelpful information after the problem has been narrowed down so drastically.