Storage issue. Checksum errors

Hello,

I’m new to TrueNAS (and to any NAS, or even Linux…).
I built a home server, installed TrueNAS (now updated to 25.10.0),
configured the SMB shares, and copied ~1TB of family media. All was good…
Then I installed Jellyfin and PiHole, and everything looked fine. I copied a few TV shows for the Jellyfin media library, also without issues. A few days later I copied some more series, and then noticed an error under Storage: ZFS Health - “Pool is not healthy”. I ran a scrub, but the error was still there afterwards.
I checked zpool status and found multiple checksum errors in the newly copied TV shows (multiple random files).

So I deleted them and copied them again from my PC. I reran the scrub, and the pool returned to healthy.

I thought it was just a one-time thing… but no… It happened again, with some other files (some game mods, I think…)

So I have tried to investigate the issue.

Ran SMART tests for the HDDs - no errors.

I tried monitoring checksums before and after copying some test files, and sometimes a random file got corrupted after being copied to the NAS.
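If the checksums being compared here are file hashes, the before/after check could look roughly like the sketch below. The function name `verify_copy` is made up for illustration, and it assumes the source file and the copy are both readable from the same machine (for example with the share mounted). Keep in mind that a hash computed on a machine with faulty RAM is itself not fully trustworthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the SHA-256 of a source file with its copy.
# Prints "match <copy>" or "MISMATCH <copy>"; returns 2 if a file cannot be read.
verify_copy() {
    local src="$1" dst="$2"
    local a b
    a=$(sha256sum < "$src") || return 2   # output looks like "<hash>  -"
    b=$(sha256sum < "$dst") || return 2
    a=${a%% *}                            # keep only the hash
    b=${b%% *}
    if [ "$a" = "$b" ]; then
        echo "match $dst"
    else
        echo "MISMATCH $dst"
        return 1
    fi
}
```

Usage would be something like `verify_copy /path/on/pc/file.mkv /mnt/tank/shared/NAS_Test/file.mkv`, repeated for each test file.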

Then I tried a local read test, to take the network out of the equation.
I ran this script:

LOGDIR=/mnt/tank/shared/NAS_Test
OUT="$LOGDIR/read-stress.log"
ERR="$LOGDIR/read-stress-errors.log"
PIDFILE="$LOGDIR/read-stress.pid"
echo $$ > "$PIDFILE"
echo "=== READ STRESS START $(date) ===" >> "$OUT"
echo "=== READ STRESS START $(date) ===" >> "$ERR"

# Loop forever reading every file one by one
while true; do
    find /mnt/tank/shared/NAS_Test -type f -print0 | while IFS= read -r -d '' f; do
        echo "READ $(date +%Y-%m-%dT%H:%M:%S) $f" >> "$OUT"

        # try reading with dd; dd returns non-zero on I/O error
        if ! dd if="$f" of=/dev/null bs=1M status=none conv=sync 2>>"$OUT"; then
            echo "ERR $(date +%Y-%m-%dT%H:%M:%S) $f" >> "$ERR"
            echo "---- dmesg tail at $(date) ----" >> "$ERR"
            dmesg | tail -n 80 >> "$ERR"
            echo "---- end dmesg ----" >> "$ERR"
        fi
    done
done

The result was that sometimes a random file returned an error when read (Input/output error). But the same file then passed without error on a later pass, which would suggest that the error comes from the read process, not that the file itself is corrupted.
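To make that "fails once, passes later" pattern easier to spot, each file could be read several times in a row and the outcome classified. This is only a sketch: `classify_read` is a made-up name, and it assumes GNU dd. Successful reads may be served from cache on later attempts, so an "ok" here is weaker evidence than a "transient".

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one file several times and summarise the attempts.
# Prints "ok" (no failures), "transient" (some failed), or "persistent" (all failed).
classify_read() {
    local file="$1" attempts="${2:-3}"
    local i=0 failures=0
    while [ "$i" -lt "$attempts" ]; do
        # dd exits non-zero on an I/O error or if the file cannot be opened
        dd if="$file" of=/dev/null bs=1M status=none 2>/dev/null || failures=$((failures + 1))
        i=$((i + 1))
    done
    if [ "$failures" -eq 0 ]; then
        echo ok
    elif [ "$failures" -lt "$attempts" ]; then
        echo transient
    else
        echo persistent
    fi
}
```

A file that comes back as "transient" points away from data that is bad on disk and toward something in the read path (RAM, CPU, motherboard, power).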

Then I tried a local raw test on each of the HDDs with this (note that this dd only reads from the disk; it writes nothing to it):

sudo nohup dd if=/dev/sda of=/dev/null bs=1M status=progress > /mnt/tank/shared/NAS_Test/dd-sda.log 2>&1 & echo $! | sudo tee /mnt/tank/shared/NAS_Test/dd-sda.pid
(and sdb for the second)

I ran them separately: no errors. Then I ran them together until each reached 2TB… no errors…
So this would mean that reading each disk directly has no issue, only when going through ZFS?

And now I do not know what to do next. How do I find the root cause of this?

My hardware:
CPU: INTEL Core i5-14500
MB: ASUS PRIME B760M-A WIFI D4
RAM: G.Skill | Ripjaws V | 32 GB | DDR4 | 3600 MHz | CL16
OS SSD: Samsung 970 evo 500GB
Storage HDDs: 2x SEAGATE NAS HDD 8TB IronWolf 7200rpm ( VDEV created with Mirror layout)
PSU: be quiet! Pure Power 13 M | 850W

Any suggestions (as detailed as possible) would be appreciated.

System stability comes to mind when you are thinking it is a ZFS issue.

Instructions:

  1. Post the output of zpool status -v so we can see what is being reported.
  2. If you have CKSUM or other errors, run zpool scrub poolname to start a scrub.
  3. If the scrub finishes without any errors (bytes repaired = 0) then run zpool clear poolname to clear the errors.
  4. Run memtest86+ for at least 5 complete passes.
  5. Next run Prime95 or similar for at least 4 hours to heat up the CPU and motherboard. We are not only checking for a suspect CPU, but also bad solder joints.
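Steps 2 and 3 could be scripted roughly as follows. This is a sketch, not an official procedure: the function names are made up, the pool name is passed in as an argument, and it assumes an OpenZFS version that has `zpool wait` (2.0 or later). The check simply looks for the "repaired 0B ... with 0 errors" wording on the scan line.

```shell
#!/usr/bin/env bash
# Read `zpool status` text on stdin; succeed only if the last scrub
# reports 0B repaired and 0 errors.
scrub_is_clean() {
    grep -Eq 'scan: scrub repaired 0B in .* with 0 errors'
}

# Hypothetical wrapper for steps 2 and 3; "$1" is the pool name.
scrub_then_clear() {
    local pool="$1"
    zpool scrub "$pool" || return 1
    zpool wait -t scrub "$pool" || return 1   # blocks until the scrub has finished
    if zpool status "$pool" | scrub_is_clean; then
        zpool clear "$pool"
    else
        echo "scrub reported repairs or errors; not clearing" >&2
        return 1
    fi
}
```

For example, `scrub_then_clear tank` would scrub the pool, wait, and only clear the counters if the scrub came back clean.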

This is what I recommend you start with.

And welcome to the forums.


zpool status -v

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:04 with 0 errors on Fri Nov 7 03:45:05 2025
config:

    NAME         STATE     READ WRITE CKSUM
    boot-pool    ONLINE       0     0     0
      nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 01:31:04 with 0 errors on Thu Nov 6 12:03:35 2025
config:

    NAME                                      STATE     READ WRITE CKSUM
    tank                                      ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        e836aeb0-e0f7-4373-8297-64afec7bd2a1  ONLINE       0     0     2
        fff0bf6a-6695-4434-9cf6-ab446196060c  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

    /mnt/tank/shared/NAS_Test/NAS_Test3/Sirens.S01E02.Talons.1080p.NF.WEB-DL.H.264-EniaHD.mkv
    /mnt/tank/shared/NAS_Test/NAS_Test/NAS_Test/Back.To.The.Future.2.1989.1080p.BluRay.x264.EN.LT-NN.mkv

There are still errors in the test files. Deleting and uploading them again, or just uploading them to a different subfolder, can give a different list of checksum errors.

And memtest86+ has results… plenty of them.

I tested both RAM sticks together twice, and both times everything froze at the same point (Pass 89% / Test 72%).
Then I removed one stick: still errors. I put it back and removed the other one: still errors…

I installed same-spec RAM from my main PC: no errors… (first with one stick, then I added the second just to be sure)

So the RAM has got to go, and I guess there is no need to do the Prime95 test anymore?


Impossible! I was told by a very authoritative source that memtests are useless!

Correct. Even just a single error warrants replacing the RAM. It needs to be 100% good 100% of the time.

Next time save some money with my coupon.


@Gimis Glad you have isolated the problem to a real hardware issue. Sorry it wasn’t a hard drive, that would have been cheaper. Right now you don’t know if it’s the RAM, CPU, MB, or power supply.

Something you might try: underclock the RAM to 1600MHz and see how it reacts.

People think that you test your system when you buy it and assume it will never fail those kinds of tests in the future. These few tests are crucial and should be performed probably once a year, more often if you like, definitely if you are having odd problems. They are easy tests to run.

@joeschmuck But why are there still suspects other than the RAM?
I got errors in memtest with the original RAM, and no errors when I switched to RAM from another PC. Or could a problem in another component have damaged the original RAM? A faulty PSU or MB over-volting the RAM, or something like that?

The RAM interfaces with the MB directly, then all (most) of the address lines go to the CPU, and they all have to have power. Any one of these could be at fault; you kind of proved it already when you tested the RAM modules separately and both failed. Is it the RAM? I don’t know.

Here are a few questions that should be answered:

  1. Has this system EVER run before without a problem? I do mean for several months.
  2. Is the RAM on the QVL?
  3. The motherboard does seem to support the RAM speed; however, it is an overclocked setting. So slow things down to a native speed of 3200, if for nothing else than testing.
  4. You might reset the BIOS to Factory Default and then see what the RAM speed looks like and test the RAM again.
  5. Your CPU does not look overclocked, which is a good thing, and the temperature looks good as well.

A test you could perform: Install only one stick of RAM into the blue (or lighter colored) slot, the second one from the CPU (slot DIMM_A2). Now test. If it passes, remove that module and install the second RAM module into the same slot, DIMM_A2. If this passes, install the modules so you fill slots DIMM_A2 and DIMM_B2 (the third slot away from the CPU) and test the memory again.

The slots are important, the failing address is important.

Let’s assume the first test fails, then try different combinations of a single stick.

Do they all seem to fail in the same general area? (single stick only)

If the failures are on both sticks and in the same memory test location on both sticks, then the odds of it being the RAM are very slim. Use the Failing Address as your reference; it matters whether the failure moves around or always occurs in a specific address range.
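If you write the failures down as you go, the comparison between sticks can be done mechanically. The sketch below assumes a hand-kept notes format of one failure per line, `<stick> <slot> <failing-address>`; that format and the function name are made up for illustration. It only catches exact address matches, so failures that are merely in the same general range still need an eyeball check.

```shell
#!/usr/bin/env bash
# Input on stdin: one failure per line as "<stick> <slot> <failing-address>".
# Prints each address that failed on more than one distinct stick.
shared_failing_addresses() {
    awk '
        NF >= 3 {
            key = $3 SUBSEP $1                 # address + stick pair
            if (!(key in seen)) {              # count each stick once per address
                seen[key] = 1
                sticks[$3]++
            }
        }
        END {
            for (addr in sticks)
                if (sticks[addr] > 1) print addr
        }
    ' | sort
}
```

Usage would be `shared_failing_addresses < memtest-notes.txt`, where `memtest-notes.txt` is whatever file you keep the notes in. Any address it prints failed on at least two different sticks, which, per the reasoning above, points away from the RAM itself.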

To Do:

  1. Install one stick of RAM into DIMM_A2.
  2. Run MemTest86+. Does it fail? If yes, go to the next step.
  3. Reset the BIOS to factory settings.
  4. Run MemTest86+. Does it fail? If yes, go to the next step.
  5. Remove the installed RAM stick and insert the second one.
  6. Run MemTest86+. Does it fail? If yes, go to the next step.
  7. Did the RAM test fail in the same location as the first RAM test? If yes, go to the next step. If no, you likely have incompatible RAM.
  8. Install a single stick of RAM into slot DIMM_B2.
  9. Run MemTest86+. Does it fail? If yes, go to the next step. If no, it looks like the motherboard or CPU.
  10. Replace the installed stick with the other stick in slot DIMM_B2.
  11. Run MemTest86+. Does it fail, and does it fail in the same location (failing address)? Yes or no?

This is a lot of testing and will take time unless you have quick failures.

IMPORTANT: If MemTest86+ makes it through one complete pass, keep going and let it complete 5 passes in total. This is my minimum number required to realistically feel good about the RAM test.

Report back your results.

Be very attentive to what you are doing and the results. Snap a photo like you did above so you have the data for each test case, you may need to look at it again. If you get confused, you can post those images as well.

I hope this explains why I would not assume it is a RAM failure just yet. I’m not saying it isn’t a RAM failure; however, the testing steps above will, at a minimum, help you discover whether the RAM sticks are good or not.