I’m having some major problems with checksum errors and can’t figure out why. First off, my NAS is only for media, for Plex. I don’t run Plex on it, as I do that from my PC…it’s just media storage.
Z170N-WIFI
i7-7700
16GB RAM
NVME boot drive
3 x 12TB Seagate IronWolf Pros
2.5 gig NIC
Jonsbo case
The issues started immediately after building my new NAS. When I first started copying over data I was getting tons of I/O errors; it turned out the thin SATA cables were bad, so I replaced them and those errors went away. That’s when the checksum errors started, within a couple of days. I thought it might be the drives, so I replaced them with new ones…same thing. I then thought it was corruption from the I/O errors, so I started over: deleted the dataset, created a new one, and copied everything back over with NO I/O errors. Over 30k checksum errors on each drive. I tried some of the files it said were corrupt…they are perfectly fine.
What do I do now? Disable checksums? A possible RAM issue? What…I am lost and this is frustrating. If I can’t figure it out, it’s time for another NAS OS.
@TheLynchMob
This is actually a fairly easy problem to start isolating.
Run MemTest86+ on your RAM for 5 complete passes. Testing 16GB of RAM should not take very long, so you might consider just running it overnight.
Run a CPU stress test for at least 4 hours. Longer is better, but 4 hours is, in my opinion, a reasonable amount of time to weed out any gross errors. I personally ran the CPU stress test on my last build a few weeks ago and let it go for over 24 hours, and I ran MemTest86+ for several days (three, I think). I want to make damn sure that the hardware I can test easily is tested right away.
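If you would rather stress the CPU from a shell or a live Linux USB than run Prime95 interactively, stress-ng is one option. This is just a sketch; the 4-hour duration and the matrixprod method are my own choices, not a requirement:

# load every core with a compute-heavy method, verify results, stop after 4 hours
stress-ng --cpu 0 --cpu-method matrixprod --verify --timeout 4h --metrics-brief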
Run smartctl -a /dev/sda and examine the output. Are there any errors? You can just post the results for each drive and we will go through them and point out what to look for. This data can tell you if you have a SATA cable issue: look at the RAW value of UDMA_CRC_Error_Count. This value never decreases, it lives with the drive forever, so note it down if it is greater than zero. If the value increments, replace the SATA cable for that drive. Does the value continue to increase? If yes, plug that drive into a different SATA port and see if it still increases. You get the idea…isolate the issue.
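For example (the device name /dev/sda is a placeholder, repeat for each of your drives; attribute names can vary a little between vendors):

# full SMART report for one drive, run as root
smartctl -a /dev/sda

# just the attributes most relevant here: cabling (UDMA CRC) and media health
smartctl -A /dev/sda | grep -i -E "udma_crc|reallocated|pending|uncorrect"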
If you have not already done so recently, run a SMART long (extended) test on each of the drives. You may want to run only a few drives at a time; that is up to you. Once each drive has completed the extended test, repeat the smartctl check above.
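Again with smartctl (device names are placeholders), starting and checking the extended test looks roughly like this; on 12TB drives expect it to take many hours:

# start the extended self-test; the drive stays usable while it runs
smartctl -t long /dev/sda

# check progress and, once it finishes, the result log
smartctl -a /dev/sda | grep -A1 "Self-test execution"
smartctl -l selftest /dev/sda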
Look at the Drive Troubleshooting Flowcharts link in my signature. It has a great deal of information to help you out.
This should be enough to keep you busy for a while.
Something else you can do is run a scrub, then check zpool status -v…and you know what, read the flowcharts…
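For reference, the commands are along these lines (the pool name tank is just a placeholder for whatever you called yours):

# start a scrub of the whole pool
zpool scrub tank

# check progress, per-device error counters, and any files with permanent errors
zpool status -v tank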
Thanks for the suggestions. I had been reading that RAM might be the issue…some said it could be, others said it couldn’t…so I swapped it for some other RAM I have and ran a scrub. 100% good now!
Yup, you really should still run those two main tests, MemTest86+ and a CPU stress test like Prime95 or similar. A scrub only validates that the data on the drives still matches the checksums ZFS recorded when it was written. You may still need to check that your data is actually good…it could have been corrupted in RAM before it was ever written to the pool.
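If you still have the original copies on another machine, one way to spot that kind of silent corruption is a checksum-based compare. This is only a sketch; the paths are made up and it assumes rsync is available on both ends:

# dry run (-n), compare file contents by checksum (-c) rather than size/date,
# and list anything that differs between the source and what is on the pool
rsync -rcn --itemize-changes /path/to/original/media/ /mnt/tank/media/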
Hopefully anyone thinking about building a NAS will read this thread and realize that we are not recommending these diagnostic tests for the heck of it. There is a real reason and need: you must have solid, stable hardware, or your data is not safe from corruption at all.