I’m having some major problems with checksum errors and can’t figure out why. First off, my NAS is only for media, for Plex. I don’t run Plex on it, as I do that from my PC…it’s just media storage.
Z170N-WIFI
i7-7700
16GB RAM
NVME boot drive
3 x 12TB Seagate IronWolf Pros
2.5 gig NIC
Jonsbo case
The issues started immediately after building my new NAS. When I first started copying over data I was getting tons of I/O errors; it turned out the thin SATA cables were bad, so I replaced them and those errors went away. That’s when the checksum errors started, within a couple of days. I thought it might be the drives, so I replaced them with new ones…same thing. I then thought it was corruption from the I/O errors, so I started over: deleted the dataset, created a new one, and copied everything back over with NO I/O errors. Over 30k checksum errors on each drive. I tried some of the files it said were corrupt…they are perfectly fine.
What do I do now? Disable checksums? A possible RAM issue? What…I am lost and this is frustrating. If I can’t figure it out, it’s time for another NAS OS.
@TheLynchMob
This is actually a fairly easy problem to start isolating.
Run MemTest86+ on your RAM for 5 complete passes. Testing 16GB of RAM should not take very long, so you might consider just running it overnight.
Run a CPU stress test for at least 4 hours. Longer is better, but 4 hours is, in my opinion, a reasonable amount of time to weed out any gross errors. I personally ran the CPU stress test on my last build a few weeks ago and let it go for over 24 hours, and I ran MemTest86+ for several days (three, I think). I want to make damn sure that the hardware I can test easily is tested right away.
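If you would rather stress the CPU from a shell or a live Linux USB than run Prime95 interactively, stress-ng is one option. This is just a sketch; the 4-hour duration and the matrixprod method are my own choices, not a requirement:

# load every core with a compute-heavy method, verify results, stop after 4 hours
stress-ng --cpu 0 --cpu-method matrixprod --verify --timeout 4h --metrics-brief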
Run smartctl -a /dev/sda and examine the output. Are there any errors? You can just post the results for each drive and we will go through them and point out what to look for. This data can tell you if you have a SATA cable issue: look at the RAW value of UDMA_CRC_Error_Count. This value never decreases, it lives with the drive forever, so note it down if it is greater than zero. If the value increments, replace the SATA cable for that drive. Does the value continue to increase? If yes, plug that drive into a different SATA port and see if it still increases. You get the idea…isolate the issue.
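For example (the device name /dev/sda is a placeholder, repeat for each of your drives; attribute names can vary a little between vendors):

# full SMART report for one drive, run as root
smartctl -a /dev/sda

# just the attributes most relevant here: cabling (UDMA CRC) and media health
smartctl -A /dev/sda | grep -i -E "udma_crc|reallocated|pending|uncorrect"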
If you have not already done so recently, run a SMART long (extended) test on each of the drives. You may want to run only a few drives at a time; that is up to you. Once each drive has completed the extended test, repeat the smartctl check above.
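Again with smartctl (device names are placeholders), starting and checking the extended test looks roughly like this; on 12TB drives expect it to take many hours:

# start the extended self-test; the drive stays usable while it runs
smartctl -t long /dev/sda

# check progress and, once it finishes, the result log
smartctl -a /dev/sda | grep -A1 "Self-test execution"
smartctl -l selftest /dev/sda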
Look at the Drive Troubleshooting Flowcharts link in my signature. It has a great deal of information to help you out.
This should be enough to keep you busy for a while.
Something else you can do is run a scrub, then check zpool status -v…and you know what, read the flowcharts…
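For reference, the commands are along these lines (the pool name tank is just a placeholder for whatever you called yours):

# start a scrub of the whole pool
zpool scrub tank

# check progress, per-device error counters, and any files with permanent errors
zpool status -v tank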
Thanks for the suggestions. I had been reading that RAM might be the issue…some said it could be, others said it couldn’t…so I swapped it for some other RAM I have and ran a scrub. 100% good now!
Yup, you really should still run those two main tests, MemTest86+ and a CPU stress test like Prime95 or similar. A scrub only validates that the data on the drives still matches the checksums ZFS recorded when it was written. You may still need to check that your data is actually good…it could have been corrupted in RAM before it was ever written to the pool.
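If you still have the original copies on another machine, one way to spot that kind of silent corruption is a checksum-based compare. This is only a sketch; the paths are made up and it assumes rsync is available on both ends:

# dry run (-n), compare file contents by checksum (-c) rather than size/date,
# and list anything that differs between the source and what is on the pool
rsync -rcn --itemize-changes /path/to/original/media/ /mnt/tank/media/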
Hopefully anyone thinking about building a NAS will read this thread and realize that we are not recommending these diagnostic tests for the heck of it. There is a real reason and need: you must have solid, stable hardware, or your data is not safe from corruption at all.