Topic for reference.
Update on my issue:
It's none of the drives. It's not the RAM. It's not the PSU.
For now, it's the SATA controller or something else on the motherboard.
I have backups of my stuff, but I am currently here with my ZFS pool: "Permanent errors have been detected in the following files", so the pool is damaged.
ChatGPT: The failure is synchronous I/O overloading the motherboard SATA/PCIe path
Looking at completely new hardware at the moment, enterprise level.
He is our Jester, I can confirm it. He keeps me laughing.
As for the idea some folks have that RAM testing is not required: of course that is not true. This goes for the CPU as well as all the electronics. Failures do happen and many start out as intermittent, which sucks to troubleshoot.
This is good. You should purchase what hardware you need based on your use case for the NAS. The only "future proofing" I would advise is having enough RAM. If your use case does not include running apps/containers, then you would likely be perfectly fine with 16GB of RAM. However, if you do want to run apps in the future, purchase the RAM you feel you might need in a few months or a year.
As for the system you have right now, I'm not certain about what you have or haven't done recently, so I will just start at the beginning.
Questions:
- What are your current failing indications? I suspect it is ZFS corruption issues since you have already replaced a failing drive.
- Do the failing indications move from one drive to another, or is it always a specific set of drives, or all drives? If you have, say, two drives that continue to fail: scrub the pool, clear the errors, scrub the pool again, and ensure there are no new errors. Then, and only then, swap one of the failing drives with a known good drive. Why? We are verifying the hardware data connectivity and power. If the problem moves to the new drive, then you have verified that the failure is in the hardware around the drive and not in the drive itself.
- If the failures are random, have you run a CPU stress test (like Prime95) for at least 4 hours? I'd let it run all day since you are troubleshooting a problem. Yes, the CPU and motherboard will get hot; that is part of the test as well, to check for thermal connection issues.
That is the start.
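To make the scrub / clear / scrub comparison above less error-prone, here is a minimal Python sketch that pulls the CKSUM column out of `zpool status` text so two runs can be compared. It is only an illustration: the pool name `tank`, the device names, and the sample output are made up, and the parser assumes the usual `NAME STATE READ WRITE CKSUM` table with plain integer counts (rows with abbreviated counts such as `1.2K` are skipped).

```python
# Hypothetical sample of `zpool status` output; pool and device names are invented.
SAMPLE = """  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  raidz1-0  ONLINE       0     0     0
\t    sda     ONLINE       0     0    12
\t    sdb     ONLINE       0     0     0

errors: No known data errors
"""


def parse_cksum_errors(status_text):
    """Return {row_name: cksum_count} for every row of the config table.

    The result includes the pool and vdev rows as well as the disks.
    """
    errors = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if fields and fields[0] == "errors:":
            break  # the summary line marks the end of the table
        if len(fields) < 5 or not fields[4].isdigit():
            continue  # blank lines, section labels, abbreviated counts
        errors[fields[0]] = int(fields[4])
    return errors


def devices_with_errors(counts):
    """Names of all rows whose checksum count is above zero, sorted."""
    return sorted(name for name, n in counts.items() if n > 0)


if __name__ == "__main__":
    print(devices_with_errors(parse_cksum_errors(SAMPLE)))
```

On a real system the text would come from running `zpool status` for your pool after each scrub. Saving the result of each run and comparing the device lists shows whether the errors stay with one position or follow a drive.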
I would still recommend you purchase appropriate hardware. I don't know what CPU you currently have but you may be able to reuse it. Here is a link to my AMD build. I have listed the good and bad things about my motherboard choice, but the system is rock solid and I've been running TrueNAS SCALE on bare metal for a few months (I like ESXi but testing some features required me to run bare metal). If you have any questions about this, just ask. And I'm not suggesting you purchase this setup, it is only one example.
I took each hardware piece to another PC and tested it, except CPU and Motherboard.
I clear the checksum errors on ZFS and run a scrub on the pool, and each and every time the scrub results in problems. All 4 of the 4 drives in the pool have some number of checksum errors.
I will do this once I migrate my data and set up the new NAS. I narrowed it down to the CPU + motherboard, and of all the possibilities I suspect either a degrading AMD 5600X CPU or a failed SATA controller on the board.
I'm eyeing the ThinkStation P500 platform; it hits all the marks from what I see.
Looks good.
Question: Do you need the cache drive? If you have more RAM, then you will definitely be faster than with the cache drive. But if you are accessing the same files over and over again, about 1.5TB of files, then the cache makes sense to me. I'm trying to save you a headache by helping with your configuration. And if this is to augment your NVMe pool (VMs, containers), then all I see is an added slowdown, assuming the ZFS cache works like it used to a few years ago (first it checks the cache for the files, and if they are not found it then checks the pool), which would add latency.
Of course, you can do things in your own way, it is only friendly advice.
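To put rough numbers on the cache question, here is a small Python sketch of the simple model described above: every read checks the cache first, and a miss then goes to the pool. This is only an illustration of that assumed model, not a statement about how current OpenZFS actually behaves, and the latencies and hit rates in it are made-up placeholders.

```python
def expected_read_latency(hit_rate, cache_ms, pool_ms):
    """Average read latency when the cache is checked before the pool.

    A hit costs one cache lookup; a miss costs the cache lookup plus the
    pool read. All values are illustrative, in milliseconds.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ms + (1.0 - hit_rate) * (cache_ms + pool_ms)


def cache_helps(hit_rate, cache_ms, pool_ms):
    """True if the cached path beats reading straight from the pool."""
    return expected_read_latency(hit_rate, cache_ms, pool_ms) < pool_ms


if __name__ == "__main__":
    # Hypothetical slow pool with a faster cache device.
    print(cache_helps(0.5, 1.0, 10.0))
    # Hypothetical fast NVMe pool with a slower cache device.
    print(cache_helps(0.9, 0.2, 0.1))
```

Under this model the cache only pays off when the cache latency is below the hit rate times the pool latency, so a cache device that is slower than the pool it sits in front of cannot help no matter how high the hit rate is.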
That is a good question I asked myself too. Looking at the metrics it seems I don't need it. There is not so much load on the pools anyway.
And that is one less possible problem in the future.
This looks oddly similar to the issue I had a few months ago. You can read more here.
Before dumping money on new HW, I would try destroying and recreating the pool (if you have the backups to do it).
Yes, this looks very similar.
The machine was decommissioned, and I already got my hands on a used Lenovo P500 Workstation.