Pool Write Error Question

Devin_Wright · April 22, 2025, 6:06pm

My pool is telling me that I have an unrecoverable error; however, I don’t see anything in the SMART test. What am I missing?

joeschmuck · April 22, 2025, 10:54pm

ZFS is a file system, and that is where the error is being reported. The error can be caused by a drive, or by corrupt data that is not directly related to the storage drive. It doesn’t mean a drive is at fault.
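
To make that concrete, here is a minimal sketch of how to read the device table that `zpool status` prints. It assumes the usual NAME / STATE / READ / WRITE / CKSUM layout; the pool name, device names and counts in the sample text are made up for illustration and are not from this thread.

```python
# Sketch only: SAMPLE_STATUS is invented text in the usual `zpool status` layout.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     3     0
            sdc     REMOVED      0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return table rows whose state is not ONLINE or whose counters are nonzero."""
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # header row found, data rows follow
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 5:
            continue  # section labels such as "logs" carry no counters
        name, state, read, write, cksum = fields[:5]
        # Counters are compared as strings because large values may be abbreviated.
        if state != "ONLINE" or any(c != "0" for c in (read, write, cksum)):
            flagged.append((name, state, read, write, cksum))
    return flagged


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE_STATUS):
        print(*row)
```

A drive that shows write or checksum counts while its SMART data is clean points away from the disk itself and toward the path to it (cabling, HBA, RAM, power).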

Take a look at my Flowcharts; they should help you fairly quickly.

Good luck, and let me know if I could improve the flowcharts.

joeschmuck · April 23, 2025, 1:46pm

Just to follow up: I would not suspect a drive failure until you actually find a smoking gun. If you do not find a drive failure, then take a look at system stability.

I do not know your system configuration since you didn’t post it, but here are the things I’d look for:

Are you power cycling or rebooting the system often?
Did you power off/on or reboot just before the problem occurred?
Do you have enough RAM, and is it ECC?
Have you run Memtest86+ for a few days and gotten all PASSED? I ran it on my system last week for 5 days solid, looking to see whether I have an ESXi problem or a hardware problem. It looks like ESXi, but I use the free version, so no support.
Have you run a CPU stress test for about 4 hours? (You can do this for a solid month if your system was built correctly and is server grade.)
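
As a rough illustration of what a smoking gun could look like in SMART data, the sketch below scans an ATA attribute table (the layout `smartctl -A` prints for SATA drives) for a few raw values that are commonly watched. The sample table and its numbers are invented, and the choice of attributes is an assumption, not a complete health check. SAS drives report errors in a different format.

```python
# Sketch only: SAMPLE_SMART is invented text in the ATA attribute-table layout.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       41234
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""

# Assumed watch list: the first three relate to the media, the last to the link.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    hits = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE_SMART))
```

In this made-up sample only the CRC count is nonzero. That usually indicates trouble on the link between drive and controller (cable, backplane, HBA) rather than a failing disk surface.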

Basically, if there is no drive failure and you only have a ZFS corruption issue, think about what changed in the past week or so.

Devin_Wright · April 23, 2025, 9:10pm

It’s just getting worse. I tried a scrub and one drive failed, so I replaced it, and now there are lots of errors.

Devin_Wright · April 24, 2025, 2:28pm

Actually, there have been a lot of changes recently. I created a new VM under Proxmox on a different server with a different HBA card, moved the drives to that server, and imported the pool. That worked surprisingly well, and it ran for a bit without issues. It was also a move from CORE to SCALE. During the scrub, one HDD’s status became REMOVED, so I replaced it, but there are still lots of errors. When I created the new VM I also doubled its RAM from 32GB to 64GB; it is ECC RAM, and the server is faster and more powerful. There is another HBA card attached to that VM, and the pool on it has been fine. Maybe it’s the HBA card this pool is on.

joeschmuck · April 24, 2025, 3:25pm

Just to make sure that I am clear on the current configuration:

Proxmox on bare metal
TrueNAS VM on Proxmox (with 64GB RAM). Is that virtual RAM or real physical RAM? I know some hypervisors will let you over-allocate RAM, just like CPUs.
Is the HBA card passed through to the TrueNAS VM? Or did you map individual drives?

Yes sir, you did make a lot of changes at once. Can you roll back to your previous CORE VM and connect the drives to see how it operates? I run ESXi with TrueNAS VMs on top of it, so I have several TrueNAS VMs for testing that use the same pool, though not at the same time, of course.

Hopefully you have not “Upgraded the ZFS Feature Set”. Do not do it, and you should be able to roll back to the earlier version without any real issue.
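
One hedged way to check this is to look at the pool’s `feature@...` properties, for example from `zpool get -H -o property,value all <pool>`, which prints tab-separated property/value pairs. The sketch below parses sample text in that shape. The sample values are invented, and the helper names are hypothetical.

```python
# Sketch only: SAMPLE_PROPS is invented tab-separated property/value text.
SAMPLE_PROPS = (
    "size\t10.9T\n"
    "feature@async_destroy\tenabled\n"
    "feature@lz4_compress\tactive\n"
    "feature@zstd_compress\tdisabled\n"
)


def feature_states(props_text):
    """Map each feature name to its state: disabled, enabled, or active."""
    states = {}
    for line in props_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip anything that is not a property/value pair
        prop, value = parts
        if prop.startswith("feature@"):
            states[prop[len("feature@"):]] = value
    return states


def not_yet_enabled(states):
    """Return the features this ZFS version knows about but the pool has not enabled."""
    return sorted(name for name, value in states.items() if value == "disabled")


if __name__ == "__main__":
    print(not_yet_enabled(feature_states(SAMPLE_PROPS)))
```

If features still show as disabled after the move, the pool has not been upgraded to the newer feature set. An empty list is not proof of an upgrade, since both versions may simply support the same features.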

Wish I could be of more help, but I think you have a lot to look at. Make one change at a time, especially if you can roll back to CORE and be fully operational.