Pool Write Error Question

My pool is telling me that I have an unrecoverable error, but I don’t see anything in the SMART test. What am I missing?


ZFS is a file system, and that is where the error is being reported. The error can be caused by a drive, or by corrupt data that is not directly related to the storage drive, so it doesn’t necessarily mean a drive is at fault.

Take a look at my Flowcharts; they should help you fairly quickly.

Good luck, and let me know if I could improve the flowcharts.

Just to follow up: I do not suspect a drive failure, and won’t until you actually find a smoking gun. If you do not find a drive failure, then take a look at system stability.

I do not know your system configuration because you didn’t post it, but here are the things I’d look for:

  1. Are you power cycling or rebooting the system often?
  2. Did you power off/on or reboot just before the problem occurred?
  3. Do you have enough RAM and is it ECC?
  4. Have you run Memtest86+ for a few days and gotten all PASSED? I ran it on my system last week for 5 days solid. I was trying to determine whether I have an ESXi problem or a hardware problem. It looks like ESXi, but I use the free version, so no support.
  5. Have you run a CPU Stress Test for about 4 hours (you can do this for a solid month if your system was built correctly and is server grade)?

Basically, if there is no drive failure and you only have a ZFS corruption issue, think about what changed in the past week or so.
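One way to see whether ZFS itself is pointing at a particular device is to look at the per-device READ/WRITE/CKSUM counters in `zpool status`. The following is a minimal Python sketch, not something from this thread: the function names are hypothetical, and it assumes the usual column layout of the config table (NAME, STATE, READ, WRITE, CKSUM).

```python
import subprocess


def read_pool_status(pool):
    """Run `zpool status <pool>` and return its text output."""
    result = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def devices_with_errors(status_text):
    """Return config-table rows that are not ONLINE or have nonzero counters.

    Each row is (name, state, read, write, cksum). Counters are kept as
    strings because zpool may abbreviate large values.
    """
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # header row of the config table
            continue
        if not fields:
            in_table = False         # a blank line ends the table
            continue
        if not in_table or len(fields) < 5:
            continue                 # skip section labels such as "logs"
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or any(c != "0" for c in (read, write, cksum)):
            flagged.append((name, state, read, write, cksum))
    return flagged
```

On the NAS this could be called as `devices_with_errors(read_pool_status("tank"))`, where `tank` is a placeholder pool name. Counters that climb on several drives behind the same HBA or cable hint at a shared component rather than a single disk.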

It’s just getting worse. I tried a scrub and one drive failed, so I replaced it, and now there are lots of errors.

Actually, there have been a lot of changes recently. I created a new TrueNAS VM under Proxmox on a different server with a different HBA card, moved the drives to that server, and imported the pool. That worked surprisingly well, and it ran for a bit without issues. It was also a move from CORE to SCALE. During the scrub, one HDD’s status became REMOVED, so I replaced it, but there are still lots of errors. When I created the new VM I also doubled its RAM from 32GB to 64GB; it is ECC RAM, and the server is faster and more powerful. There is another HBA card attached to that VM, and that pool has been fine. Maybe it’s that HBA card.

Just to make sure that I am clear on the current configuration:

  • Proxmox on bare metal
  • TrueNAS VM on Proxmox (with 64GB RAM). Is that virtual RAM or real physical RAM? I know some hypervisors will let you over-allocate RAM, just like CPUs.
  • Is the HBA card passed through to the TrueNAS VM? Or did you map individual drives?

Yes sir, you did make a lot of changes at once. Can you roll back to your previous CORE VM and connect the drives to see how it operates? I use ESXi with TrueNAS VMs on top of it, so I have several TrueNAS VMs for testing that use the same pool, though not at the same time, of course.

Hopefully you have not “Upgraded the ZFS Feature Set”. As long as you don’t, you should be able to roll back to the earlier version without any real issue.
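To see where a pool stands before attempting a rollback, the `feature@...` properties reported by `zpool get` show each feature as `disabled`, `enabled`, or `active`. Below is a rough Python sketch for grouping them; the function name is hypothetical, and it assumes the tab-separated output of `zpool get -H -o property,value all <pool>`. It only lists the states. Whether an older release can import the pool still depends on which features that release supports.

```python
def features_by_state(get_output):
    """Group ZFS feature flags by state.

    `get_output` is the text printed by
    `zpool get -H -o property,value all <pool>`:
    one "property<TAB>value" pair per line.
    """
    states = {}
    for line in get_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue                      # ignore anything unexpected
        prop, value = parts
        if prop.startswith("feature@"):
            feature = prop[len("feature@"):]
            states.setdefault(value, []).append(feature)
    return states
```

Features listed under `active` are the ones to compare against what the older TrueNAS version understands.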

Wish I could be of more help but I think you have a lot to look at. Make one change at a time, especially if you can roll back to CORE and be fully operational.

Yes, Proxmox on bare metal, with 64GB of RAM on the VM set with balloon=0, so from what I understand that is dedicated physical RAM. The HBA card is passed through to the VM, as is another one. I did do the “Upgraded the ZFS Feature Set” step, as I needed to do that on the other pool to extend it, and that all worked fine too. I figured since it was fine there, I would do it on this pool too. I have this pool replicated to the other pool and to another server, so worst case I can blow it all away and start over, but I would like to figure out whether it’s the HBA card or not. I’ll keep working on it. Thanks for the help.
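For anyone wanting to double-check the same settings, the memory, ballooning, and passthrough entries can be read from the VM’s configuration (the text shown by `qm config <vmid>`, which uses `key: value` lines). This is a small Python sketch under that assumption; the function name and the returned field names are made up for illustration.

```python
def summarize_vm_config(config_text):
    """Pull memory, ballooning, and PCI passthrough entries from a
    Proxmox VM config given as "key: value" lines."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break                         # snapshot sections follow the live config
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)   # split once; values may contain ":"
        settings[key.strip()] = value.strip()
    return {
        "memory_mib": settings.get("memory"),
        "ballooning_disabled": settings.get("balloon") == "0",
        "passthrough": {
            k: v for k, v in settings.items() if k.startswith("hostpci")
        },
    }
```

Two `hostpci` entries in the result would correspond to the two HBA cards passed through to the VM.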

Just noticed I’m getting this over and over again in the console for that VM

Something you might consider trying…

  1. If you haven’t already done so, save a copy of your TrueNAS configuration.
  2. Install TrueNAS to a small drive (could be a USB stick for this testing period).
  3. Boot TrueNAS from the drive, just bare metal, no Proxmox.
  4. See how everything is doing.

If all is good, run like that for a while to make sure all remains good. If it is still good after 1 week, then you know Proxmox is likely not set up correctly.

If the system is still screwed up, you can suspect a hardware issue.

This will help isolate some things quickly.