ZFS checksum errors due to Incus VMs on 25.04

Branching off my comment from the release post: TrueNAS 25.04.0 now available! - #49 by TheJulianJES

I’ll update this thread, as I get more information on this and try some things.

TL;DR is that running Incus VMs on 25.04 seemingly causes checksum errors on a different (encrypted) HDD pool while a scrub is running there.
Shutting down the VMs and unsetting the Incus pool lets subsequent scrubs run without any issues.
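For anyone wanting to tally errors the same way, here's a minimal sketch of summing the CKSUM column from `zpool status`-style output. The heredoc below is a made-up sample (pool/disk names are assumptions matching my layout); on a live system you'd pipe in the real `zpool status dozer` instead.

```shell
# Minimal sketch: sum the CKSUM column from `zpool status`-style output.
# The heredoc is a made-up sample; on a live system, replace it with the
# real command: zpool status dozer
sample_status() {
cat <<'EOF'
  pool: dozer
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        dozer       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     1
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     2
EOF
}

# Sum every ONLINE row's CKSUM column. Note: on a real pool, the pool and
# raidz rows aggregate their children, so you may want to match disks only.
total=$(sample_status | awk '$2 == "ONLINE" && $5 ~ /^[0-9]+$/ { sum += $5 } END { print sum + 0 }')
echo "CKSUM errors in sample: $total"
```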

System: i7 7700k/Z270F, on-board SATA controller for 3x Seagate Exos x22 22TB HDDs in a RAIDZ(1), NVMe SSD for boot, another NVMe SSD for “ssd” pool where Incus VMs were running.

Text below is a copy of the linked comment for reference:

Upgraded from RC.1 to the release version, as soon as it was available.
I previously had system freezes on RC.1, likely related to Incus VMs: TrueNAS 25.04-RC.1 is Now Available! - #141 by TheJulianJES

I did a bit of research after that post and stumbled upon this thread: Incus VM Crashing, which I also experienced on RC.1.
After upgrading to the Fangtooth release version, I imported the existing ZVOLs into the managed Incus volumes and kinda hope that the freezing and crashing problems are resolved with that (all VMs use VirtIO-SCSI). I’ll keep monitoring obviously.


However, I’ve been getting checksum errors on my main (encrypted) HDD pool on an i7 7700k/Z270F system since upgrading from 24.10 to 25.04: only 1 or 2 checksum errors per run, sometimes spread across multiple disks.
I did five(!) separate full scrubs with system reboots in between and got checksum errors every time. I obviously suspected the disks first, but they seem to check out fine; a full/long SMART test also completed without issues. It also seemed unlikely that multiple disks with widely differing production dates would be “failing” at the same time.
Since the system doesn’t have ECC RAM, I ran a multi-hour memory test, which checked out fine as well.

Later, I also read Frequent Checksum Errors During Scrub on ZFS Pool · Issue #16452 · openzfs/zfs · GitHub (about an AMD CPU) and someone mentioned that VMs can sometimes impact ZFS checksum calculation on “flawed” hardware.

Finally, I shut down both of my Windows Incus VMs that were running on a separate (encrypted) SSD pool, unset the Incus/VM pool to completely disable that part of TrueNAS, and did another reboot of the machine.
Now, a sixth scrub is about 70% done and I’ve got no checksum errors so far, whereas I always had some before with 25.04 and the VMs running.
(24.10 was fine since release with the same VMs and pool, never any issues during scrubs.)

Digging into the checksum errors, the same zio_objset and zio_object values were present across different scrubs: the same piece of encryption metadata on a particular dataset, written/created months ago according to ZDB.
Other errors were present as well, but they differed a bit.
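To compare errors run-to-run like this, the objset/object IDs can be pulled out of `zpool events -v` output. Here's a sketch; the event text in the heredoc is a trimmed, assumed sample (the IDs are invented). On a real pool, the hex object ID (converted to decimal) can then be fed to `zdb -dddd pool/dataset <object>` to see which object it belongs to.

```shell
# Sketch: extract zio_objset/zio_object from checksum ereports so the same
# IDs can be diffed across scrubs. The heredoc is an assumed sample; the
# live command would be: zpool events -v dozer
sample_events() {
cat <<'EOF'
class = "ereport.fs.zfs.checksum"
        pool = "dozer"
        zio_objset = 0x1234
        zio_object = 0x56
EOF
}

# Split on " = ", keep only the zio_obj* fields, strip leading whitespace.
ids=$(sample_events | awk -F' = ' '/zio_obj/ { gsub(/^[ \t]+/, "", $1); print $1, $2 }')
echo "$ids"
```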

The same object seemingly being corrupt multiple times reinforced my suspicion of a hard drive issue at first, but I now feel like the CPU keeps messing up the same checksum calculation there, for some reason…
This is obviously a very weird issue, especially since 24.10 VMs weren’t seemingly affecting scrubs on an entirely different ZFS pool, but I’d be really interested to see if anyone else suddenly sees checksum errors on their pools when running Incus VMs.

Both the SSD pool where the Incus VMs were running and the main HDD pool with the issues are encrypted. Encryption changes how ZFS calculates/verifies checksums compared to unencrypted pools/datasets, so perhaps that contributes to the issue(?)

I’ll do some more scrubs, re-enable Incus VMs to verify the issue appears/vanishes like described, and so on. I’ll likely also try swapping the 7700k/Z270F with a Ryzen 5650G on an x570 board, keeping everything else the same to see if the issue also manifests on a different platform.

I can’t really imagine what Incus would be doing differently to possibly cause such a weird issue like that. I’m also not aware of the i7 7700k having/causing any issues like this. If there are any more ideas on what else to test, let me know.
Can you describe the system… size of RAM, number of drives, etc.?

My “statistical” guess would be that non-ECC memory errors could cause these issues. It can take days/months to see another error, so memory testing doesn’t rule it out.

Is the encryption based on ZFS or SED?

These are the drives:

  • 3x Seagate Exos X22 ST22000NM001E 22 TB for the HDD pool “dozer”
  • 1x Samsung SSD 970 EVO 1 TB for the SSD pool “ssd” where the VMs are running, also set as the system pool
  • 1x WD Black SN750 NVMe 500 GB for the “boot-pool”

Encryption is all done in ZFS for the “ssd” and “dozer” pools (so, the root dataset).
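Purely to illustrate what “encryption on the root dataset” looks like here, a quick sketch of checking the property. The output in the heredoc is an assumed sample; the live command would be `zfs get encryption dozer ssd`.

```shell
# Sketch: check which root datasets are ZFS-encrypted.
# The heredoc is assumed sample output; live: zfs get encryption dozer ssd
sample_props() {
cat <<'EOF'
NAME   PROPERTY    VALUE        SOURCE
dozer  encryption  aes-256-gcm  -
ssd    encryption  aes-256-gcm  -
EOF
}

# Count rows where the encryption property is anything other than "off".
encrypted=$(sample_props | awk '$2 == "encryption" && $3 != "off" { n++ } END { print n + 0 }')
echo "encrypted root datasets in sample: $encrypted"
```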

This is the other hardware:

  • Motherboard: ASUS ROG STRIX Z270F GAMING
  • CPU: Intel i7-7700k
  • 4x 16 GB Corsair VENGEANCE DDR4 3200MHz (CMK32GX4M2E3200C16)

Docker applications running during the scrubs:

  • Immich, Jellyfin, Librespeed (ix-app), OpenSpeedTest (ix-app), Plex, Swag (ix-app)

VMs running during the scrubs:

  • Windows Server 2025 (based on 24H2)
  • Windows 11 LTSC (24H2)

The weird thing is that I’ve had five consecutive scrubs fail with multiple errors on the dozer HDD pool. The Docker applications and VMs were mostly just idling during the scrubs.
Looking at the logs/system uptime again, the first four scrubs were run without restarting the TrueNAS machine. The fifth run was done after restarting the TrueNAS machine, and yet it still failed.

After that, I unset the Incus pool to deactivate everything VM-related and restarted the machine again; that scrub ran with no errors detected.

I’ve also set the pool back to “ssd” for Incus without starting any VMs, and again no errors were detected. Another run with just the Windows Server VM also detected no issues. Lastly, one with both Windows VMs started also ran successfully.

The only change in configuration between the initial failures and the successful runs now was that I set the Incus/VM config from “Automatic” network to “br0”. The individual VMs were already set to use “br0” before, though. And this really shouldn’t affect anything.
The other change was that I unset and re-set the pool once, keeping existing VMs. The “ssd” pool was first set in RC.1, but I’ve already reimported the existing zvols into managed Incus volumes before all the failed scrubs on the release version.

So, I’m at a bit of a loss with what caused this issue. You might be right, maybe it is bad RAM after all.
I guess I’ll take the motherboard + CPU + RAM out of that system soon to run a memtest for like a week, and replace the hardware with something else in the meantime. I’ll keep this thread updated. Thanks for the interest!