Branching off my comment from the release post: TrueNAS 25.04.0 now available! - #49 by TheJulianJES
I’ll update this thread as I get more information and try some things.
TL;DR is that running Incus VMs on 25.04 seemingly causes checksum errors on a different (encrypted) HDD pool when a scrub is running there.
Shutting down the VMs and unsetting the Incus pool causes subsequent scrubs to run without any issues.
System: i7 7700k/Z270F, on-board SATA controller for 3x Seagate Exos x22 22TB HDDs in a RAIDZ(1), NVMe SSD for boot, another NVMe SSD for “ssd” pool where Incus VMs were running.
Text below is a copy of the linked comment for reference:
Upgraded from RC.1 to the release version, as soon as it was available.
I previously had system freezes on RC.1, likely related to Incus VMs: TrueNAS 25.04-RC.1 is Now Available! - #141 by TheJulianJES
I did a bit of research after that post and stumbled upon this thread: Incus VM Crashing, which I also experienced on RC.1.
After upgrading to the Fangtooth release version, I imported the existing ZVOLs into the managed Incus volumes and kinda hope that the freezing and crashing problems are resolved with that (all VMs use VirtIO-SCSI). I’ll keep monitoring obviously.
However, I’ve been getting checksum errors on my main (encrypted) HDD pool on an i7 7700k/Z270F system since upgrading from 24.10 to 25.04. It’s only 1 or 2 checksum errors per scrub, sometimes spread across multiple disks.
I did five(!) separate full scrubs with system reboots in between and got checksum errors every time. I obviously suspected the disks first, but they seem to check out fine. A full/long SMART test also completed without issues. It was also unlikely to have multiple disks “failing” at the same time with widely differing production dates.
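For anyone who wants to compare per-disk counts between scrubs, here is a rough Python sketch that pulls the non-zero CKSUM values out of saved `zpool status` text. It assumes the usual `NAME STATE READ WRITE CKSUM` table layout; the function name and the device names in the usage note are made up for illustration, not taken from my system.

```python
def checksum_errors(status_text):
    """Map device name -> CKSUM column value for rows with a non-zero count.

    Assumes the text contains the usual `zpool status` config table,
    starting with a "NAME STATE READ WRITE CKSUM" header line.
    """
    errors = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            # A blank line ends the config table.
            in_config = False
            continue
        if len(fields) < 5:
            # Rows without counters (e.g. section labels) are skipped.
            continue
        name, cksum = fields[0], fields[4]
        if cksum != "0":
            # Kept as a string, since large counts may be abbreviated.
            errors[name] = cksum
    return errors
```

Feeding it the saved output of each scrub (for example a table where hypothetical disks `sda` and `sdc` show 1 and 2 in the CKSUM column) gives a small dict per run, which makes it easy to see whether the errors stick to one disk or move around.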
Since the system doesn’t have ECC RAM, I ran a memory test for multiple hours, which checked out fine as well.
Later, I also read Frequent Checksum Errors During Scrub on ZFS Pool · Issue #16452 · openzfs/zfs · GitHub (about an AMD CPU) and someone mentioned that VMs can sometimes impact ZFS checksum calculation on “flawed” hardware.
Finally, I shut down both of my Windows Incus VMs that were running on a separate (encrypted) SSD pool, unset the Incus/VM pool to completely disable that part of TrueNAS, and did another reboot of the machine.
Now, a sixth scrub is about 70% done and I’ve got no checksum errors so far, whilst I’ve always had some before with 25.04 and the VMs running.
(24.10 was fine since release with the same VMs and pool, never any issues during scrubs.)
Digging into the checksum errors, the same zio_objset and zio_object were present across different scrubs: the same part of encryption metadata on a particular dataset, written/created months ago according to ZDB.
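To check for repeats like this without eyeballing it, something along these lines should work. It is only a sketch: it assumes the error details were saved as text with `key = value` lines (the layout `zpool events -v` prints), one dump per scrub, and the helper names are my own.

```python
import re
from collections import Counter

# Matches lines such as: zio_objset = 0x36
FIELD_RE = re.compile(r"^\s*(zio_objset|zio_object)\s*=\s*(\S+)\s*$")


def error_locations(events_text):
    """Return the set of (objset, object) pairs found in one text dump."""
    locations = set()
    objset = None
    for line in events_text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "zio_objset":
            # int(value, 0) accepts both hex ("0x36") and decimal ("54").
            objset = int(value, 0)
        elif objset is not None:
            # zio_object is paired with the zio_objset seen just before it.
            locations.add((objset, int(value, 0)))
            objset = None
    return locations


def recurring_locations(dumps):
    """Return {(objset, object): count} for pairs seen in more than one dump."""
    counts = Counter()
    for text in dumps:
        counts.update(error_locations(text))
    return {loc: n for loc, n in counts.items() if n > 1}
```

Passing a list with one saved dump per scrub returns only the (objset, object) pairs that showed up in more than one scrub, along with how many scrubs each appeared in.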
Other errors were present as well, but they differed a bit.
The same object being seemingly corrupt multiple times reinforced my suspicion of a hard drive issue at first, but I now feel like the CPU keeps messing up the same calculation for the checksum there, for some reason…
This is obviously a very weird issue, especially since the VMs on 24.10 seemingly weren’t affecting scrubs on an entirely different ZFS pool, but I’d be really interested to see if anyone else suddenly sees checksum errors on their pools when running Incus VMs.
Both the SSD pool where the Incus VMs were running and the main HDD pool with the issues are encrypted. Encryption seems to change how ZFS checksums are calculated compared to unencrypted pools/datasets, and thus might contribute to the issue(?)
I’ll do some more scrubs, re-enable the Incus VMs to verify the issue appears/vanishes as described, and so on. I’ll likely also try swapping the 7700k/Z270F with a Ryzen 5650G on an X570 board, keeping everything else the same, to see if the issue also manifests on a different platform.
I can’t really imagine what Incus would be doing differently to cause a weird issue like that. I’m also not aware of the i7 7700k having/causing any issues like this. If there are any more ideas on what else to test, let me know.