VM image zvol empty after in-VM I/O errors

Got a weird one where some of my VMs suddenly have empty zvols. The zpool is a mirror of SSD drives, and I recently noticed some I/O errors in my VMs that forced a reboot. Most rebooted cleanly, but a couple came back with what appear to be corrupted zvols.

root@truenas:/dev/zvol/ssd-storage/VMs# blkid  /dev/zvol/ssd-storage/VMs/VNC
root@truenas:/dev/zvol/ssd-storage/VMs# ls -l /dev/zvol/ssd-storage/VMs/VNC
lrwxrwxrwx 1 root root 13 Apr  7 11:06 /dev/zvol/ssd-storage/VMs/VNC -> ../../../zd64
root@truenas:/dev/zvol/ssd-storage/VMs# zfs list -t volume | grep VNC
ssd-storage/VMs/VNC              1.23G  64.8G  1.23G  -

I do have a recent backup, but this all looks a bit odd, as it seems I’ve lost the boot blocks. All of these VMs are backed by sparse zvols; this particular one is an ancient 4GB VM disk.
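
For anyone wanting to check the same symptom, dumping the first sector of the zvol makes it pretty obvious when the boot blocks/partition table have gone. Something along these lines (an intact disk would show non-zero bytes and the 55 aa signature at the end of sector 0):

dd if=/dev/zvol/ssd-storage/VMs/VNC bs=512 count=1 status=none | od -Ax -tx1z | head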

Nothing in the TrueNAS system logs shows any issues, and a resilver of the ZFS pool reported no errors.
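
For completeness, these are roughly the checks I mean on the pool side (pool name as above):

zpool status -v ssd-storage    # per-device READ/WRITE/CKSUM counters and the last scan result
zpool scrub ssd-storage        # re-verify every block against its checksum, just to be sure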

I did a dd dump of the zvol and ran testdisk against it, which showed that the partition table had become corrupted. After using testdisk to fix the partition table, I was able to recover one of the VMs.
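
Roughly what the recovery looked like, in case anyone hits the same thing. The scratch path is just an example of somewhere with free space; working on a dd image keeps testdisk away from the live zvol until you're happy with the result:

# Raw copy of the (4GB) zvol to an image file
dd if=/dev/zvol/ssd-storage/VMs/VNC of=/mnt/scratch/vnc.img bs=1M status=progress

# Let testdisk scan the image and rebuild/write the partition table
testdisk /mnt/scratch/vnc.img

# Copy the repaired image back over the zvol
dd if=/mnt/scratch/vnc.img of=/dev/zvol/ssd-storage/VMs/VNC bs=1M status=progress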

Can you provide some basic information: your TrueNAS version and system hardware specs? Also, to be clear, it sounds like you have been running this VM for a long time, but how long is that in practice: days, weeks, months? If it previously ran without issue, what version of TrueNAS was that, and did you recently upgrade to a newer TrueNAS version?

What I think we need is for you to try to identify whether this can be pinned on a specific event, assuming the VM had been running for months. If it were only days, that is not much time to call it stable. But based on your post, it sounds like you should be able to provide some good information here.

Software version - TrueNAS-SCALE-23.10.2

Hardware

  • CPU - AMD Ryzen 5600G

  • MB - Gigabyte X570S UD ATX Rev 1.0

  • RAM - Crucial Pro 32GB (2x16)

  • SSDs

    • ADATA SU650 256GB
    • TEAMGROUP T-Force Vulcan Z 256GB

The SSDs are in a basic ZFS mirror with the following datasets:

  • VMs
  • ix-applications
  • ISOs

I have compression enabled, but no de-duplication etc.
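
For reference, that's easy to confirm from the shell:

zfs get compression,dedup ssd-storage/VMs    # compression on, dedup off in my case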

The VMs were migrated off my old KVM-based server and are sparse zvols. Most have been running for 5+ weeks with no errors or stability issues.
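
For context, the migration amounted to creating a sparse zvol and copying the old disk image onto it raw; something along these lines, with the source path and qcow2 image just illustrative (sized to match the 4GB VNC disk above):

# Sparse (-s) zvol sized to match the old guest disk
zfs create -s -V 4G ssd-storage/VMs/VNC

# Convert the old KVM image straight onto the zvol as raw
qemu-img convert -O raw /mnt/old-kvm/vnc.qcow2 /dev/zvol/ssd-storage/VMs/VNC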

I was copying some additional OS ISOs into the ISOs dataset when my VMs started to show I/O errors, which I’ve never seen with a KVM VM before.

May 01 17:30:53 bastion kernel: ata3.00: exception Emask 0x0 SAct 0x1c700 SErr 0x0 action 0x6 frozen
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/08:40:20:7e:d5/00:00:00:00:00/40 tag 8 ncq dma 4096 out
                                                               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/08:48:f0:86:d5/00:00:00:00:00/40 tag 9 ncq dma 4096 out
                                                               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/10:50:70:87:d5/00:00:00:00:00/40 tag 10 ncq dma 8192 out
                                                               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/01:70:01:f0:d2/00:00:00:00:00/40 tag 14 ncq dma 512 out
                                                               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/08:78:98:06:d5/00:00:00:00:00/40 tag 15 ncq dma 4096 out
                                                               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 01 17:30:53 bastion kernel: ata3.00: cmd 61/0b:80:1f:66:a7/00:00:00:00:00/40 tag 16 ncq dma 5632 out
                                                               res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 01 17:30:53 bastion kernel: ata3.00: status: { DRDY }
May 01 17:30:53 bastion kernel: ata3: hard resetting link
May 01 17:30:59 bastion kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 01 17:30:59 bastion kernel: ata3.00: configured for UDMA/100
May 01 17:30:59 bastion kernel: ata3.00: device reported invalid CHS sector 0
May 01 17:30:59 bastion kernel: ata3.00: device reported invalid CHS sector 0
May 01 17:30:59 bastion kernel: ata3.00: device reported invalid CHS sector 0
May 01 17:30:59 bastion kernel: ata3.00: device reported invalid CHS sector 0
May 01 17:30:59 bastion kernel: ata3.00: device reported invalid CHS sector 0
May 01 17:30:59 bastion kernel: ata3: EH complete

and

May  1 17:29:19 wiki kernel: [2096472.924979] ata3.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
May  1 17:29:19 wiki kernel: [2096472.926451] ata3.00: failed command: READ FPDMA QUEUED
May  1 17:29:19 wiki kernel: [2096472.927441] ata3.00: cmd 60/08:00:2f:a2:46/00:00:01:00:00/40 tag 0 ncq dma 4096 in
May  1 17:29:19 wiki kernel: [2096472.927441]          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  1 17:29:19 wiki kernel: [2096472.930308] ata3.00: status: { DRDY }
May  1 17:29:19 wiki kernel: [2096472.931033] ata3.00: failed command: READ FPDMA QUEUED
May  1 17:29:19 wiki kernel: [2096472.932018] ata3.00: cmd 60/08:08:f7:a2:46/00:00:01:00:00/40 tag 1 ncq dma 4096 in
May  1 17:29:19 wiki kernel: [2096472.932018]          res 40/00:6e:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  1 17:29:19 wiki kernel: [2096472.934824] ata3.00: status: { DRDY }

Several of the VMs simply rebooted and came up OK, but two VMs no longer had partition tables on their zvols. I was able to recover the partition tables, and the rest of each disk was repaired on boot. There is nothing obvious in the logs on the TrueNAS server around that time that links to these issues.
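
Sanity-checking a repaired zvol before booting the VM again is straightforward, e.g.:

blkid /dev/zvol/ssd-storage/VMs/VNC      # should report a filesystem/partition type again instead of nothing
fdisk -l /dev/zvol/ssd-storage/VMs/VNC   # and the partition layout should be visible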

I’m concerned that this might have been caused by I/O starvation while the ISO images were syncing to disk. Initial testing of the ADATA SU650 shows it doesn’t keep up with the other SSD under sustained writes, but I wasn’t moving a huge amount of data around.
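
For anyone wanting to reproduce the comparison, this is the rough shape of the test: generate a sustained write onto the pool and watch per-device throughput on the mirror. The test file location and size are only examples, and this assumes fio is available:

# Sustained sequential write load on the pool
fio --name=seqwrite --directory=/mnt/ssd-storage/VMs --rw=write --bs=1M \
    --size=8G --end_fsync=1 --numjobs=1

# In another shell: per-disk throughput; a lagging SSD shows up as
# consistently lower write bandwidth on one side of the mirror
zpool iostat -v ssd-storage 5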

I’ve now moved the ISOs dataset onto my main 6-drive raidz2 pool to try to avoid any future issues, and taken fresh snapshots of my VMs.
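
The move and the snapshots were nothing fancy; roughly the following, where "tank" is just a placeholder for the raidz2 pool:

# Restore point for the VM zvols
zfs snapshot -r ssd-storage/VMs@post-recovery

# Replicate the ISOs dataset to the raidz2 pool, then drop the original
zfs snapshot ssd-storage/ISOs@migrate
zfs send ssd-storage/ISOs@migrate | zfs receive tank/ISOs
zfs destroy -r ssd-storage/ISOs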

I also took a look at zpool events on the affected pool, and nothing was reported around the timeframe of the data loss.
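
For reference, the commands I mean are along the lines of:

zpool events -v ssd-storage   # verbose event log for the pool
zpool history -i ssd-storage  # internal pool history, in case anything administrative lined up with the incident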