Help recovering zfs pool after VM passthrough on proxmox

I passed two disks through to a VM running on Proxmox and ended up corrupting the pool on them.

If anyone else wants to virtualise TrueNAS using Proxmox, run these commands on the Proxmox hypervisor first to prevent concurrent access to the devices you are passing through:

systemctl disable --now zfs-import-scan.service
systemctl disable --now zfs-import-cache.service
reboot
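
You can verify both are off afterwards with something like:

systemctl is-enabled zfs-import-scan.service zfs-import-cache.service
systemctl list-units --type=service | grep zfs-import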

Can anyone help recover the pool?

This is the error I get:

# zpool import datatank
cannot import 'datatank': one or more devices is currently unavailable

Here are the devices that were in the pool (in reality I used the by-id path, not sdX, but this is shorter to show):

# lsblk -o name,size,fstype,label,model,serial,mountpoint|grep 12.7
sdc                12.7T                                  WDC WD140EFFX-68VBXN0 Z2KL9X12        
└─sdc2             12.7T zfs_member        datatank                                              
sdd                12.7T                                  WDC WD140EFFX-68VBXN0 Z2KNLX12        
└─sdd2             12.7T zfs_member        datatank                                              
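
For anyone following along, the stable by-id paths can be matched to the sdX names with something like:

ls -l /dev/disk/by-id/ | grep -E 'sd[cd]'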

Here are the headers:

# zdb -l /dev/sdc2
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    name: 'datatank'
    state: 0
    txg: 15493413
    pool_guid: 16911827047402167892
    errata: 0
    hostid: 383395800
    hostname: 'datatank'
    top_guid: 11077806832579153414
    guid: 13180809383598811762
    vdev_children: 3
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 11077806832579153414
        whole_disk: 0
        metaslab_array: 65
        metaslab_shift: 34
        ashift: 12
        asize: 13998367178752
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8569546290774946668
            path: '/dev/sdb'
            devid: 'scsi-0QEMU_QEMU_HARDDISK_drive-scsi2'
            phys_path: 'pci-0000:01:03.0-scsi-0:0:0:2'
            whole_disk: 1
            DTL: 21406
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 13180809383598811762
            path: '/dev/sda'
            devid: 'scsi-0QEMU_QEMU_HARDDISK_drive-scsi1'
            phys_path: 'pci-0000:01:02.0-scsi-0:0:0:1'
            vdev_enc_sysfs_path: '/sys/class/enclosure/6:0:0:0/Slot 05'
            whole_disk: 1
            DTL: 9036
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.delphix:device_removal
        com.klarasystems:vdev_zaps_v2
    labels = 0 1 2 3 

# zdb -l /dev/sdd2
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    name: 'datatank'
    state: 0
    txg: 15493413
    pool_guid: 16911827047402167892
    errata: 0
    hostid: 383395800
    hostname: 'datatank'
    top_guid: 11077806832579153414
    guid: 8569546290774946668
    vdev_children: 3
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 11077806832579153414
        whole_disk: 0
        metaslab_array: 65
        metaslab_shift: 34
        ashift: 12
        asize: 13998367178752
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8569546290774946668
            path: '/dev/sdb'
            devid: 'scsi-0QEMU_QEMU_HARDDISK_drive-scsi2'
            phys_path: 'pci-0000:01:03.0-scsi-0:0:0:2'
            whole_disk: 1
            DTL: 21406
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 13180809383598811762
            path: '/dev/sda'
            devid: 'scsi-0QEMU_QEMU_HARDDISK_drive-scsi1'
            phys_path: 'pci-0000:01:02.0-scsi-0:0:0:1'
            vdev_enc_sysfs_path: '/sys/class/enclosure/6:0:0:0/Slot 05'
            whole_disk: 1
            DTL: 9036
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.delphix:device_removal
        com.klarasystems:vdev_zaps_v2
    labels = 0 1 2 3 

What do sudo zpool status -v and sudo zpool import return?

(I have since started and switched to using the TrueNAS VM, so different devices will show)

# zpool status -v
  pool: boot-pool
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdc3      ONLINE       0     0     0

errors: No known data errors

# zpool import
  pool: datatank
    id: 16911827047402167892
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        datatank      ONLINE
          mirror-0    ONLINE
            sdb2      ONLINE
            sda2      ONLINE
          indirect-1  ONLINE
          indirect-2  ONLINE

# zpool import datatank
cannot import 'datatank': pool was previously in use from another system.
Last accessed by proxmox (hostid=fc6c0867) at Tue Oct  7 22:31:24 2025
The pool can be imported, use 'zpool import -f' to import the pool.

# zpool import datatank -f
cannot import 'datatank': one or more devices is currently unavailable

That, and pass through the drive controller and blacklist it on the host.
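
A rough sketch of what that looks like on the Proxmox host, assuming the pool disks hang off a dedicated HBA (the PCI ID and driver name below are placeholders; substitute whatever lspci reports for your controller, and never do this to the controller your boot disk is on):

lspci -nn | grep -i -e sas -e sata
echo "options vfio-pci ids=1000:0072" >> /etc/modprobe.d/vfio.conf
echo "softdep mpt3sas pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
update-initramfs -u -k all
reboot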

Time to call Batman @HoneyBadger


Oof. Yeah, that’s an unpleasant situation.

Normally when I see a double-mounted pool it throws back “insufficient replicas/corrupted data” rather than “one or more devices is currently unavailable”. I do see some indirect devices there as well, so did you ever do a vdev removal (e.g. adding more disks including special vdevs, switching between a virtual and a passthrough disk, etc.)? IIRC there was a brief point in time where “block clone exists + vdev removal happens” could cause problems.

Can I get the output of the last hundred or so lines of /proc/spl/kstat/zfs/dbgmsg immediately after a failed import?
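
Something along these lines should do it:

tail -n 100 /proc/spl/kstat/zfs/dbgmsg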


Regarding the indirect devices: I did have another mirror vdev in the pool at one point, could it be that?

dbgmsg:

1760716386   ffff8f3db6d2c8c0 vdev.c:183:vdev_dbgmsg(): disk vdev '/dev/sda2': probe done, cant_read=0 cant_write
1760716386   ffff8f3db6d2c8c0 vdev.c:183:vdev_dbgmsg(): disk vdev '/dev/sdb2': probe done, cant_read=0 cant_write
1760716386   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading checkpoint t
1760716386   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading indirect vdev metada
1760716387   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Checking feature fla
1760716387   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading special MOS directori
1760716387   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading propertie
1760716387   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading AUX vdevs
1760716387   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading vdev metada
1760716389   ffff8f3eaa22e100 spa_misc.c:429:spa_load_note(): spa_load(datatank, config trusted): Read 663 log space maps (663 total blocks - blksz = 131072 bytes) in 2244 
1760716389   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading dedup tabl
1760716389   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Loading BRT
1760716389   ffff8f3eaa22e100 spa_misc.c:2376:spa_import_progress_set_notes_impl(): 'datatank' Verifying Log Devic
1760716390   ffff8f3eaa22e100 spa_misc.c:415:spa_load_failed(): spa_load(datatank, config trusted): FAILED: spa_check_logs fail
1760716390   ffff8f3eaa22e100 spa_misc.c:429:spa_load_note(): spa_load(datatank, config trusted): UNLOADING

Yep, that would cause indirect devices. Hopefully this isn’t it.

spa_check_logs might be a better place to fail, though. That usually means the last uncommitted transaction group was corrupted (probably via the double-mount), and we might be able to rewind a little bit.

Try the import with -fF. Both letters are included and case matters here: -f is “force import of a pool mounted by another host” and -F is “force rewind”.

# zpool import datatank -fF
cannot import 'datatank': insufficient replicas
        Destroy and re-create the pool from
        a backup source.

I remember getting this error while testing zfs and pool settings inside a VM. If I recall, I had created several virtual test disks in VMWare and put them into varying configurations within the virtualized TrueNAS. When I got the error it was because I had messed with one of the disk files outside of the VM on the host machine. Nothing I did would fix it. I even had backups of those test disks (should some issue occur) so I didn’t have to recreate them, but shutting down and copying over the backup disk did not fix it. I eventually had to start over and create new disks, new pools, etc.


@HoneyBadger should I try importing half the mirror?

Edit: in case it’s a problem with the metadata being out of sync

I’m tempted to say that you might need more aggressive rewinding here, with the risk of potentially losing some recent data.

Importing with -fFX may do it, but if it doesn’t you may even need to do the following:

Warning - Dangerous Tunables Inside
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata

in order to disable the default behavior of “immediately bail out on corrupted metadata” and tell it to keep forging ahead regardless.
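
Put together: run the import with the rewind flags, then restore the default value of 1 to both tunables afterwards (a sketch, assuming the same pool name as above):

sudo zpool import -fFX datatank
echo 1 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data
echo 1 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata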


It’s causing the VM to reboot. dmesg shows nothing. I will try at the hypervisor level.

@HoneyBadger After a day I could not see any I/O (with iotop), and there was no output or any messages. I rebooted to try again.

Logs show a kernel message:

Oct 22 14:01:08 proxmox kernel: BUG: kernel NULL pointer dereference, address: 00000000000000a8
Oct 22 14:01:08 proxmox kernel: #PF: supervisor write access in kernel mode
Oct 22 14:01:08 proxmox kernel: #PF: error_code(0x0002) - not-present page
Oct 22 14:01:08 proxmox kernel: PGD 0 P4D 0 
Oct 22 14:01:08 proxmox kernel: Oops: Oops: 0002 [#1] PREEMPT SMP PTI
Oct 22 14:01:08 proxmox kernel: CPU: 3 UID: 0 PID: 2779 Comm: dmu_objset_find Tainted: P          IO       6.14.11-4-pve #1
Oct 22 14:01:08 proxmox kernel: Tainted: [P]=PROPRIETARY_MODULE, [I]=FIRMWARE_WORKAROUND, [O]=OOT_MODULE
Oct 22 14:01:08 proxmox kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H170M-ITX/ac, BIOS P7.10 10/26/2016
Oct 22 14:01:08 proxmox kernel: RIP: 0010:mutex_lock+0x1c/0x50
Oct 22 14:01:08 proxmox kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 ee d6 ff ff 65 48 8b 15 d6 eb >
Oct 22 14:01:08 proxmox kernel: RSP: 0018:ffffcbc7084abb80 EFLAGS: 00010246
Oct 22 14:01:08 proxmox kernel: RAX: 0000000000000000 RBX: 00000000000000a8 RCX: 0000000000000000
Oct 22 14:01:08 proxmox kernel: RDX: ffff8badc9bd8000 RSI: 0000000000000000 RDI: 0000000000000000
Oct 22 14:01:08 proxmox kernel: RBP: ffffcbc7084abb88 R08: 0000000000000000 R09: 0000000000000000
Oct 22 14:01:08 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bae63cfd198
Oct 22 14:01:08 proxmox kernel: R13: ffff8bae55bca000 R14: ffff8badc9a2b000 R15: ffff8bae63cfd000
Oct 22 14:01:08 proxmox kernel: FS:  0000000000000000(0000) GS:ffff8bb125180000(0000) knlGS:0000000000000000
Oct 22 14:01:08 proxmox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 14:01:08 proxmox kernel: CR2: 00000000000000a8 CR3: 00000001ede38006 CR4: 00000000003726f0
Oct 22 14:01:08 proxmox kernel: Call Trace:
Oct 22 14:01:08 proxmox kernel:  <TASK>
Oct 22 14:01:08 proxmox kernel:  dmu_buf_rele+0x1e/0x50 [zfs]
Oct 22 14:01:08 proxmox kernel:  dsl_deadlist_close+0xdf/0x180 [zfs]
Oct 22 14:01:08 proxmox kernel:  dsl_dataset_hold_obj+0x86d/0xa80 [zfs]
Oct 22 14:01:08 proxmox kernel:  dmu_objset_find_dp_impl+0x12e/0x3f0 [zfs]
Oct 22 14:01:08 proxmox kernel:  dmu_objset_find_dp_cb+0x2a/0x50 [zfs]
Oct 22 14:01:08 proxmox kernel:  taskq_thread+0x339/0x6d0 [spl]
Oct 22 14:01:08 proxmox kernel:  ? __pfx_default_wake_function+0x10/0x10
Oct 22 14:01:08 proxmox kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Oct 22 14:01:08 proxmox kernel:  kthread+0xf9/0x230
Oct 22 14:01:08 proxmox kernel:  ? __pfx_kthread+0x10/0x10

Still no I/O.
I will wait another day.

That’s unfortunately not looking good. You applied the Forbidden Tunables in the spoiler text above, I assume?

I’m looking at other less-used pool toggles as well in order to prevent this in the future, but it’s the same root cause of Proxmox and TrueNAS both speaking ZFS.

I did.

I am surprised that some kind of SCSI reservation didn’t prevent this.

Since this is a mirror and not something more complicated, is there another way of getting at the data?

You can try Klennet.

It’s Windows-only. Scanning is free; recovery is 400 USD.


That’s the problem with layering virtualization … it either masks features off or lets you do things that you shouldn’t.

root@truenas[/home/truenas_admin]# sg_persist --read-reservation /dev/sdb 
  QEMU      QEMU HARDDISK     2.5+
  Peripheral device type: disk
PR in (Read reservation): command not supported
sg_persist failed: Illegal request, Invalid opcode

Unfortunately no. The same logical corruption of the metadata would have been written to both disks because it was being inserted from upstream as a valid write from each individual host.

But I digress - and I will make this a separate post as well.

Stopping this in the Future

I’d like to introduce the use of a pool property to hopefully mitigate some of these events in the future: ZFS multihost.

This setting will cause ZFS to periodically make “heartbeat” writes to the pool, signifying it as being in use. There are drawbacks - potentially very long pool import times while it’s checking all member disks for those heartbeat writes, and scenarios where a pool that should import on reboot doesn’t.

However, it prevents a number of scenarios where you don’t want that pool to import.
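
For reference, a rough sketch of turning it on, run from the host that currently has the pool imported (zgenhostid is only needed if /etc/hostid does not exist yet):

zgenhostid
zpool set multihost=on vpool
zpool get multihost vpool

With multihost active, an import attempt from the second host now looks like this: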

root@pve:~# zpool import
  pool: vpool
    id: 386268640321606023
 state: UNAVAIL
status: The pool is currently imported by another system.
action: The pool must be exported from truenas (hostid=3c6fd28e)
        before it can be safely imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:

        vpool                                   UNAVAIL  currently in use
          389a58b4-7260-4811-a156-58169089473d  ONLINE

Note the explicit “currently imported” error and “currently in use” status on the pool itself.

Even when attempting to use the -f force flag on zpool import, ZFS will refuse.

root@pve:~# zpool import vpool -f
cannot import 'vpool': pool is imported on host 'truenas' (hostid=3c6fd28e).
Export the pool on the other system, then run 'zpool import'.

Please note: This is not a substitute for PCIe passthrough and storage controller isolation - it’s just putting a restriction on the zpool import command. There are still plenty of ways to obliterate the data via an errant host command or accidentally passing the same disk to a second VM.


Could you post a link here to some information about this? I’m using disk passthrough right now, and I’d prefer to do something better.

I see you found it already, but the link for the others is
