First off, apologies if I make any assumptions or forget to include any info in this post. I am fairly new to ZFS.
I have an install of TrueNAS SCALE 25.04 running on Proxmox 9.1.2 with an LSI 9305-16i attaching 13 Seagate IronWolf PRO 16TB drives. The LSI HBA is passed through to the VM and Proxmox has no control over the disks. The server is running ECC RAM (that will be relevant later).
While I was trying to delete about a TB of data from the pool over NFS, the VM running the command froze, as did the TrueNAS VM. I was forced to hard shut down the TrueNAS VM. Upon attempting to boot it again, I began to get errors about ix-etc.service failing to start, as well as ix-zfs.service hanging on start until the 15 minute timeout. I also see ix-netif.service fail to start. If I ignore these errors and allow the VM to start as normal, most ZFS-related things are broken in the UI and, notably, zpool commands via the console hang.
Searching about this I discovered the following thread:
Based on what the OP here found I was able to get TrueNAS to start in a timely manner by disabling ix-zfs.service. A normal import of the pool here also hangs with no status and forces a reboot. However, I was able to import the pool with recovery mode:
From here is where my issue diverges from the original thread. As far as I can tell in my limited knowledge, the pool is fine. Unlike the OP from that thread I am using ECC RAM and see no reports of metadata corruption. However, I figure it is now best to ask for advice on what to do next. Do I scrub the pool and re-enable ix-zfs.service and see if TrueNAS will mount as expected? Should I perhaps try to mount manually after a scrub? Do I immediately migrate all the data and rebuild? I don’t want to make any further changes in recovery mode without some expert advice. If there is any missing information needed I am happy to provide it. Thanks in advance!
After doing a bit more reading it appears that it should be safe to do a scrub with the pool mounted in recovery mode, but it appears as if that stalls as well:
Checking the status of the pool shows no scrub in progress:
I’m at a total loss as to what to do to bring this pool back online correctly. Prior to this I tried importing the pool to a new TrueNAS VM as well as adjusting various Proxmox settings, as I thought there might be some PCIe passthrough issues in Proxmox 9, which I had upgraded to about a week ago. However, I hadn’t seen any issues until now. Notably, after this all started I was able to bring TrueNAS back up after rebooting the VM many times, and everything seemed to be working. After a further reboot, however, the issue came back as bad as ever. Perhaps an update to the Proxmox kernel is causing these I/O stalls?
I’ve attempted everything I can think of to address that but I’m not even sure if that is the issue. Is there a way to confirm? Is there any risk to the data in rebooting the TrueNAS VM over and over while trying to fix this? Would really like to save the pool as it’s quite large. Any help is greatly appreciated!
Something is causing txg_sync to hang which makes me think of an I/O lockup of some kind.
The fact you are seeing individual processes hang but eventually fail also makes me think ZFS isn’t panicking, which is a good sign.
When you say “a normal import of the pool here also hangs with no status and forces a reboot”, is this just the terminal session you are in or does TrueNAS become completely inaccessible (i.e. are you still able to log in via SSH or the WebUI shell)? Ideally if you can still access the system while an import is hanging we can check exactly what is happening during it.
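For anyone following along, here is a rough sketch of one way to see what is stuck while an import hangs. It assumes a second shell is still usable (SSH or the WebUI shell) and that the stock `ps` and `awk` are available. The function name `blocked_tasks` is made up for this example; it simply keeps the rows of `ps` output whose state starts with D (uninterruptible sleep, usually waiting on I/O).

```shell
# Keep only tasks in uninterruptible sleep (state D) from `ps` output.
# Expects lines of "PID STAT WCHAN COMMAND" on stdin; the header row is
# dropped automatically because "STAT" does not start with D.
blocked_tasks() {
    awk '$2 ~ /^D/ { print }'
}

# Typical use in a second shell while the import is hanging:
#   ps -eo pid,stat,wchan:32,comm | blocked_tasks
#
# As root, the kernel stack of one of the listed PIDs can then be read with:
#   cat /proc/<pid>/stack
# and the OpenZFS internal debug log on Linux with:
#   cat /proc/spl/kstat/zfs/dbgmsg
```

If txg_sync or the zpool process shows up in that list, the WCHAN column and the kernel stack should give a hint about whether it is waiting on disk I/O or on something inside ZFS.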
Edit: Seeing as the pool was not exported cleanly, ZFS will be attempting to replay the ZIL on import. You can try importing with ZIL replay disabled, just in case something has gone wrong there. It is a bit less gung ho than trying to suppress ZFS panics with zfs_recover. To disable ZIL replay:
echo 1 > /sys/module/zfs/parameters/zil_replay_disable
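A slightly more careful version of that toggle, as a sketch only: it validates the value, writes it, and reads it back so you can confirm the setting took effect. The parameter path is the one given above; the function name `set_zil_replay_disable` and the `PARAM` override variable are made up for this example. It needs to be run as root, and the setting does not persist across a reboot.

```shell
# Path of the OpenZFS module parameter (from the post above).
# PARAM can be overridden, which is only useful for trying the function out.
PARAM="${PARAM:-/sys/module/zfs/parameters/zil_replay_disable}"

set_zil_replay_disable() {
    # $1 must be 1 (skip ZIL replay) or 0 (normal behaviour)
    case "$1" in
        0|1) ;;
        *) echo "usage: set_zil_replay_disable 0|1" >&2; return 2 ;;
    esac
    # Write the new value, then print what the kernel now reports
    printf '%s\n' "$1" > "$PARAM" || return 1
    cat "$PARAM"
}

# Intended sequence (run manually, as root):
#   set_zil_replay_disable 1
#   ... attempt the import ...
#   set_zil_replay_disable 0
```

As far as I understand it, with replay disabled any synchronous writes that only existed in the ZIL are thrown away rather than applied, so this is something to turn back off once the pool is imported.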
For reference, I am running zpool import truenas-pool1 -R /mnt via the Proxmox console. The command just doesn’t seem to make any progress. Eventually the import will start returning more errors like above about blocked processes:
The WebUI remains accessible and I can still use the WebUI shell. Notably I can still mount with zpool import truenas-pool1 -R -o readonly=on even without invoking recovery mode and I can access the files just fine. It’s seemingly only when mounting with RW that I encounter issues.
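For reference, a small sketch that prints the two import variants in the usual `zpool import` argument order (options first, pool name last). The pool name and the /mnt altroot are taken from this thread; the helper name `import_cmd` is made up, and it only prints the command rather than running it, so nothing touches the pool until you paste the line yourself.

```shell
POOL="truenas-pool1"   # pool name from this thread
ALTROOT="/mnt"         # altroot used for the earlier import attempt

import_cmd() {
    # $1: "ro" prints a read-only import, anything else a read-write one
    if [ "$1" = "ro" ]; then
        echo "zpool import -o readonly=on -R $ALTROOT $POOL"
    else
        echo "zpool import -R $ALTROOT $POOL"
    fi
}

import_cmd ro
```

A read-only import is a reasonable way to copy data off before experimenting further, since nothing is written to the pool while it is mounted that way.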
Is there a log or something you’d like me to pull while this is running for diagnosis? Or should I proceed with import with ZIL replay disabled? Not sure of the implications of that.
Thanks for the help! Been pulling my hair out over this one.
edit: Progress! I let the command above run for about an hour and it did eventually mount the pool. However, since ix-zfs.service is still disabled I can’t enable NFS via the GUI, as it throws errors about dependencies that I assume that service fulfills.
After the successful import with ix-zfs.service disabled I re-enabled the service and rebooted. It took a couple of minutes, but the pool did import and all the services started as expected. Seems the issue is resolved.
Open to any theories as to why this occurred, but my guess at this time is that there was a bunch of pending transactions or something, and the normal import with ix-zfs.service was timing out. Perhaps completing it with the manual import cleared up whatever transactions were pending? Any steps to make sure this sort of thing doesn’t happen again?