Back when I was on FreeNAS, I made the mistake of sharing out the root of my zpool via NFS. This became more of a problem after migrating to TrueNAS SCALE and starting to use Apps. I kept kicking the can down the road, but after upgrading to the 24.04 release I finally had to fix my data layout.
I created a new dataset for the NFS exports and started migrating data into it. Moving data out of the root dataset was extremely slow, and with my limited understanding of ZFS I'm still not sure whether moves between datasets in the same zpool are expected to take that long.
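For reference, the migration looked roughly like this (pool and dataset names here are examples, not my exact layout); my understanding is that mv across datasets copies and then deletes rather than renaming, which may be part of why it was so slow:

# create a dedicated dataset to hold the NFS exports (example names)
zfs create tank/nfs-exports
# move data out of the pool root into the new dataset; since this crosses
# dataset boundaries, mv has to copy the data and then delete the source
mv /mnt/tank/old-share /mnt/tank/nfs-exports/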
I started running into problems with some of the move operations, along with txg_sync call traces in the dmesg output:
[ 605.310876] INFO: task txg_sync:5700 blocked for more than 241 seconds.
[ 605.311203] Tainted: P OE 6.6.32-production+truenas #1
[ 605.311545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.311873] task:txg_sync state:D stack:0 pid:5700 ppid:2 flags:0x00004000
[ 605.312215] Call Trace:
[ 605.312549] <TASK>
[ 605.312890] __schedule+0x349/0x950
[ 605.313232] schedule+0x5b/0xa0
[ 605.313534] schedule_timeout+0x98/0x160
[ 605.313837] ? __pfx_process_timeout+0x10/0x10
[ 605.314138] io_schedule_timeout+0x50/0x80
[ 605.314441] __cv_timedwait_common+0x12a/0x160 [spl]
[ 605.314761] ? __pfx_autoremove_wake_function+0x10/0x10
[ 605.315066] __cv_timedwait_io+0x19/0x20 [spl]
[ 605.315387] zio_wait+0x10f/0x220 [zfs]
[ 605.316012] spa_sync_frees+0x3a/0x70 [zfs]
[ 605.316694] spa_sync_iterate_to_convergence+0x10a/0x200 [zfs]
[ 605.317346] spa_sync+0x30a/0x5e0 [zfs]
[ 605.317968] txg_sync_thread+0x1ec/0x270 [zfs]
[ 605.318544] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 605.319156] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 605.319486] thread_generic_wrapper+0x5e/0x70 [spl]
[ 605.319820] kthread+0xe8/0x120
[ 605.320132] ? __pfx_kthread+0x10/0x10
[ 605.320439] ret_from_fork+0x34/0x50
[ 605.320751] ? __pfx_kthread+0x10/0x10
[ 605.321067] ret_from_fork_asm+0x1b/0x30
[ 605.321383] </TASK>
This seems to happen more frequently when I move or delete large amounts of data. middlewared would always lock up, but if the change was small the system could sometimes recover on its own; other times a hard reboot was required. Once I noticed this pattern, I stopped performing large changes, since I didn't want to risk damaging the filesystem with dirty shutdowns.
I found GitHub issue 9130, which looked like a possible match. People there reported a wide variety of 'solutions', ranging from replacing failed hardware (a disk or the HBA) to adding cache devices to the zpool. So far I've been unable to identify any failed hardware in my server at all (the checks I ran are listed after the trace below). I did try swapping the HBA for a spare with the same chipset that I had on hand, and now my new NFS dataset won't even mount; I'm again seeing call traces in dmesg:
[ 726.112956] INFO: task zpool:5999 blocked for more than 241 seconds.
[ 726.113504] Tainted: P OE 6.6.32-production+truenas #1
[ 726.113987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.114475] task:zpool state:D stack:0 pid:5999 ppid:4864 flags:0x00004002
[ 726.114982] Call Trace:
[ 726.115480] <TASK>
[ 726.115978] __schedule+0x349/0x950
[ 726.116489] schedule+0x5b/0xa0
[ 726.117004] io_schedule+0x46/0x70
[ 726.117504] cv_wait_common+0xaa/0x130 [spl]
[ 726.118039] ? __pfx_autoremove_wake_function+0x10/0x10
[ 726.118550] txg_wait_synced_impl+0xc0/0x110 [zfs]
[ 726.119716] txg_wait_synced+0x10/0x40 [zfs]
[ 726.120789] zil_replay+0x107/0x150 [zfs]
[ 726.121854] zfsvfs_setup+0x226/0x260 [zfs]
[ 726.122922] zfs_domount+0x3b2/0x410 [zfs]
[ 726.123951] zpl_fill_super+0x34/0x60 [zfs]
[ 726.124983] zpl_mount_impl+0x273/0x2d0 [zfs]
[ 726.126013] zpl_mount+0x2f/0x90 [zfs]
[ 726.127002] legacy_get_tree+0x2b/0x50
[ 726.127468] vfs_get_tree+0x29/0xf0
[ 726.127921] path_mount+0x493/0xab0
[ 726.128367] __x64_sys_mount+0x11a/0x150
[ 726.128802] do_syscall_64+0x59/0xb0
[ 726.129233] ? select_task_rq_fair+0x17e/0x1710
[ 726.129667] ? sched_clock+0x10/0x30
[ 726.130099] ? sched_clock_cpu+0xf/0x190
[ 726.130527] ? sched_clock+0x10/0x30
[ 726.130954] ? sched_clock_cpu+0xf/0x190
[ 726.131377] ? __smp_call_single_queue+0xad/0x120
[ 726.131808] ? ttwu_queue_wakelist+0xe6/0x100
[ 726.132241] ? try_to_wake_up+0x205/0x580
[ 726.132668] ? wake_up_q+0x4e/0x90
[ 726.133092] ? futex_wake+0x155/0x180
[ 726.133522] ? do_futex+0xde/0x1b0
[ 726.133908] ? __x64_sys_futex+0x92/0x1d0
[ 726.134293] ? handle_mm_fault+0xa2/0x370
[ 726.134683] ? syscall_exit_to_user_mode+0x22/0x40
[ 726.135071] ? do_syscall_64+0x65/0xb0
[ 726.135448] ? exc_page_fault+0x77/0x170
[ 726.135825] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 726.136232] RIP: 0033:0x7f2b882a5d3a
[ 726.136605] RSP: 002b:00007f2b85abb868 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 726.137002] RAX: ffffffffffffffda RBX: 00007f2b85abeaf0 RCX: 00007f2b882a5d3a
[ 726.137373] RDX: 00007f2b8857b079 RSI: 00007f2b85abfb50 RDI: 000055c241c0b470
[ 726.137704] RBP: 00007f2b85abd930 R08: 00007f2b85abc8e0 R09: 0000000000000073
[ 726.138037] R10: 0000000000000400 R11: 0000000000000246 R12: 000055c241c0b460
[ 726.138367] R13: 00007f2b85abfb50 R14: 000055c241c0b470 R15: 00007f2b85abc8e0
[ 726.138697] </TASK>
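For what it's worth, the hardware checks I ran (before and after swapping the HBA) were along these lines; device and pool names are examples, and none of them turned up any errors:

zpool status -v tank                       # no read/write/checksum errors, no degraded vdevs
smartctl -a /dev/sda                       # repeated for each of the 12 disks; no reallocated/pending sectors or SMART failures
dmesg | grep -iE 'mpt3sas|sd[a-z].*error'  # nothing beyond the hung task traces shown above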
At this point I'm not sure which direction to go to regain access to my partially migrated data, let alone resolve the original issue. Any advice would be greatly appreciated.
System details:
Version: Dragonfish-24.04.2
CPU: Xeon E5-2618L v4
Memory: 256GB RAM
HBA: LSI SAS 3008
vdev: RAIDZ2 - 12x 14TB Toshiba MG07ACA14TE disks
Thank you!