Easiest way to recreate a pool? + Debug advice if you've got any

TL;DR I am looking for advice on how to recreate an entire pool. To be clear, I want to create a new pool with the same files and datasets, while carrying over as little of the original pool's on-disk data as possible due to probable corruption.

That probably sounds stupid, and you’ll have to excuse the newbie mistakes; I’m learning and realising how ignorant I am about a lot of this. Anyway, there is a bit more detail beyond the TL;DR if you want to read on and give further help.

I’ll try to keep this somewhat brief, but for context: I started using TrueNAS Scale 24.04.2.2 within the last few months. I started on an ancient machine with 8GB of non-ECC RAM, but after ~1 month I moved to one with ECC RAM (full specs at the bottom). It worked flawlessly until December. Last month the scheduled scrub caught errors, and shortly after the machine began boot-looping from a kernel panic. After a week-long journey of ZFS params, scrubs, and removing the corrupted files, ZFS claims the pools are healthy.

The pool is clearly not healthy, however: any time I make writes to the pool larger than a single file or dataset change, such as creating 2 datasets, making several dataset permission changes, or copying a 1GB file, the system reboots almost immediately without warning. On a connected monitor it’s working and then it isn’t, with no error output. 100% consistent. I haven’t found anything browsing the logs under /var/log, but I don’t know what I am looking for.

I am not quite sure what to do. I fresh installed TrueNAS 24.04.2.2 and shortly after fresh installed 24.10.1 with a default config. Both can sit idle indefinitely, and I can read and copy files off just fine, but the moment anything is written the machine crashes. I ran extended SMART tests on all drives and they came up clean. I ran Memtest for 24 hours / 7 passes with 0 errors, and stressed my CPU with stress for 2 hours without issue.
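For reference, the drive and CPU checks were roughly equivalent to the following from a shell (device names and the worker count here are placeholders, adjust for your system):

  smartctl -t long /dev/sdX      # start an extended SMART self-test on each drive
  smartctl -a /dev/sdX           # review the results once the test finishes
  stress --cpu 8 --timeout 7200  # 2-hour CPU stress run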

And silly me didn’t keep a second machine backup as I was carrying over my bad Windows habits of relying on a drive mirror to save me from any losses. Worked for years on NTFS but that clearly doesn’t work with ZFS, lesson learned.

To get to the main point,

I reckon I have to order some new drives and copy my files to them with a fresh pool, but I want to make sure I do that right so that I am…
A. not losing any files (snapshots would be an acceptable but regrettable loss),
B. not copying over any corruption from my current pool, and
C. preferably not spending hours recreating datasets

That said, I have not totally exhausted debugging steps as I don’t know what logs to read or what lines on those logs to look for, nor am I 110% sure that the hardware is flawless. One of the drives could theoretically be dead or the power supply could be fucked, or so on. Unfortunately I am unsure how I would confirm any of that so right now my focus is on the ZFS pool.

I am actively debugging this in my free time and will update if I find anything new.

My system specs are:

CPU: Intel Xeon Gold 6150
RAM: 192GB ECC RAM
Pool layout and drives:
Primary Pool:

  • 1 Mirror VDEV of 2x HGST 12TB HUH721212ALE601
  • 1 Mirror VDEV of 2x WD Blue 8TB WDC_WD80EAZZ-00BKLB0 (I was planning to replace these in time, I understand WD Blue is a stupid choice for a NAS)

Are you able to scrub the pool without issues? If there is corruption, doing so should fix it.

Please also post the output of zpool status -v.
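Something along these lines should do it (pool name taken from your description):

  zpool scrub Primary       # start a fresh scrub
  zpool status -v Primary   # progress, plus any files flagged with errors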

  1. It seems to me that it is only an assumption that the crashes / reboots are due to some sort of ZFS pool corruption. I think we need to confirm this from an error message or similar. Since the crash / reboot is re-creatable on demand, I would suggest:

    • Disable automatic reboot
    • Watch IPMI or attach a monitor
    • Open an SSH connection and watch dmesg
    • Be prepared to capture screenshots
    • Write to the pool and watch what happens / capture screenshots (see the command sketch after this list)
  2. If in the end you decide to copy your data: we would normally recommend ZFS replication, because it is fastest and because it replicates the datasets and keeps all the ACLs etc., but since it works at a block / snapshot level it could carry the corruption across, so we should probably avoid it here and instead manually create new datasets and then use rsync or cp to copy the data.
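As a rough sketch of the watching steps above (these are standard Linux commands; I am assuming they behave the same on SCALE):

  # keep the panic message on screen instead of rebooting automatically
  sysctl kernel.panic=0
  # follow the kernel log live over SSH, with readable timestamps
  dmesg -wT
  # then trigger a write to the pool from another session and watch both outputs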

I ran multiple scrubs up until the point where zpool status returned clean, so it should still be able to, yes. I could try running another after I attempt some of the other debug steps.

zpool status -v
  pool: Primary
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 21:06:37 with 0 errors on Thu Jan  2 20:50:40 2025
remove: Removal of vdev 1 copied 459G in 2h57m, completed on Fri Oct  4 01:53:37 2024
        812K memory used for removed device mappings
config:

        NAME                                      STATE     READ WRITE CKSUM
        Primary                                   ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            8d726519-43c6-4610-a4b9-b64af60da506  ONLINE       0     0     0
            26570496-e3a6-4a41-aaf5-ca1576b1af6f  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            faae5334-7042-4566-8971-c6779e7ee75c  ONLINE       0     0     0
            f8cd4c73-04ec-4da7-af7f-aff7a09779a3  ONLINE       0     0     0

Thanks for the reply. It’s an assumption, yes; I don’t really have any clue about the actual cause, as it seemingly began occurring randomly overnight.

I ran a copy operation while watching dmesg with watch -n 0.1 "dmesg | tail -n 40" and monitoring the physical output; neither printed anything right before rebooting. It only got through ~40MB before crapping out. Recordings of nothing printing, if anyone wants them: https://youtu.be/kKVkoR_W2mA https://youtu.be/YEs80_i-6Kg

Thanks for the copying recommendation. I assume rsync with the -aAX flags would be appropriate? And I assume I would have to set up the new pool and identical datasets before copying any data, to avoid trouble creating those datasets on top of the copied data later on.
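In other words, something like this, where NewPool and the dataset name media are just example names:

  # create the matching dataset on the new pool first
  zfs create NewPool/media
  # then copy contents preserving permissions, ACLs, and extended attributes
  rsync -aAX /mnt/Primary/media/ /mnt/NewPool/media/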

Are you perhaps running multipath?

First I’ve heard that term so probably not. It’s a fresh install of TrueNAS Scale 24.10.1.


So I tried running another scrub out of curiosity and the machine snagged and rebooted at some point during it, and after that it would no longer complete boot. It now gets stuck at Job ix-zfs.service/start running (image).

At that point I jumped to my last debugging step of using completely different hardware: I tried importing the pool via the GUI on a different machine entirely, with a 30-second-old install of TrueNAS Scale 24.04.2.5, and got the same "adding existent segment to range tree" kernel panic (image) that began this whole thing two weeks ago.

I’ve now given up on repair. I confirmed that I can still access the files after importing in readonly mode with zfs_recover and zil_replay_disable enabled, and I have now shut down the drives until new ones arrive and I can rsync everything over. Would still appreciate advice on retaining full metadata and permissions while using rsync if anyone has it; for now I plan to use rsync -aAX.
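For anyone who hits the same panic later, the readonly import was roughly this (the parameter paths are the standard OpenZFS ones; double-check them for your release):

  # enable ZFS recovery mode and skip ZIL replay before importing
  echo 1 > /sys/module/zfs/parameters/zfs_recover
  echo 1 > /sys/module/zfs/parameters/zil_replay_disable
  # import the damaged pool read-only so nothing else gets written to it
  zpool import -o readonly=on Primary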