TL;DR: I am looking for advice on how to recreate an entire pool. To be clear, I want to create a new pool with the same files and datasets but with as little carried over from the original pool itself as possible, due to probable corruption.
That probably sounds stupid, and you’ll have to excuse the newbie mistakes; I’m learning and realising how ignorant I am about a lot of this. There is more detail below the TL;DR if you want to read on and help further.
I’ll try to keep this brief. For context, I started using TrueNAS Scale 24.04.2.2 within the last few months. I started on an ancient machine with 8GB non-ECC RAM, but after ~1 month I moved to one with ECC RAM (full specs at bottom). It worked flawlessly until December. Last month the scheduled scrub caught errors, and shortly after, the machine began boot looping from a kernel panic. After a week-long journey of ZFS parameters, scrubs, and removing the corrupted files, ZFS claims the pools are healthy.
The pool is clearly not healthy, however: any time I write more to the pool than a single file or dataset, such as creating 2 datasets, changing permissions on several datasets, or copying a 1GB file, the system reboots almost immediately without warning. On a connected monitor, it’s working and then it isn’t, with no error output. This happens with 100% consistency. I have not found anything while browsing various /var logs, but I don’t know what I am looking for.
I am not quite sure what to do. I did a fresh install of TrueNAS 24.04.2.2 and, shortly after, a fresh install of 24.10.1 with the default config. Both ran idle for any length of time, and I can read and copy files off just fine, but the moment anything is written the machine crashes. I ran extended SMART tests on all drives and they came up clean. Ran Memtest for 24 hours / 7 passes with 0 errors. Stressed my CPU with the stress tool for 2 hours without issue.
And silly me didn’t keep a backup on a second machine, as I was carrying over my bad Windows habit of relying on a drive mirror to save me from any losses. That worked for years on NTFS but clearly doesn’t work with ZFS. Lesson learned.
To get to the main point, I reckon I have to order some new drives and copy my files to a fresh pool on them, but I want to make sure I do that right so that I am…
A. not losing any files (snapshots would be an acceptable but regrettable loss),
B. not copying over any corruption from my current pool, and
C. preferably not spending hours recreating datasets.
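For anyone who wants something concrete to critique: the approach that seems to fit A through C is a recursive snapshot followed by a replicated `zfs send` piped into `zfs recv` on the new pool. Below is a minimal, untested sketch that only prints the commands and does not run them. The pool names `tank` and `newtank` and the snapshot name `migrate` are placeholders, not my real names. Note that the snapshot is itself a write to the old pool, so that step may trigger the same reboot.

```python
import shlex


def build_migration_commands(src_pool: str, dst_pool: str, snap: str = "migrate") -> list[str]:
    """Return the shell command lines for replicating src_pool into dst_pool.

    Nothing is executed here; the caller decides whether to run the lines.
    All names passed in are placeholders chosen for illustration.
    """
    snapshot = f"{src_pool}@{snap}"
    return [
        # Take a recursive snapshot of every dataset in the source pool.
        f"zfs snapshot -r {shlex.quote(snapshot)}",
        # send -R: replicate the whole dataset tree, with properties and snapshots.
        # recv -F: allow the stream to overwrite the (empty) target pool.
        # recv -u: do not mount the received datasets right away.
        f"zfs send -R {shlex.quote(snapshot)} | zfs recv -F -u {shlex.quote(dst_pool)}",
    ]


if __name__ == "__main__":
    for line in build_migration_commands("tank", "newtank"):
        print(line)
```

Printing the commands first rather than running them is deliberate, since I would want someone to sanity-check them before touching a pool that crashes on writes.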
That said, I have not totally exhausted debugging steps, as I don’t know which logs to read or which lines in those logs to look for, nor am I 110% sure that the hardware is flawless. One of the drives could theoretically be dead, or the power supply could be fucked, and so on. Unfortunately I am unsure how I would confirm any of that, so right now my focus is on the ZFS pool.
I am actively debugging this in my free time and will update if I find anything new.
My system specs are:
CPU: Intel Xeon Gold 6150
RAM: 192GB ECC RAM
Pool layout and drives:
Primary Pool:
- 1 Mirror VDEV of 2x HGST 12TB HUH721212ALE601
- 1 Mirror VDEV of 2x WD Blue 8TB WDC_WD80EAZZ-00BKLB0 (I was planning to replace these in time, I understand WD Blue is a stupid choice for a NAS)