I thought a small update, as thanks.
The pool is safe. I haven't touched the main pool yet (with all the snaps); I'm working on one of its backups (current data but no snap history). It's alive, it tolerates sequential resilver and scrub, and I'm now hard-testing some spare disks before adding them to improve safety on the backup's mirrors.
OVERVIEW OF KEY TACTICS USED:
Tricks I've found, learned, and used to improve recovery odds, for anyone else who is in this situation:
-
First, I touched the minimum I needed to. I've validated the state of the full pool, but otherwise not touched its disks. Instead I'm getting all I can off the backups, which also teaches me what to expect and how to proceed when I do tackle the full pool.
-
HDDs that have been mothballed a long time do NOT fail in ways that SMART can predict. Lubricant stiction (lack of correct distribution/stickiness), oxidation, tiny thermal expansion under load, capacitor degradation, and nano-scale head positioning weakness all create routes to sudden failure that don't really show up in usual restore scenarios. So do NOT rely on SMART or other usual measures to tell if a disk is OK. Assume they all are not, until you get the data safe - then test them afterwards at leisure if you want.
-
The main ways to improve chances (if anything can be recovered) are:
- aggressive HDD cooling,
- aggressive HDD vibration damping, and
- minimising all non-essential disk I/O, and
- carefully and strategically planning the entire job, as far as possible, to avoid or minimise head seek activity (long sequential access patterns only),
for long enough that ZFS can migrate copies of data to newer or more tested disks. That could be by adding disks to mirrors, removing data off top level vdevs, or send/receive to another pool.
-
Note that SSDs are still at risk (charge loss, capacitors, oxidation), but the risk is a heck of a lot lower. Prioritise getting the HDD-based top-level vdevs safe above all.
APPROACHES USED
-
Luckily I only have mirror vdevs, and all metadata is on high-quality SSD special vdevs. So I can add new HDDs using zpool attach -s, or pull data off them using send/receive.
New extra disks were my choice, because I could have better control over making I/O sequential and low-stress, I could separate the resilver from validation, and a scrub can more easily be paused without losing progress.
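As a sketch, the attach route looks like this (pool and device names here are placeholders - check your own layout with zpool status and lsblk first):

```shell
# Hypothetical names - substitute your own pool and by-id device paths.
POOL=backup01
EXISTING=/dev/disk/by-id/ata-OLD_DISK
NEW=/dev/disk/by-id/ata-NEW_DISK

# -s requests a sequential (rebuild-style) resilver: the new mirror side is
# reconstructed with long sequential I/O instead of random block order.
# Note a sequential resilver does NOT verify checksums - scrub afterwards.
zpool attach -s "$POOL" "$EXISTING" "$NEW"

# Watch progress without generating extra pool I/O:
zpool status -v "$POOL"
```

The point of -s here is exactly the head-seek minimisation discussed above: the rebuild stays sequential, and the checksum validation is deferred to a scrub you schedule on your own terms.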
-
I have a disk cradle with forced convection (if I didn't, I'd spread the disks out rather than use a caddy). Try not to use tightly attached steel caddies; those transmit vibration very well. Supported off a wooden desk (airflow all around) is better than a caddy if you can manage it. Try to use foam or something to damp all vibration (but not in a way that creates static charge risk) - critical. Get as much airflow as you can.
In my case, high static pressure 140mm fans mean every disk is essentially in a forced-air tunnel. You need to keep temps under 45C, and if you can at all, aim for under 40C. Critical. Under 40C, thermal stress is very low. I managed to get 31-36C on every disk.
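A minimal way to keep an eye on temperatures while work is running (assumes smartmontools is installed; the device list is an example - adjust to your disks):

```shell
# Poll drive temperatures every 60 seconds.
# SMART attribute 194 (Temperature_Celsius) is reported by most HDDs;
# some drives use attribute 190 instead - check `smartctl -A` output.
while true; do
    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        t=$(smartctl -A "$d" | awk '$1 == 194 {print $10}')
        echo "$(date +%H:%M:%S) $d ${t}C"
    done
    sleep 60
done
```

If anything creeps toward 45C, pause the scrub/resilver and fix airflow before continuing.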
-
Get the data safe first. Check it after. Kill all scrub activity; use sequential resilver or send/receive only. Checking the data comes afterwards - it's much lower priority. Damaged data will mostly affect individual files (assuming the metadata is on a special vdev and safe). If an HDD vdev fails before you grab its data somewhere, the entire pool is likely to be lost.
-
Throw in as much RAM as you can, and reduce every other RAM-hungry task you can. Try to give ZFS all the RAM you are able to. The more it has, the more it can buffer, reorganising scrub work from random I/O access patterns into optimised sequential I/O and minimising head seek activity. You'll need this.
In my case I have 256 GB and nothing else in use, and I gave the scrub all of it. The scrub was able to traverse 2/3 of the entire pool's metadata (which in my case was safe on special vdev SSDs, low stress to read), and as a result, when it came to read the HDD data vdevs, iostat showed 99.8% sequential access, almost all of it in 1024K blocks, at absolutely rock-steady speeds - absolute perfection, goal achieved.
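To verify you're actually achieving this, zpool iostat can print per-vdev histograms; this is how you confirm the large-block sequential pattern rather than guessing (pool name is an example):

```shell
# -r prints request-size histograms every 10s; if the 512K-1M buckets
# dominate the read columns, the scrub has gone properly sequential.
zpool iostat -r backup01 10

# -w prints latency histograms; tight, steady latencies are another
# sign the heads are streaming rather than thrashing.
zpool iostat -w backup01 10
```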
-
If you used extra mirrors, not send/receive, then once the vdevs are safe you can consider whether to send elsewhere. But if you want to further test the disks - to check not only whether the data can be saved, but whether the disks are still serviceable - make sure that disk failures cannot imperil the data. Plan for data safety before testing the disks harder.
-
If you added extra disks, used sequential resilver, and feel the data is safe, now is the time to scrub. As well as the sysctls to maximise sequential I/O, there is also a useful parameter, zfs_btree_verify_intensity. Set it to 4, temporarily. Not only will it check more thoroughly for issues in the btrees ZFS uses, it'll work the disks harder, which gently acclimatises them to being back in service and lets you monitor their condition as they go.
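Setting that parameter and kicking off the scrub might look like this on Linux (standard OpenZFS module parameter path; pool name is an example):

```shell
# Temporarily raise btree verification (0 = off / default; higher = more checks).
echo 4 > /sys/module/zfs/parameters/zfs_btree_verify_intensity

zpool scrub backup01

# Pause/resume without losing progress, if temps or vibration worry you:
#   zpool scrub -p backup01     (pause)
#   zpool scrub backup01        (resume)

# When done, put it back to the default:
echo 0 > /sys/module/zfs/parameters/zfs_btree_verify_intensity
```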
-
(Contentious, please don't argue - just mentioning it as an option!) smartctl is also your friend, but takes some interpreting. If you aren't skilled and don't mind the tree burning, throw the output of smartctl -a /dev/DISK_ID at your preferred AI and ask it to assess the risk. It'll probably do a better job of spotting patterns in the data; this is objective more than subjective, so it's less likely to hallucinate.
-
When all this is done and the data is safe, if you want to test disks for longer-term continued use, then I strongly suggest a SMART long test (smartctl -t long /dev/DISK_ID), followed by badblocks write testing (even if the long test aborts due to unreadable LBAs), IMMEDIATELY followed by a short SMART test (badblocks -wsv -p 1 -b 4096 /dev/DISK_ID && smartctl -t short /dev/DISK_ID). Then recheck the badblocks output and error detection, and the smartctl logs.
The logic here is in two parts:
- smartctl tests the disk according to the manufacturer's protocol, but badblocks specifically tests that the magnetic handling of every block, and the head, work, using patterns designed to reveal disk issues. I had a smartctl long test fail on a disk I was about to chuck, and the read/write pattern of badblocks resulted in it being "fixed" - revealing the firmware simply hadn't repaired it during SMART tests, but did fix it on normal use, and it was a non-issue after that. (Conversely, I had a second disk that looked fine in SMART tests, but badblocks revealed huge jagged spikes in the write graphs on the web UI, showing it was repeatedly failing to write and having to retry to succeed - everywhere on the disk, not one hot spot, and worsening over time. Firmware kept covering it up so no error was ever raised, but badblocks showed the true situation: the disk was basically dead behind the scenes and would soon hit the point where firmware couldn't hide its inability to head-track.) During these tests, watch iostat and the web UI graphs: look for smoothness of the curves and of the I/O stats - outliers may reveal issues that way.
… but both of those tests are largely sequential in nature. A short test immediately after badblocks completes, when the disk is already thermally exercised, will reveal whether the head transport is failing.
A disk that passes all of those and shows steady, stable I/O rates, the same-shaped graphs for each of the four badblocks write patterns, etc, is probably safe to rely on, and can continue to be monitored using the usual tools - it's probably well past the additional risk from mothballing.
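The whole sequence above, as one script. WARNING: badblocks -w destroys all data on the disk - only run this on a disk whose data is already safe elsewhere. The device path is a placeholder:

```shell
#!/bin/sh
# DESTRUCTIVE: badblocks -w overwrites every block on DISK.
DISK=/dev/disk/by-id/ata-SPARE_DISK   # placeholder - set to your spare disk

# 1. Manufacturer-protocol surface test (runs inside drive firmware).
smartctl -t long "$DISK"
# Wait for it to finish before continuing; poll its state with:
#   smartctl -a "$DISK" | grep -A1 "Self-test execution"

# 2. Destructive write test of every block (-w: four patterns, -p 1: one
#    clean pass, -b 4096: 4K blocks), then IMMEDIATELY a short SMART test
#    while the disk is still thermally exercised, to stress head transport
#    after all that sequential work.
badblocks -wsv -p 1 -b 4096 "$DISK" && smartctl -t short "$DISK"

# 3. Review the self-test log and attributes afterwards.
smartctl -l selftest "$DISK"
smartctl -a "$DISK"
```

The chaining with && matters: the short test only fires if badblocks reported a clean pass, and it runs while the disk is still hot.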
SYSCTLs USED
The sysctls I used for maximising sequential access are suited to my situation, so they're only "to give an idea of what worked for me" - adapt them to your needs:
# Use half of RAM, next setting would be 1 which allows all of RAM - concerned as this feels unsafe/OOM risk, as there will be structures outside ARC
zfs_scan_mem_lim_soft_fact -> 2
zfs_scan_mem_lim_fact -> 2 (same comment)
# Counterintuitively, use maximum queueing. Don't worry about load - it lets the HDDs firmware improve and plan disk access. By design it'll mostly be sequential anyway.
# Also ensures heads much less likely to have to stop and start, safest if they are continually loaded and sequential.
zfs_vdev_scrub_max_active -> 64
# After mothballing, really test the btrees thoroughly
zfs_btree_verify_intensity -> 4
# Allow ZFS to work with data on a very long window
zfs_scan_max_ext_gap -> 33554432
zfs_scan_vdev_limit -> 16777216
zfs_scan_issue_strategy -> 1
# Set the gap limit to 32MB
zfs_vdev_read_gap_limit -> 33554432
# Maximise ARC size 240 GB
zfs_arc_max -> 257698037760
# allow ZFS to hold more of this in RAM to prevent frequent, small flushes to the SSD.
# Allow 16GB of "dirty" (pending) writes in RAM
# Prevent ZFS from throttling until the buffer is very full
# Set primary dirty data limit to 16GB
zfs_dirty_data_max -> 17179869184
# Set absolute hard limit to 20GB
# "zfs_dirty_data_max_max" MUST BE SET AT BOOT, USE WEBUI TO SET THIS WITH OPTION "ZFS" AND CHECK ITS VALUE AFTER BOOT. DELETE AFTER TO RESET TO NORMAL
zfs_dirty_data_max_max -> 21474836480
# Delay background sync until 60% of the 16GB buffer is full
zfs_dirty_data_sync_percent -> 60
# Delay intentional I/O throttling until 80% of the 16GB buffer is full
zfs_delay_min_dirty_percent -> 80
# Setting this to 3 (down from the default) makes the kernel less likely to prune the ARC.
zfs_arc_lotsfree_percent -> 3
zfs_arc_sys_free -> 8589934592
# System is stable; write out every 10 - 15 seconds rather than every 5. Not much more work, and it reduces the amount of intermittent seeking by a bit.
# For old drives, the intermittent seeks spread lubricant and are good.
# Too large a gap (e.g. 45 secs) would also cause internally cached drive data to be evicted as larger writes coalesce.
zfs_txg_timeout -> 15
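On Linux these are OpenZFS module parameters rather than classic sysctls, settable at runtime under /sys/module/zfs/parameters. A sketch applying my values (adapt them, and note the one that cannot be set at runtime):

```shell
# Runtime-settable OpenZFS module parameters live here on Linux:
P=/sys/module/zfs/parameters

echo 2            > $P/zfs_scan_mem_lim_soft_fact
echo 2            > $P/zfs_scan_mem_lim_fact
echo 64           > $P/zfs_vdev_scrub_max_active
echo 4            > $P/zfs_btree_verify_intensity
echo 33554432     > $P/zfs_scan_max_ext_gap
echo 16777216     > $P/zfs_scan_vdev_limit
echo 1            > $P/zfs_scan_issue_strategy
echo 33554432     > $P/zfs_vdev_read_gap_limit
echo 257698037760 > $P/zfs_arc_max
echo 17179869184  > $P/zfs_dirty_data_max
echo 60           > $P/zfs_dirty_data_sync_percent
echo 80           > $P/zfs_delay_min_dirty_percent
echo 3            > $P/zfs_arc_lotsfree_percent
echo 8589934592   > $P/zfs_arc_sys_free
echo 15           > $P/zfs_txg_timeout

# zfs_dirty_data_max_max is read-only at runtime - it must be set at module
# load (boot) time, which is why I set it via the web UI as noted above.
```

Record the original values first (cat each file) so you can restore everything once the recovery is done.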
NOTE ON HOW THIS AFFECTS SCAN/SCRUB
You'll notice unusual scan patterns with these (unsurprising - that's the aim). It should say "scanning" for many hours, with zero issued for hours.
These settings will very much force the scan to run in two phases. In the first, it traverses huge parts of the metadata trees, figuring out which blocks need checking and what order to check them in. This is all random access, but for me, metadata seeks all hit the special vdevs, so no head seek issues.
Then in the second phase, it reads the blocks it found need checking, but does so with long sequential accesses, because it could organise and optimise in RAM - it had loaded so much of the metadata tree, and so much of the data had been identified as needing to be checksummed and validated.
So you'll see distinct "scanning" and "issuing" phases of many hours each.
ONE LAST TRICK
As my pool is deduped, if I want to grab the snapshot history from the original full pool, I have a second option: I can send/receive it to this one, and it'll take almost no extra space, because virtually all the files in the historical snaps should dedup against the current data. I haven't yet decided whether to do that or not... but using dedup this way is elegant if I need to!
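If I go that route, it's just a normal replicated send (pool and dataset names here are hypothetical):

```shell
# -R replicates the dataset together with its full snapshot history.
# On a deduped receiving pool, blocks identical to current data cost
# only dedup-table entries, not new on-disk space.
zfs send -R oldpool/data@latest | zfs receive -u backup01/data-history
```

-u leaves the received dataset unmounted, which keeps incidental I/O off the pool while the transfer runs.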
FINAL THOUGHTS
I might write this up as a guide some time, but for now it's at least here in case others need it.
Thank you everyone, once again - and please comment with any improvements, for others in future!