Trying to import with the “-n” option does not actually do the import. This is what you want:
sudo zpool import -fFXT 3895701 -R /mnt VoxelZ2
Reminder, this will permanently throw out the most recent writes… But at this stage, that is probably a good thing because that is likely where the corruption lies.
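To make the difference concrete, here is a minimal sketch of the two forms as shell functions, using the pool name and TXG quoted above. The function names and the ZPOOL override are only for illustration, and nothing runs until you call one. I left -T out of the dry run because I am not certain -n is honoured together with an explicit TXG; check zpool-import(8) on your version.

```shell
#!/usr/bin/env bash
# Sketch only. Run as root. ZPOOL can be overridden (e.g. ZPOOL=echo)
# to preview the command line instead of executing it.
ZPOOL="${ZPOOL:-zpool}"

# Dry run: -n together with -F only reports whether recovery could work.
# It does not import the pool and discards nothing.
dry_run_import() {
    local pool="$1"
    "$ZPOOL" import -f -F -X -n -R /mnt "$pool"
}

# Real rewind: imports the pool at the given TXG and permanently drops
# every write made after that TXG.
rewind_import() {
    local pool="$1" txg="$2"
    "$ZPOOL" import -f -F -X -T "$txg" -R /mnt "$pool"
}

# Usage:
#   dry_run_import VoxelZ2
#   rewind_import VoxelZ2 3895701
```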
I know that; it was a dry run, meant to see whether the pool would import fine…
This is what you want:
sudo zpool import -fFXT 3895701 -R /mnt VoxelZ2
Done that again; the result is:
‘can’t import as there is/are some disk(s) not available…’
By the way, there is one unavailable disk. It died hard (no spin, no life) just after I had begun expanding the pool by one new disk (only about 5 GB in). It was replaced by a new one, which started a resilver for the replacement and put the expansion on hold…
Also, when I disconnected the dead drive, I forgot to offline it from the pool first, which is why it is marked as ‘non available’… (the disk was returned to the seller as it was still under warranty)
That is why I asked for a way of ‘offlining’ that disk before importing the pool again, or whether it was possible to force the import to ‘forget’ that disk completely, as if it had been offlined…
The missing disk shall disappear for good when resilver is complete, and then expansion shall resume. The issue is that… the pool has other issues.
A raidz2 pool should import with one missing disk, if only as “degraded”.
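A quick way to see what ZFS itself considers missing, sketched as a small helper (the function name and the optional search directory are illustrative): with no pool name, zpool import only scans and lists importable pools together with the state of each member device. It does not import anything.

```shell
#!/usr/bin/env bash
# Sketch only. Run as root. ZPOOL can be overridden for a preview.
ZPOOL="${ZPOOL:-zpool}"

# List pools that could be imported, with per-device state.
# Optional argument: a directory to search for devices (passed to -d).
list_importable() {
    if [ -n "${1:-}" ]; then
        "$ZPOOL" import -d "$1"
    else
        "$ZPOOL" import
    fi
}

# Usage:
#   list_importable
#   list_importable /dev/disk/by-id
```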
Well, the fact is that the drive died under the latest TrueNAS SCALE v24.xx and, as stated in the other thread, I began to experience that damn ‘watchdog’ bug while the drive was being replaced…
So when v25 became available I did the system and ZFS upgrades, thinking the problem would disappear… it did not.
Then v25 was upgraded again to current, and the problem was still there.
By the way, upgrading to v25 came with other problems as well: all the CPU, disk, etc. widgets stopped working and only showed blank space, and the ‘report’ area was also blank in all tabs…
I came to see that the culprit for all this was the ‘prefs’ saved by v24.xx. How? Well, everything worked again after a clean install of v25, but it went back to being badly bugged as soon as the old v24 prefs (saved before the upgrade) were uploaded to the NAS, replacing the v25 default ones…
this is all that happened to the pool since the disk died.
At this point, analysing the damage is above my pay grade. I defer to @HoneyBadger , @PK1048 or @Arwen to give indications as to possible ways forward and associated risks.
Has the expansion completed? Note that when you do a vdev expansion you should run a scrub immediately as the expansion resilver does not confirm checksums. So if the expansion is done, then start a scrub and wait for it to complete. Note any errors reported by the scrub.
Let the system stabilize, then we can see what else might still be wrong.
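For when the pool is imported and the expansion has finished, the scrub step is roughly this. It is a sketch: the helper name is made up, and the pool name is the one from this thread.

```shell
#!/usr/bin/env bash
# Sketch only. Run as root. ZPOOL can be overridden for a preview.
ZPOOL="${ZPOOL:-zpool}"

# Start a scrub, then print the pool status.
# "zpool scrub" returns as soon as the scrub has started, so the status
# shown here is the initial progress; re-run "zpool status -v" later to
# see the final result and any files with errors.
scrub_and_report() {
    local pool="$1"
    "$ZPOOL" scrub "$pool" || return 1
    "$ZPOOL" status -v "$pool"
}

# Usage:
#   scrub_and_report VoxelZ2
```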
Nope, the expansion ran for as little as 5 GB, then the other drive died, so the expansion is on hold until the resilver for the dead drive is finished (it was 73% done). Even if I lose some files, 5 GB out of 110 TB would not be too bad.
Now, what to do, as the import still fails?
Scrubbing needs the pool to be imported first, no?
Yes, the zpool must be imported before the resilver can complete.
Given that you mentioned doing other things besides the 4 items I listed above, I have lost track of everything that has happened.
Do you still have all of the drives involved, including the failed one?
Please assemble a complete timeline (times and dates are not important, the actions and order they occurred are) of everything that has happened to this system since right before you grew the vdev. So the first item should be something like “VoxelZ2 zpool consists of 4 vdev each 4 x 10TB HDD RAIDz2”, then walk through what you did and what the response of the system was. Make sure to include outside events such as power loss or cat rebooted system.
The system was created as TrueNAS SCALE (just after the name change from FreeNAS), with 10 x 10 TB disks in one RAIDZ2 zpool; it worked fine.
Then, when available, it was expanded to 14 disks in the same zpool, one at a time. The expansion caused some resilvering because a first disk was faulty (too many errors) and was replaced by a new one; it took time but finished fine anyway.
Added tons of files, no problems.
Another disk faulted :( This time I put 4 x 20 TB disks in, copied all files to them, then destroyed the zpool, killed the bad disks, put in new ones, reinstalled TrueNAS SCALE from scratch and re-created the RAIDZ2 zpool with 16 disks. Then I copied the contents of the 20 TB disks back into the brand new zpool.
Added tons of files, no problems, but the zpool was more than 80% full, so I wanted to expand it again.
That is when the other disk died like a bad piece of junk :( The expansion halted; the dead disk (no spin, no life, still under warranty, it was new…) was returned to the reseller for replacement. I put a brand new disk in its place and restarted the zpool.
It began a resilver to replace the ‘not available’ disk and got up to 73% completion.
That is when the problems began: the watchdog suddenly crashing, without much explanation, halting all work on the zpool…
Then the first upgrade to v25 appeared. Thinking it could cure the watchdog bug, I did the update… and got more problems, as described before: widgets not functioning when the old prefs were used (why on earth weren’t the prefs ‘upgraded’ too at upgrade time???), and still the watchdog bug…
So I killed the old prefs and recreated them under v25, imported the zpool, and the resilver ran until another damn watchdog bug…
Then one night there was a big downpour with plenty of lightning that killed the town’s electricity supply (I was sleeping at the time). It took almost one full day to repair before the electricity came back.
The UPS was about 90% depleted and powered off the machine…
When I restarted the TrueNAS, the zpool complained that it couldn’t be imported because of too many errors…
I checked all disks individually on another machine (just the SMART part); all reported fine.
Except one disk: it showed lots of errors in the NAS but none in the other machine…
I ran ddrescue on it to another blank new disk; it took nearly 15 days and showed absolutely no errors.
I checked the NAS cabling, and as one of the ports on the RAID expander card was perhaps defective, I recabled the NAS as shown at the top of this thread.
Reinstalled the system to the latest v25 version, redid the prefs again, and booted.
No errors shown in boot log from the disks this time.
Tried to import the zpool, and that is where it began to refuse to import (back to the beginning of this thread).
Argh.
Ah, and the NAS memory was tested for 24 hours with no errors.
What do you mean by watchdog crashing? I always leave the BIOS level watchdog timers disabled as I have had them false positive reset systems.
If the resilver is what crashed, then it was time to try an import to a previous TXG, but I assume 73% of the way back in time is a large number of TXG.
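To get a feel for which TXGs are still recorded on a disk, something like the following may help. It is a sketch under assumptions: zdb -ul prints the uberblocks stored in a device's labels, and I am assuming the usual "txg = N" lines in its output. The helper name and the device path in the usage line are placeholders, not values from this system.

```shell
#!/usr/bin/env bash
# Sketch only. Read-only inspection; run as root.
# ZDB can be overridden, which is mainly useful for testing the parsing.
ZDB="${ZDB:-zdb}"

# Print the distinct TXG numbers found in the label uberblocks of one
# pool member, in ascending order.
list_label_txgs() {
    local dev="$1"
    "$ZDB" -ul "$dev" \
        | awk '$1 == "txg" && $2 == "=" { print $3 }' \
        | sort -un
}

# Usage (placeholder path; point it at the partition that holds ZFS data):
#   list_label_txgs /dev/disk/by-id/EXAMPLE-part1
```

A TXG being listed in a label does not by itself mean the pool will import cleanly at that TXG.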
Everything you have done since has just made the problem worse by changing the environment. I’m not sure how to recover at this point.
The resilver starts, works a little, then the watchdog crashes, everything hangs for a while, then it crashes again…
I saw a lot of these, like in the pic, one after another after the NAS had run for a night…
That looks pretty straightforward, you have a bad CPU core/thread. CPU4 hung for over 6 seconds.
I do not think there is any I/O with a timeout that long, since we want to catch bad disks before they hang for 6 seconds, so I am leaning towards a hardware fault in the CPU.
Could the CPU be overheating under heavy load?
I would go into the motherboard setup utility and disable HyperThreading to reduce the load to one per core instead of two per core and see if it still hangs.
@HoneyBadger Jump in here if I am missing another possible cause for a CPU hang.
I don’t suspect a hardware fault here @PK1048 but rather an overly aggressive watchdog timer deliberately footshooting.
Kernel panic after just over six seconds is a far too aggressive timer - a ZFS transaction group even with the defaults could be as large as ~4GB, which could take more than that long to commit to a narrow/slow pool.
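As back-of-the-envelope arithmetic for that claim, here is a small sketch. The write rates are made-up illustrative figures, not measurements from this pool, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Whole seconds (rounded up) needed to write one TXG of a given size
# at a given sustained write rate.
# Arguments: TXG size in MiB, write rate in MiB/s.
txg_commit_seconds() {
    local txg_mib="$1" rate_mib_s="$2"
    echo $(( (txg_mib + rate_mib_s - 1) / rate_mib_s ))
}

txg_commit_seconds 4096 500   # ~4 GiB TXG at an assumed 500 MiB/s
txg_commit_seconds 4096 200   # same TXG at an assumed 200 MiB/s
```

With either assumed rate the result is above six seconds, so a timer that short could fire on a pool that is merely slow.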
@jeffv03 can you explore your system’s BIOS and the Advanced menus to look for a “Watchdog” or similar timer setting, and if present, disable it?
Running echo 0 > /proc/sys/kernel/nmi_watchdog as root may also be able to disable it in software.
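The same thing can be sketched with sysctl, which avoids a common trap: with sudo echo 0 > /proc/... the redirect is performed by your unprivileged shell, so the write fails. The helper names below are illustrative; kernel.nmi_watchdog and kernel.watchdog_thresh are standard Linux sysctls, but I am assuming both are present on this kernel.

```shell
#!/usr/bin/env bash
# Sketch only. Run as root. SYSCTL can be overridden for a preview.
SYSCTL="${SYSCTL:-sysctl}"

# Show the current NMI watchdog state and the lockup threshold (seconds).
show_watchdog() {
    "$SYSCTL" kernel.nmi_watchdog kernel.watchdog_thresh
}

# Turn the NMI watchdog off until the next reboot.
disable_nmi_watchdog() {
    "$SYSCTL" -w kernel.nmi_watchdog=0
}

# Usage:
#   show_watchdog
#   disable_nmi_watchdog
```

A value set this way does not survive a reboot, so it has to be reapplied before the next import attempt.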
hello
Well, I looked at the BIOS and haven’t found any reference to a watchdog… By the way, while at it I turned off all the SMP, hyperthreading and similar options that were available.
I never use overclocking so it wasn’t necessary to look at it.
so I’ve used your cli line to kill the nmi_watchdog.
Tried importing the pool and still get the ‘can’t import as one or more disks are unavailable’ message…
As there were ZFS updates (the new stuff in v24-25) BUT no new files were added after that, can I revert to v24.10 to try importing the pool, even read-only?