Lost power, now my pools and shelf don't want to work

A while ago I changed my setup from my signature (which still will not save changes, btw) to having 12x 12 TB HC520 SATA drives in one pool and the 12x 2 TB and 3 TB SATA WD Blues in another pool in my DS4246 (no interposers installed), with a 9305-16e to connect to it. It worked fine this whole time UNTIL last night, when the GFCI plug everything is connected to tripped for some reason while I was out of the house.

That plug is on the same circuit as my Christmas lights outside (for some reason), so I don't know if it was caused by some rain that got into a plug or something, or by my server. The whole server rack only uses 300 W max and I've been using that same exact plug for 2+ years now. I definitely don't have a kilowatt of Christmas lights outside, and it didn't trip the breaker, so it's not a normal overload problem.

However, when I got home to reset everything, I found that now:

  • my HC520 pool keeps failing to import and hanging
    • if I zpool import -f it will import, but it fails to create a mount point, comes up read-only, and ignores -F (see the command sketch after this list)
      • trying to manually set a mount point while it is read-only causes it to hang forever
    • the pool is listed as healthy and all 12 drives are ONLINE with 0/0/0 read/write/checksum errors
    • I get txg_sync traces in the log after it's hung for a while, see below
  • the WD Blue pool will import, BUT the shelf resets itself once TrueNAS boots (all HDD lights go out, then come back on after a few seconds), presumably during the import process, so the pool ends up suspended and even refuses to be exported
    • the import is successful and the pool is fully functional if I turn on the shelf and import after TrueNAS boots
    • it also reads as healthy with 0 errors
  • ix-zfs.service hangs forever during the boot process if the HC520s are connected (it does not do this with the WD Blues), when normally it completes when its timer hits 1:30 to 1:45
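
For reference, roughly what those import attempts look like (tank is a placeholder pool name, not my actual one):

zpool import -f tank    # forced import: succeeds, but comes up read-only with no mount point
zpool import -f -F tank    # recovery mode (rewind a few txgs): the -F just gets ignored
zfs set mountpoint=/mnt/tank tank    # setting a mount point while read-only: hangs forever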

I've tried:

  1. starting the shelf and/or inserting the drives or the SAS cable after TrueNAS boots (see results above)
  2. trying 2 other known-good HBAs (9300, 9206)
  3. attaching the HC520s to the 9300 via SAS-to-4x-SATA cables (acts exactly the same as being in the shelf)
  4. upgrading the 9305 firmware (16 to 16.00.20)
  5. booting with 3 of 12 HC520s removed (Z3 pool) to see if I could isolate any bad drives causing the hanging (all 12 were tested over 4 boots, no luck)
  6. trying all of the different ports on the 9305 and 9206
  7. trying both QSFP slots on the IOM
  8. swapping the IOMs in case the top one was malfunctioning
  9. trying the HBAs in all 4 PCIe slots
  10. verifying the shelf, PSUs, and IOMs have no error lights
  11. reinstalling TrueNAS, with and without restoring the config

I'm kind of at my wits' end here. I'm going to guess something(s) in the shelf are toast, but I'm PRAYING it didn't damage my HC520s. Like I said, the Blues seem completely unaffected and the HC520s all report good via ZFS, so it must just be something wrong with the state of the data combined with the now-unreliable shelf, but I don't know what to do or how to fix any of this.

What a great Christmas present I got. I love it and I'm SOOOOOOOOOO HAPPY!

Here is one of the txg_sync traces that I copied while attempting to import the HC520 pool:

Dec 24 09:39:04 TrueNAS kernel: task:txg_sync state:D stack:0 pid:14857 tgid:14857 ppid:2 flags:0x00004000
Dec 24 09:39:04 TrueNAS kernel: Call Trace:
Dec 24 09:39:04 TrueNAS kernel:
Dec 24 09:39:04 TrueNAS kernel: __schedule+0x461/0xa10
Dec 24 09:39:04 TrueNAS kernel: schedule+0x27/0xd0
Dec 24 09:39:04 TrueNAS kernel: schedule_timeout+0x9e/0x170
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_process_timeout+0x10/0x10
Dec 24 09:39:04 TrueNAS kernel: io_schedule_timeout+0x51/0x70
Dec 24 09:39:04 TrueNAS kernel: __cv_timedwait_common+0x129/0x160 [spl]
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 24 09:39:04 TrueNAS kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 24 09:39:04 TrueNAS kernel: zio_wait+0x11a/0x240 [zfs]
Dec 24 09:39:04 TrueNAS kernel: dsl_process_async_destroys+0x326/0x580 [zfs]
Dec 24 09:39:04 TrueNAS kernel: dsl_scan_sync+0x184/0xa10 [zfs]
Dec 24 09:39:04 TrueNAS kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Dec 24 09:39:04 TrueNAS kernel: spa_sync_iterate_to_convergence+0x127/0x200 [zfs]
Dec 24 09:39:04 TrueNAS kernel: spa_sync+0x266/0x460 [zfs]
Dec 24 09:39:04 TrueNAS kernel: txg_sync_thread+0x1ec/0x270 [zfs]
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Dec 24 09:39:04 TrueNAS kernel: thread_generic_wrapper+0x5d/0x70 [spl]
Dec 24 09:39:04 TrueNAS kernel: kthread+0xd2/0x100
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_kthread+0x10/0x10
Dec 24 09:39:04 TrueNAS kernel: ret_from_fork+0x34/0x50
Dec 24 09:39:04 TrueNAS kernel: ? __pfx_kthread+0x10/0x10
Dec 24 09:39:04 TrueNAS kernel: ret_from_fork_asm+0x1a/0x30
Dec 24 09:39:04 TrueNAS kernel:


I wish I could help you, but that is out of my depth, and there is no reason to make your life more miserable by giving you bad information.

Maybe Santa will bring you a UPS. :wrapped_gift:


Without getting into the specifics of the NEC, or of GFCI receptacles vs. GFCI breakers vs. arc-fault combo breakers: the GFCI likely tripped because rain got into the light string and caused enough current leakage to trip it. No overload. The GFCI did what it was designed to do: remove power from the protected circuit. Yes, outdoor receptacles can be on the same circuit as indoor ones. If the GFCI tripping caused the receptacle string to lose power, that would be just like a simple utility power outage, nothing more.

My guess is this is likely something simple, but you could make it into a mess if you're not careful. Give this a try: power everything off completely. Then turn on the disk shelf and wait a short bit. Then turn on the main server. My QNAP system was picky about the order of operations, and the disk shelf may be also; the reason was that if the expansion chassis was not up before the main server looked for it, the server would say it was not found and fail to load the pool, or throw some other bad error.

Sometimes stuff takes time to sort itself out, so I would make sure the configuration and connections are consistent with what they were before the issue, and be patient. Maybe something needs a bit to sort itself. Throwing various different hardware at it is likely to get things into a state of no return.

Re: the HC520 pool.

Let's first establish a baseline.

Can you import the pool readonly?
zpool import -o readonly=on -f -R /mnt tank

It looks like it was trying to run an async_destroy task during the power-off and is now trying to continue freeing data on import. Seeing as you are seeing hangs, most likely it is now in an inconsistent state, causing a loop/hang on import.
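
If you want to confirm that theory, a read-only import followed by a check of the pool's freeing property should show a non-zero value while an async destroy is still pending (same placeholder pool name as above):

zpool get freeing tank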

You might be able to import it with zfs_free_leak_on_eio=1 to break out of the zio_wait timeout (though that's just speculation, not something I've actually tested!).


I've tried waiting for the shelf and drives to all fully power up and settle before starting the computer, and it is the exact same outcome (hangs on ix-zfs.service). It has also sat overnight for 8+ hours after I gave up for the night, which didn't help at all come morning, and it has additionally sat for an hour or more at a time as I gave up and walked away to eat or whatever throughout today.

Yes, it will import, and it does so pretty much immediately, and it shows all of the data intact as far as I can tell. This is in a Mint live USB; however, exporting and then trying to import non-readonly immediately after still results in a hang.

Is this an import -o option?

It's a module parameter rather than an import option. Try setting it, then importing:

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_free_leak_on_eio
sudo zpool import -f -R /mnt tank

In theory it tells ZFS to just ignore (leak) the bad free-queue entries. There shouldn't be any risk to data here, as it should just skip over the failures.
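
You can also read the parameter back to confirm it took effect before importing:

cat /sys/module/zfs/parameters/zfs_free_leak_on_eio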


It's been running for a few minutes; the drives are thinking really hard, so it's definitely doing something, but I keep getting "txg_sync blocked for more than xxx seconds" messages as time passes (and occasionally "task zpool blocked" etc. too), and the import task hasn't completed and given me back the command line yet. Is this supposed to take a while, or should it have finished relatively quickly? Also, I pulled the drives back out of the shelf and hooked them to my 9300 again for maximum connection reliability.

Also, I can confirm it definitely was my Christmas lights that knocked out my server and started this whole misadventure; they just did it again right before I tried this, so I unplugged them for the time being because I'm running out of hair to pull out.

I would give the import some time, as zio_wait will be timing out/erroring out for each bad entry in the async_destroy queue. The "task blocked" messages are just the kernel complaining that the thread has been holding a lock for a long time.
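
If you want to watch it churn from another terminal, following the kernel log is harmless (just an observation aid, nothing required):

sudo dmesg -wT | grep -i txg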

All right, guess I’ll just try to be patient :face_exhaling:

OH OH IT FINISHED! Now what do I do? Export and a normal import? Scrub? Tell me, oh ZFS whisperer, how to finally end this debacle!


As it's finished, you should be okay to export and reset the parameter, then try importing from the WebUI:

zpool export tank
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_free_leak_on_eio

Edit: I would of course recommend running a scrub after.
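
For completeness, starting and checking the scrub from the shell looks like this (tank as the placeholder again; the WebUI works too):

zpool scrub tank
zpool status -v tank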


Dude, you 1000000% completely saved my ass, and literally Christmas too, because I was NOT having a jolly old time dealing with this.

I tried to export the pool, but it was taking a while and I had to go, so I let it do its thing while I was away. I got back and it had completed. I re-imported using the GUI, and after a nerve-wracking 10 minutes or so it successfully imported! I was going to immediately start a scrub, but the drives were way too hot from running so long without airflow, so I shut down the server and proceeded to pop out and hook up the other pool of WD Blues to do the same thing. After that finished (near-immediately; they must not have been in nearly as bad a shape) I shut down the server again, put all the drives back in the shelf, fired it up, and then fired up the server (using the 9300 with the 8i8e adapter instead of the 9305; it was my previous setup and 100% reliable).

To my glee the startup sequence went perfectly and ix-zfs.service completed at about 1:30 as it always did. Once I got into the GUI I saw both pools were good to go and healthy (I didn't export them after the successful GUI import). I did have to do zfs set mountpoint=/poolname poolname for the HC520 pool, as it had mounted to /mnt/mnt; the WD Blue pool had mounted correctly on its own.

I fired up a scrub for both pools and am waiting on them to complete as I write this. I've turned all of my apps and shares off to help the scrubs go as fast and smooth as possible, and hopefully :crossed_fingers::crossed_fingers::crossed_fingers: when they finish, any remaining problems will be resolved and everything will be 100% functional and back in action.
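
Side note in case anyone hits the same /mnt/mnt quirk: the two properties that control where a pool lands are the pool's altroot and the root dataset's mountpoint, so inspecting both is a reasonable first check (poolname as a placeholder; I'm not certain which of the two was off in my case):

zpool get altroot poolname
zfs get mountpoint poolname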


No problem. Give me a shout if any issues come up, and have a good Christmas!


Just wanted to update real quick: after 30 hours my scrub finished and returned 0 errors fixed. There were 2 more txg_sync traces in the log, but they both happened around midnight (one yesterday and the other today) and didn't seem to cause any issues. Nearing the end of the scrub I checked the remaining time and, to my surprise, the progress was at 107%. I was concerned, because most people online say this happens due to a failing drive (or lots of data being written during the scrub, which shouldn't have happened here), but it finally finished at around 108%. I put the 9305 back in and fired the server back up; again the boot sequence went exactly as it used to, and after turning the shares and apps back on, so far everything seems to be in order and good to go.

So a BIG thanks again to @essinghigh for coming in clutch and seriously saving my bacon! :folded_hands:

PS: @joeschmuck, I am actually eyeing a nice APC UPS on eBay right now; if it doesn't go into stupid territory I'll likely end up getting it.


I'll tell you, my UPS has saved my butt many times. I live in a place known for electrical storms and power outages; even brief ones still occur. I think I have 5 or 6 UPSes in the house. Well worth the money just to reduce the stress of recovery.

Glad you were able to get your system back online. @essinghigh provided you great advice; my advice would have been to reformat it all. Ha ha, maybe not. :laughing: