LOG VDEV corrupted?

Hi,

I have issues with my Storage Pool’s Log VDEV (2 x DISK | 1 wide | 931.51 GiB)

Since a power outage that my UPS didn’t protect against, the Log VDEV has been causing problems:

  • boot taking an unusually long time
  • ix-netif.service/start taking a long time
  • ix-zfs.service/start taking a long time
  • Docker apps on the affected storage crashed the NAS
  • unable to simply remove the Log VDEV via GUI without the NAS crashing

A data scrub didn’t show any error messages

SMART showed nothing I could identify as problematic

And here’s the Pool status:

truenas:~$ sudo zpool status -v
  pool: RZ1_3x16TB
 state: ONLINE
  scan: scrub repaired 0B in 18:23:30 with 0 errors on Sun Jan 12 18:23:34 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        RZ1_3x16TB                                ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            5b569614-3b3e-4b48-a410-04fd3cc66fca  ONLINE       0     0     0
            be781cec-f5d0-4401-8a75-aebdd0ab9242  ONLINE       0     0     0
            dba821db-5293-47c6-a0d9-97b9537b1b71  ONLINE       0     0     0
          raidz1-2                                ONLINE       0     0     0
            2b748709-ac9c-46bc-943b-dcebaf5b0a0a  ONLINE       0     0     0
            8e75bf79-64bc-4673-9a9e-b258394ab922  ONLINE       0     0     0
            551e3e13-cca0-4b1a-8898-05cbc58d5716  ONLINE       0     0     0
        logs
          e5366e88-a61c-4f1a-a5fb-9d57d79f7bc6    ONLINE       0     0     0
        cache
          a6580a44-271f-4d9b-9908-469ca2d590b4    ONLINE       0     0     0
          ad793405-0fdb-4cc9-8838-84977be3e318    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Jan 17 03:45:24 2025
config:

        NAME           STATE     READ WRITE CKSUM
        boot-pool      ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme2n1p3  ONLINE       0     0     0
            nvme4n1p3  ONLINE       0     0     0

errors: No known data errors

I’m wondering if there’s a way to properly remove a SLOG from a pool. Via CLI? Is there a way to “force” it, since it’s really unstable?

I was able to Offline + Remove one of the drives, but the second LOG drive crashes the NAS when put Offline, showing these lines beforehand:

Afterwards, I’ll be adding the LOG back from scratch to get things back on track, if possible

Thanks in advance!

In general, LOG / SLOG device(s) should have PLP, Power Loss Protection. Many do not, and it appears yours got corrupted on power loss.

At pool import, any active data in the LOG / SLOG that has not been written to the data vDevs should automatically be flushed to the data vDev(s). But, I guess if the data in the LOG / SLOG is corrupt, ZFS might not do it.

It is possible there is a bug in the GUI preventing LOG / SLOG removal. Probably something like this will do it:

zpool remove RZ1_3x16TB e5366e88-a61c-4f1a-a5fb-9d57d79f7bc6
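
If the removal goes through, the logs section should no longer appear when you check the pool afterwards:

zpool status -v RZ1_3x16TB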

Last, do you really need a LOG / SLOG vDev?

The main purpose is for synchronous writes, like for iSCSI, NFS, database storage, and VM storage. A LOG / SLOG is not a general-purpose write cache.
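
If you want to double-check how your datasets handle sync writes, the sync and logbias properties are worth a look, for example:

zfs get -r sync,logbias RZ1_3x16TB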

Thanks for the reply!

I tried the command line, but the NAS crashes a few seconds afterwards.

Here’s a video recording of the NAS screen before and right after the crash.

Is there something else that can be done?
Is there something weird showing up in the boot process?

As can be seen, the following operations take a long time:

middlewared.service
ix-zfs.service
ix-netif.service

And yes, I need a SLOG since I’m using NFS and VM storage

Thanks in advance!

VM storage on raidz1? That’s not a recommended layout (mirrors all the way).

At worst, I suppose you can physically remove the corrupted drive and import the pool with the -m option, losing some transactions in the process.
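
Roughly, with the log drive physically pulled, something along these lines should do it (pool name taken from your zpool status output; the export only matters if the pool is still imported at that point):

zpool export RZ1_3x16TB
zpool import -m RZ1_3x16TB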

I agree with @etorix, this seems like your next step.

Thanks for the replies
I exported / reimported the pool via this command:

zpool import -m -o altroot=/mnt RZ1_3x16TB

The operation succeeded and I was able to see the datasets, but nothing showed up in the Storage Dashboard

I did a restart and the Datasets weren’t there anymore

I tried to import the pool back from the GUI, which ended with an error message stating:

[EZFS_BADDEV] Failed to import 'RZ1_3x16TB' pool: cannot import 'RZ1_3x16TB' as 'RZ1_3x16TB': one or more devices is currently unavailable

Is there a way to properly mount the pool in the Storage and Datasets sections while avoiding importing the SLOG?

I also tried this combination

zpool import -m -o altroot=/mnt RZ1_3x16TB
zfs set logbias=throughput RZ1_3x16TB
zpool remove RZ1_3x16TB 5258535957260169461
(the new ID of the log drive, since it was physically removed)

It made the server crash again

The issue here is that the GUI / Middleware is unaware of things you do with raw ZFS commands. Some ZFS command strings are harmless to the GUI / Middleware’s understanding of the pool; others, like importing the pool from the command line, are not.

So, the real thing to do is:

  1. Physically remove the SLOG / LOG device from the server.
  2. Via command line, import the pool with “-m”.
  3. Again from the command line, remove the missing SLOG / LOG device from the pool.
  4. And last, again from the command line, export the pool (see the command sketch below).
  5. Now you import the pool with the GUI.
  6. Verify everything is working (well, obviously you don’t have a SLOG / LOG anymore).

If you so desire to re-add SLOG / LOG device(s), you can now do so.
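
Put together, the command-line part looks roughly like this, using your pool name and whatever identifier zpool status reports for the missing log device after the import (as you noticed, it can show up as a numeric GUID once the drive is physically pulled):

zpool import -m -o altroot=/mnt RZ1_3x16TB
zpool remove RZ1_3x16TB e5366e88-a61c-4f1a-a5fb-9d57d79f7bc6
zpool export RZ1_3x16TB

If you re-add a SLOG later, doing it from the GUI keeps the Middleware aware of it.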

Some of us know those steps, so we accidentally skip over them when walking a user through the process.

Got you! I understood those implied steps.
Though I tried them (as mentioned just before you replied, I think!), it made the NAS crash again, even with the drive physically removed and the -m option specified

Anything you can think of to avoid this recurring crash?

Sounds like the issue is beyond me. Possibly pool corruption, though I don’t know why (you have ECC memory).
