TrueNAS CORE - SLOG Failure Testing - RMS-200

Hello, just reaching out to see if anybody else is seeing the same results as I am with the Radian RMS-200. To summarize the testing: if the RMS-200 dies or is removed, the pool goes hard down instead of showing a degraded state. I have tested this with two different RMS-200 cards. It is recoverable, though.

TrueNAS - SLOG Failure Testing

Pre-Setup:

This is a test environment; no data has been written to these drives. The purpose of this test is to check whether the pool survives if the SLOG device (RMS-200) fails. When I refer to the SLOG device, I am talking about the Radian RMS-200.

Platform: Generic

Version: TrueNAS-13.0-U6.1

CPU: Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz

Slog Device: Radian RMS-200 rev04

Drive Layout: 12 x Mirrored VDEV
4TB - Z1Z8AAPK 4TB - Z1Z8AH2N 4TB - Z1Z5Z66N 4TB - Z1Z907ZS
4TB - Z1ZAP2F5 4TB - Z1ZARRDN 4TB - Z1ZARQY2 4TB - Z1ZARRWS
4TB - Z1ZAT5N8 4TB - Z1ZARFSA 4TB - Z1ZASLFE 4TB - Z1ZAVJD9
4TB - Z1ZASMKQ 4TB - Z1ZART09 4TB - Z1ZAVJAH 4TB - Z1ZARQ6A
4TB - Z1ZARSJM 4TB - Z1ZART5R 4TB - Z1ZARRKL 4TB - Z1ZASM9Y
4TB - Z1ZARQKG 4TB - Z1ZARR9G 4TB - Z1ZARS4R 4TB - Z1ZAL92Y

Test 1: Removal of slog device while system is powered off.

Description:

In this test I will be removing the SLOG device completely from the system. The system still has AC power to the board but is powered down. No other system components will be changed during this test; the only variable is the SLOG device. After removing the SLOG device, we will power the system back on.

Results:

Disks:

All disks are listed except the SLOG device.

Pool Status:

The pool shows offline and not in a degraded state.

ZPool Status in shell:

Only the boot pool shows up.

Conclusion:

Removing the SLOG device while the system is powered off leaves the ZFS pool offline. This is a ‘hard down’ state: the pool is inaccessible through the normal GUI until corrective action is taken. Resolving it requires shell access to run specific ZFS commands; once the pool is force-imported and the necessary adjustments are made in the TrueNAS interface, it can be restored to a healthy state. The test shows that the pool survives the hardware change, but administrative intervention is required to recover it.

Fix:

  1. Importing the Pool:
  • Use the command zpool import -m -f <pool_name> to forcefully import the pool.
    • -m allows the pool to be imported even though its log device is missing.
    • -f forces the import, useful if the pool was not properly exported or the system thinks it’s still in use.
  • Example: zpool import -m -f mypool
  2. Checking Pool Status:
  • Run zpool status to check the health and status of the pool. It should now show the pool in a degraded state. Follow the next steps to remove or replace the SLOG device.

This should also be reflected in the GUI.

  3. Removing or Replacing the SLOG:
  • Access TrueNAS Web Interface.

  • Go to Storage > Pools > Status.

  • Find your pool and select the SLOG device you wish to replace or remove.

  • To remove:

    • Select the SLOG device and choose the option to remove it.
  • To replace:

    • Select the SLOG device and choose the option to replace it.
    • Follow the prompts to add the new SLOG device.

Post-Operations Checks:

  • Verify Pool Status: After replacing the SLOG, run zpool status again to ensure the pool is healthy.
  • Monitor Performance: Check if the performance meets your expectations.
  • Data Integrity Check: Consider running a scrub to verify data integrity: zpool scrub <pool_name>.
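
Putting the shell portion of that fix together, a minimal recovery sequence might look like the following sketch. The pool name tank and the gptid/xxxx and gptid/yyyy device names are placeholders; substitute your own pool name and the identifiers reported by zpool status.

    # Import the pool even though its log device is missing
    zpool import -m -f tank

    # Confirm the pool is imported and now reports DEGRADED
    zpool status tank

    # Drop the missing log vdev (use the name or GUID shown by zpool status)
    zpool remove tank gptid/xxxx

    # Optionally attach a replacement SLOG device
    zpool add tank log gptid/yyyy

    # Verify data integrity afterwards
    zpool scrub tank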

Test 2: Removal of slog device while system is powered on.

Description:

In this test I will be removing the SLOG device completely from the system while the system is powered on. No other system components will be changed during this test; the only variable is the SLOG device.

Results:

Disks:

All disks are listed except the SLOG device.

Pool Status:

The pool shows Degraded.

ZPool Status in shell:

The test pool shows degraded, with only the SLOG device missing.

Conclusion:

The direct removal of the SLOG device from an active TrueNAS system resulted in immediate changes to the ZFS pool status. Unlike the first test where the system was powered off, removing the SLOG device from an operational system led to the pool being marked as ‘Degraded’. This state reflects the absence of the SLOG device but also indicates that the pool is still functional, albeit without the benefits provided by the SLOG.

Fix:

  • Access TrueNAS Web Interface.

  • Go to Storage > Pools > Status.

  • Find your pool and select the SLOG device you wish to replace or remove.

  • To remove:

    • Select the SLOG device and choose the option to remove it.
  • To replace:

    • Select the SLOG device and choose the option to replace it.
    • Follow the prompts to add the new SLOG device.

Post-Operations Checks:

  • Verify Pool Status: After replacing the SLOG, run zpool status again to ensure the pool is healthy.

  • Monitor Performance: Check if the performance meets your expectations.

  • Data Integrity Check: Consider running a scrub to verify data integrity: zpool scrub <pool_name>.
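
If you prefer the shell to the GUI for this step, the rough equivalent is the sketch below; tank and the gptid names are placeholders, so take the real identifiers from zpool status.

    # Remove the missing/failed SLOG from the degraded pool
    zpool remove tank gptid/xxxx

    # Or add a replacement log device instead
    zpool add tank log gptid/yyyy

    # Confirm the pool returns to ONLINE
    zpool status tank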

Test 3: Removal of SLOG device while the system is powered off, with DIP switch 8 in the up position (away from the motherboard).

Description:

In this test I will be removing the SLOG device completely from the system. The system still has AC power to the board but is powered down, with DIP switch 8 in the up position (away from the motherboard). No other system components will be changed during this test; the only variable is the SLOG device. After removing the SLOG device, we will power the system back on.

Results:

Disks:

All disks are listed except the SLOG device.

Pool Status:

The pool shows offline and not in a degraded state.

ZPool Status in shell:

Only the boot pool shows up.

Conclusion:

Removing the SLOG device while the system is powered off, with DIP switch 8 in the up position (away from the motherboard), leads to the same result as Test 1: the ZFS pool comes up offline (‘hard down’) and is inaccessible through the normal GUI until it is force-imported from the shell and the necessary adjustments are made in the TrueNAS interface, after which it can be restored to a healthy state.

Fix:

[See the Test 1 Fix above.]


I did some more testing with Scale. It seems to be functioning as expected in Scale.

TrueNAS SCALE - SLOG Failure Testing

Pre-Setup:

This is a test environment; no data has been written to these drives. The purpose of this test is to check whether the pool survives if the SLOG device (RMS-200) fails. When I refer to the SLOG device, I am talking about the Radian RMS-200.

Platform: Generic

Version: TrueNAS-SCALE-23.10.2

CPU: Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz

Slog Device: Radian RMS-200 rev04

Drive Layout: 12 x Mirrored VDEV
4TB - Z1Z8AAPK 4TB - Z1Z8AH2N 4TB - Z1Z5Z66N 4TB - Z1Z907ZS
4TB - Z1ZAP2F5 4TB - Z1ZARRDN 4TB - Z1ZARQY2 4TB - Z1ZARRWS
4TB - Z1ZAT5N8 4TB - Z1ZARFSA 4TB - Z1ZASLFE 4TB - Z1ZAVJD9
4TB - Z1ZASMKQ 4TB - Z1ZART09 4TB - Z1ZAVJAH 4TB - Z1ZARQ6A
4TB - Z1ZARSJM 4TB - Z1ZART5R 4TB - Z1ZARRKL 4TB - Z1ZASM9Y
4TB - Z1ZARQKG 4TB - Z1ZARR9G 4TB - Z1ZARS4R 4TB - Z1ZAL92Y

Pool Status Before Test:

ZPool Status in shell:

Test 1: Removal of slog device while system is powered off.

Description:

In this test I will be removing the SLOG device completely from the system. The system still has AC power to the board but is powered down. No other system components will be changed during this test; the only variable is the SLOG device. After removing the SLOG device, we will power the system back on.

Results:

Disks:

All disks are listed except the SLOG device.

Pool Status:

The pool shows offline and not in a degraded state.

ZPool Status in shell:

Only the boot pool shows up.

Conclusion:

Fix:

  • Access TrueNAS Web Interface.

  • Go to Storage > Manage devices

  • Select the SLOG device you wish to replace or remove.

  • To remove:

    • Select the SLOG device and choose the option to remove it.
  • To replace:

    • Select the SLOG device and choose the option to replace it.
    • Follow the prompts to add the new SLOG device.

Post-Operations Checks:

  • Verify Pool Status: After replacing the SLOG, run zpool status again to ensure the pool is healthy.
  • Monitor Performance: Check if the performance meets your expectations.
  • Data Integrity Check: Consider running a scrub to verify data integrity: zpool scrub <pool_name>.
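
The same shell checks apply on SCALE; a short sketch, with tank as a placeholder pool name:

    # Confirm the pool and all vdevs are healthy
    zpool status -v tank

    # Start a scrub; re-run zpool status later to watch its progress
    zpool scrub tank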

Removing PCIe cards from a powered-on system is not really recommended…

I’m unsure what you’re trying to test, because all of this is perfectly predictable and documented:
If the SLOG is removed while the pool is off, on reboot ZFS cannot check whether the SLOG holds dirty data which should be committed to the pool => OFFLINE, data integrity not guaranteed.
If the SLOG dies “live”, ZFS still has all dirty data in RAM, and will commit it to the pool as required => DEGRADED, but still guaranteeing data integrity.


I.e., ZFS requires the administrator to re-online the pool if the SLOG “dies” while the pool is offline, to prevent potential data loss that could occur, for example, if the admin forgot to reinstall the SLOG device (like you did).

This seems to be working 100% as designed, expected and desired.

But great testing. How did you remove the RMS-200 while the system was on?


If you require a SLOG, you do not want it to fail.

Thanks for the comment. It is exactly how it sounds: I pull the card out of the system while it’s running. Also, like in Test 1, if the SLOG dies during a power outage and power comes back up, it kind of stinks that the pool will be hard down until I can get into it. Either way, this is all testing before I get it set up at home. The RMS-200 has capacitors so it can store the data in flash. I just wish it came back on in a degraded state. You live, you learn.

Just weird how we see different results in SCALE. I guess I’ll be switching to it. It’s probably documented; I’ll do some more digging.

Everything will eventually fail…

You could mirror it.
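
For example, a mirrored log vdev keeps the pool’s SLOG intact if one device dies. A sketch, where tank and the nvd0/nvd1 device names are placeholders for your pool and the two log devices:

    # Add two devices as a mirrored log vdev
    zpool add tank log mirror nvd0 nvd1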

This is intentional by design in ZFS due to the atomic nature of the filesystem. ZFS doesn’t like “maybe” when it comes to data, so if it’s looking for a log device to verify that there aren’t any pending transactions to the pool, it will refuse to proceed and potentially discard that missing data without administrative intervention.

When you performed the SLOG removal on SCALE, I assume you still required a drop to the command line to issue the zpool import -m (which, by the way, stands for “missing log device”)?


But hold on, shouldn’t ZFS automagically switch to using the in-pool ZIL? There’s no serious data loss concern, only a performance concern, but I can’t imagine that being addressed by throwing the pool offline wholesale.

Is it possible that the PCIe controller was reset as a result of the experiment, resulting in the transient loss of connection to the pool disks and thus causing it to go offline? PCIe cards really are not designed for hot plugging, and that has leaked into lots of PCIe code, only really getting addressed when Thunderbolt gained traction on PCs and NVMe came along.

That is to say, it’s not crazy at all to imagine Linux having an easier time dealing with weird behavior on the PCIe buses than FreeBSD. What the real impact of a real failure would be is harder to test without a test device that can simulate realistic failure modes.


Specifically for HOT-removing a PCIe card, this is not something that works in FreeBSD. Even form factors designed for this don’t work. If you hot-plug a U.2 NVMe drive on CORE, you will have to reboot before you can use it. That is not the case with SCALE.

If OP hot-removed the SLOG on CORE and the pool went offline, I’d suspect (lack of) PCIe hot plug is a variable. They also may have had writes that weren’t fully committed, so ZFS played it safe. Without reproducing it, I’m not sure which mattered.


If it’s done live yes, because the data written to the SLOG will still be in RAM. PCIe hotplug is also an issue as indicated by @NickF1227 but FreeBSD/CORE does handle the hot-removal (as a PCIe device failure/offline) correctly.

If it’s done with the system off, ZFS doesn’t know the full state of an unimported pool and thus errs on the side of caution by responding “I don’t know what’s potentially in this missing log device, so I’m going to tell you when the last successful transaction was based on the uberblock. Up to you if you want to make this permanent.”

In the case of a clean shutdown with an exported pool, the amount of data changed or pending should be “nothing” because of the ZFS semantics on export (flush txgs, final sync, mark member disks as “exported”) however see above re: ZFS not liking “maybe” and “should” as concepts.
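
In practice, if you know the SLOG is coming out while the box is off, a safer sequence (a sketch; tank and gptid/xxxx are placeholders) is to remove the log vdev from the pool first, so there is no “maybe” left for ZFS to resolve at the next import:

    # Remove the log vdev before shutting down and pulling the card
    zpool remove tank gptid/xxxx

    # Confirm it is gone, then power the system off as usual
    zpool status tank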

TL;DR ZFS was written by the Sith and deals in absolutes.


Thanks for clarifying, I meant what I said in the context of unplugging it and then plugging it back in :stuck_out_tongue:

@ericloewe @HoneyBadger When I use a hard drive as the log device, it does show degraded and switches to the in-pool ZIL. Just confused by these mixed results.

Here is the test I did to confirm.

Test 4: Removal of (hard drive) SLOG device while system is powered off.

Description:

In this test I will be removing the hard drive SLOG device completely from the system. The system still has AC power to the board but is powered down. No other system components will be changed during this test; the only variable is the hard drive SLOG device. After removing the SLOG device, we will power the system back on.

Results:

Disks:

All disks are listed except the hard drive SLOG device.

Pool Status:

The pool shows degraded. As expected, but this is a different result from Test 1.

ZPool Status in shell:

The pool shows up as degraded after startup without the SLOG…
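
From that degraded state, recovery is straightforward; a sketch, with tank and gptid/xxxx as placeholders for the pool and the log device reported by zpool status:

    # If the drive is physically back in the system, bring it online again
    zpool online tank gptid/xxxx

    # Or permanently drop the missing log vdev
    zpool remove tank gptid/xxxx

    # Confirm the pool returns to ONLINE
    zpool status tank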