Integrate "pool checkpoints" into the GUI, middleware, and automation

ZFS has a feature called “pool checkpoints”.

This is a simple yet powerful feature that can prevent data loss and safeguard against ruining an entire pool.

TrueNAS, which leverages the robustness and power of ZFS, does not provide a way to manage, monitor, or automate pool checkpoints.

While pool checkpoints share similarities with dataset “snapshots”, their scope and usage are very different and highly specific.

Pool checkpoints have some important caveats and they are not for everyone.

:point_right: Read this topic to understand what pool checkpoints are, why they matter, how to use them, and the caveats to consider for your use-case.

For what it’s worth, I’m currently using them with a custom Cron Job that runs a simple command.

Here is my Cron Job:

Cron Job
Description: Daily checkpoint for my pool
Command: zpool checkpoint -d -w mypool; zpool checkpoint mypool
Run As: root
Schedule: 0 3 * * * (03:00 every day)


What this does is create a new checkpoint every day at 03:00. This means I will never sit on a checkpoint for more than 24 hours, while hopefully giving me enough time to “use” the checkpoint if something goes wrong, as long as I “rewind” the pool or disable the Cron Job before the clock next hits 03:00. (Otherwise, the Cron Job will replace the good checkpoint with a new one after the mistake was made.)

The -w flag is important for the Cron Job, since you want the first half of the command to “exit” only after it has finished discarding the old checkpoint. The semicolon is also important, since you want the second half to run even if the first half “fails” because “no checkpoint currently exists”.
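
For reference, here is the same logic written out as a small script rather than a one-liner. This is only a sketch of what the Cron Job command already does, using the same pool name; the 2>/dev/null and || true parts are my additions to silence the harmless error when no checkpoint exists yet.

    #!/bin/sh
    # Discard any existing checkpoint; -w waits until the discard finishes.
    # Ignore the error that occurs when no checkpoint exists yet.
    zpool checkpoint -d -w mypool 2>/dev/null || true

    # Take a fresh checkpoint.
    zpool checkpoint mypool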


How should TrueNAS integrate “pool checkpoints” into its product?

This is not easy to answer.

It needs to be available in the GUI for manual operation, management, and review.

It needs to have “automatic triggers” that create a fresh checkpoint immediately before a pool-modification task.

It needs to be available as an automated task in the same way that snapshots can be automated.

It needs to be available in the Pool Import wizard.

It needs to be considered for other parts of the middleware that could error or fail in the presence of an existing checkpoint.


My non-dev proposition

I am not a developer, so I can only explain how checkpoints should be implemented from an end-user’s perspective.


Manual Operation
Add a button that allows the user to manually manage and view a pool’s checkpoint.

This can be placed inside a pool’s page in the GUI.

It can be used whether or not an automated checkpoint exists.

Clicking this button will bring you to a page that:

  • Shows whether a checkpoint currently exists
  • Shows how much space the checkpoint is consuming
  • Shows how old the checkpoint is
  • Provides buttons to create or discard a checkpoint
  • Displays disclaimers about the pool if a checkpoint exists

A warning should pop up when taking a new checkpoint: “The current checkpoint will be discarded and replaced with the new one. Only one checkpoint can exist in a pool at any time.”

Note to TrueNAS devs: Information about a checkpoint, such as creation time and size, can be extracted with zpool status <pool> | grep checkpoint and zpool get checkpoint <pool>.
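
As a rough illustration (the exact formatting may differ between OpenZFS releases), those two commands return something along these lines when a checkpoint exists:

    # zpool status mypool | grep checkpoint
      checkpoint: created Sat Jan  6 03:00:01 2024, consumes 1.25G

    # zpool get checkpoint mypool
    NAME    PROPERTY    VALUE    SOURCE
    mypool  checkpoint  1.25G    -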


Automatic Triggers
Borrowed from @dan’s idea in post #2.

Checkpoints should be automatically triggered immediately before a pool-modification action, such as when adding a new vdev.

They should be automatically triggered immediately before “upgrading” a pool or enabling new pool features.

They should be automatically triggered immediately before destroying an entire dataset.

They should be automatically triggered immediately before rolling back a dataset to a snapshot. (Snapshot rollbacks are destructive operations that cannot be reversed.)

There needs to be a pruning policy in place so that a checkpoint does not become “stale” or lose its usefulness. Maybe a maximum one-week life? This will allow the user enough time to rewind their pool in an emergency, without allowing the checkpoint to become too “stale”.
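
To illustrate what such a pruning policy might look like under the hood, here is a hedged sketch (not anything TrueNAS ships today) that discards a checkpoint once it is more than a week old. It assumes the “checkpoint: created …, consumes …” line that zpool status prints and a GNU-style date command, both of which would need to be verified on the target platform.

    #!/bin/sh
    # Hypothetical pruning sketch: discard mypool's checkpoint once it is
    # older than seven days. Exits quietly if no checkpoint exists.
    POOL="mypool"
    MAX_AGE=$((7 * 24 * 3600))

    LINE=$(zpool status "$POOL" | grep 'checkpoint: created') || exit 0
    CREATED=$(echo "$LINE" | sed -e 's/.*created //' -e 's/, consumes.*//')
    AGE=$(( $(date +%s) - $(date -d "$CREATED" +%s) ))

    if [ "$AGE" -gt "$MAX_AGE" ]; then
        zpool checkpoint -d -w "$POOL"
    fi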

If there is no automatic pruning or “refresh” policy in place, then there should be a visual indicator that a checkpoint exists for the pool. This will leave it up to the user. (Taken from @dan in post #4.)


Automatic Schedule
Add a menu that allows a user to create a task that automatically takes a checkpoint on a schedule. (This might not be needed if “Automatic Triggers” are implemented as explained above.)

This can be its own menu or within the “Manage Checkpoint” page for each pool. (Shown above.)

The user should be able to schedule a checkpoint to be taken daily, weekly, or any custom schedule.

The page should include a recommendation of “daily” at 03:00. This allows the pool to always have a “fresh” checkpoint with enough time for the user to rewind the pool before the “good” checkpoint is overwritten the next time the task is run.

There should be a disclaimer that a pool can only have one checkpoint at any given time. It should inform the user that the task should be paused if they do not want the current checkpoint to be lost.

It needs a pause button. Pausing a checkpoint task is very important, because unlike with snapshots, a pool can only have one checkpoint at a time.

Unlike with Automatic Triggers, having a routinely “refreshed” checkpoint can create a safety net for unforeseen emergencies. (I listed some examples in the referenced threads at the end of this post.)


Considerations for TrueNAS Devs
TrueNAS cannot just add a simple button and an automatic trigger for pool checkpoints. Its code and middleware must also have “safety checks” for other pool operations, since the presence of a checkpoint is incompatible with certain actions and setups.

The documentation and tooltips must make it clear that rewinding a pool to its checkpoint will destroy everything that was saved after the checkpoint’s creation.

The documentation and tooltips must make it clear that any “hot spares” in the pool will not activate if a checkpoint exists.

If attempting to remove, modify, or expand a vdev, or resilver a drive, the operation should be greyed out with a message that explains it cannot be done until the checkpoint is removed.
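
In shell terms, the kind of guard such operations would need is simple: the pool’s checkpoint property reads “-” when no checkpoint exists, so a pre-flight check (a sketch, not actual middleware code) could look like this:

    # Refuse a pool-modification task while a checkpoint exists.
    if [ "$(zpool get -H -o value checkpoint mypool)" != "-" ]; then
        echo "mypool has an active checkpoint; discard it before modifying the pool." >&2
        exit 1
    fi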

The Pool Import wizard should include an option to rewind to a checkpoint. The user must accept a clear warning before continuing, since rewinding a pool can and will destroy all data that is newer than the checkpoint. Selecting the “Rewind to checkpoint” option should also offer a “readonly” option if the user does not want to commit to a rollback. (This is useful for recovery purposes.)
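
For reference, the commands such a wizard would wrap already exist in ZFS; a manual rewind, and the read-only variant for recovery, look roughly like this (using mypool as an example):

    # Destructive: import the pool and rewind it to its checkpointed state.
    zpool import --rewind-to-checkpoint mypool

    # Non-destructive alternative: import read-only at the checkpointed state,
    # e.g. to copy data off before deciding whether to commit to a rewind.
    zpool import -o readonly=on --rewind-to-checkpoint mypool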


Is this feature request good?

Yes

THEN VOTE FOR THIS FEATURE RIGHT NOW.:index_pointing_at_the_viewer:

No

Of course it’s not good. It’s great!


If pool checkpoints are implemented, this feature needs to be highlighted and advertised by TrueNAS since it can save a lot of new users from permanently losing precious data or messing up their pools.[1][2][3][4][5][6][7][8]

There is a lot that I did not write in this feature request because it requires more involved discussions. I’ll wait for feedback and questions.


  1. How to recover from rm -rf /* ? (lost everything)

  2. Accidentally added a mirror to a pool

  3. Removing a device i shouldn't have added to pool

  4. ZFS ZPool Import Fails

  5. Feature Request: Allow Safe Removal of Unused Dedup VDEVs (OpenZFS #17194)

  6. Ran replication task on wrong target - is a recovery even possible

  7. Issue importing ZFS Pool

  8. Scale Created Special Meta Device Non-Mirrored


Agreed. This is an important safety belt that ZFS offers, and it’s baffling that TrueNAS doesn’t expose it.

I don’t know that I like the idea of automatic checkpoints on a schedule–that sounds more like something that should be handled by snapshots.

I’d add that it should come up when you’re modifying a pool–at a minimum, when you’re adding a vdev. I’m tempted to think a checkpoint should just be automatically created at that time, but at least offer the user an option to create one.


This is actually more practical than a scheduled task, since those are the times when checkpoints are most needed.

However, if you read the linked thread, I gave examples of data loss that snapshots cannot protect against, whereas an automatic schedule could save someone by always maintaining a “fresh” checkpoint every day. (Of course, a PEBKAC error is still the fault of the user.)


I agree that it should automatically be triggered immediately before a pool modification task, such as adding a vdev to the pool. Wouldn’t it also make sense to trigger it before destroying an entire dataset?

Without a scheduled task, how should pruning be handled? Checkpoints are most useful when they are “fresh”. Stale checkpoints can also hold onto space that would otherwise be freed if pruned or retaken.

Let’s say someone adds a special vdev. A checkpoint is taken immediately before the vdev is added. After a couple days, they are happy and do not need to rewind their pool. If they forget about the checkpoint and they have no automatic schedule, should there be a default expiration time for any existing checkpoint? Maybe one week?

I’d think so, and for the reason you state–snapshots belong to datasets, so destroying a dataset also gets rid of the snapshots.

I’d vote for “manually,” but with a pretty prominent GUI warning that a checkpoint exists–if it were up to me, I’d put it as a banner across the top of the page.


I edited the request to include this. :+1:

EDIT: FYI, I’m not a dev, so I cannot speak on how difficult this would be to implement. I am aware that it’s not a simple feat. Different actions and components in the middleware must pass “checks”.

Just one example: If a checkpoint exists, you cannot remove a vdev from a pool. (I’d argue this is a good thing, since it’s better to “inconvenience” the user for a couple minutes, rather than not provide any safety rails at all, which could result in regrets or data loss. Causing someone to pause before proceeding with an irreversible action is a sane approach.)

Seems like a very important feature. To be honest, this is the first time I have read about checkpoints. Are they a new ZFS feature?

In terms of human history, the year 2016 is “recent”. I guess you could say they’re “new”. :wink:


I think “great” is an understatement for this feature request - extremely useful and well thought through.


Thank you.