I think you’re perhaps misunderstanding the use and role of a checkpoint.
When you create a checkpoint, ideally you should already know when it will be discarded, and therefore automate that discard if possible. In my opinion there are two different occasions where checkpoints help:
- You plan to undertake some work on your TrueNAS and you are concerned things could go south, so you create a checkpoint first.
- Something unplanned went bad, and wouldn’t it be nice IF I had a checkpoint.
1 is a manual thing; 2 is an automated thing.
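For case 1, the manual flow is only a couple of commands (the pool name here is just an example):

# Take the checkpoint right before the risky work
zpool checkpoint Pool1

# ...do the upgrade / vdev change / whatever...

# If it all went fine, discard the checkpoint so it stops holding on to freed space
zpool checkpoint -d Pool1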
Yes. Yes it would be nice.
IMHO it’s probably the best feature request to date (that doesn’t exist yet).
I’m working on it…
For what it’s worth, I wrote a feature request twice.
The first one was “closed” on Jira because we’re told to use the forums for our feature requests.
The second one is this thread, which is not in the Feature Requests category.
I don’t want to simply make a request. I want to lay out how this can be implemented to work with TrueNAS. To simply manage checkpoints in the GUI is not comprehensive enough for a NAS appliance.
Look forward to supporting it.
I think that Constantin has a different idea of the use and role of a checkpoint, and that you probably should begin by writing out YOUR idea in detail.
For myself I can see the case for manually creating a checkpoint immediately before a change, and I can see a use for the GUI nagging me if this checkpoint is still in existence a few hours after its creation. I do not see any use for a persistent checkpoint, and I do not want automatically managed checkpoints pointing to an unknown, arbitrary state of the pool and potentially preventing some pool operations.
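If iX never builds that nag into the GUI, a rough version can be scripted today. This is only a sketch: the pool name and age threshold are placeholders, and it assumes GNU date (as on SCALE) can parse the timestamp that zpool status prints.

#!/bin/sh
# Sketch: warn if a checkpoint on the pool is more than a few hours old.
# POOL and MAX_AGE_HOURS are placeholder values.
POOL=Pool1
MAX_AGE_HOURS=6

# zpool status prints a line like:
#   checkpoint: created Tue Dec  3 16:51:58 2024, consumes 3.06M
line=$(zpool status "$POOL" | grep 'checkpoint: created') || exit 0

# Keep only the date between "created " and ", consumes"
created=$(echo "$line" | sed 's/.*created \(.*\), consumes.*/\1/')

# Assumes GNU date to convert that timestamp to epoch seconds
age_hours=$(( ( $(date +%s) - $(date -d "$created" +%s) ) / 3600 ))

if [ "$age_hours" -ge "$MAX_AGE_HOURS" ]; then
    echo "Checkpoint on $POOL is ${age_hours}h old - discard it or deal with it." >&2
fi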
In the words of Mr. Checkpoint himself (my emphasis):
Indeed. See my idea above.
In a lot of the issues we have seen on the forums where a checkpoint would have been helpful, a manual checkpoint wouldn’t have helped, because the user wouldn’t have created one before their issue. I can’t see how a weekly automated checkpoint would hurt anyone, but I can see how it could potentially save the day.
I’m a bit old fashioned and my number one rule is don’t lose anyone’s data. After that everything comes second.
This would make the user pause and stop (before doing something like removing a vdev). It takes only a moment to remove the checkpoint, and then they can proceed if that’s really what they want to do.
This is preferable to having no safety net at all, wouldn’t you agree?
“Hey, I know you permanently lost all your data, and you have no checkpoint to rewind back to, but isn’t it nice that there was no week-old checkpoint that could have possibly inconvenienced you for a few minutes if you had decided to remove vdevs or something?”
@winnielinnie Before I start this post, both servers are on Dragonfish-24.04.2.5 and each server has only used about 28-30% of the space on Pool1. Server 2 backs up Server 1 via an Rsync task on Server 1 run from cron each night. Each server’s specs are listed in my signature.
I ran the checkpoint command on the older secondary server (Server2 - 69 TB free) and it behaves as it should: creation was almost immediate and the size does not increase. This is the secondary server listed in my signature. These status checks were taken over a 15 minute period, with a final check at 30 minutes for posting purposes, and show no change in the size value:
root@owen:/home/admin# zpool get checkpoint Pool1
NAME PROPERTY VALUE SOURCE
Pool1 checkpoint - -
root@owen:/home/admin# zpool checkpoint Pool1
root@owen:/home/admin# zpool get checkpoint Pool1
NAME PROPERTY VALUE SOURCE
Pool1 checkpoint 3.06M -
root@owen:/home/admin# zpool get checkpoint Pool1
NAME PROPERTY VALUE SOURCE
Pool1 checkpoint 3.06M -
root@owen:/home/admin# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:51:58 2024, consumes 3.06M
root@owen:/home/admin# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:51:58 2024, consumes 3.06M
after 30 min check:
root@owen:/home/admin# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:51:58 2024, consumes 3.06M
The other, newer server (Server1 - 59 TB free) is the primary server in my signature and acts very differently from the older one. These readings were taken over a 15 minute period, with a final check at 30 minutes for posting. As you can see, the checkpoint size keeps climbing, slowly but steadily.
root@neo[/home/admin]# zpool get checkpoint Pool1
NAME PROPERTY VALUE SOURCE
Pool1 checkpoint 8.88M -
root@neo[/home/admin]# zpool get checkpoint Pool1
NAME PROPERTY VALUE SOURCE
Pool1 checkpoint 9.81M -
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 12.9M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 14.4M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 16.7M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 17.0M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 17.6M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 17.7M
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 17.8M
after 30min check:
root@neo[/home/admin]# zpool status Pool1 | grep checkpoint
checkpoint: created Tue Dec 3 16:59:31 2024, consumes 18.3M
As you can see, there is something fishy going on with the newer Server1. While I’m only using 30% of the pool space, if the checkpoint does not stop doing whatever it is doing, it could eventually eat all of the free pool space if a person did not notice it and just left it there for a while.
If this is something that can happen, then people could be depending upon an incomplete/broken checkpoint, which would not be good if it was ever needed.
I’m going to let both be for a while and see if Server1 ever finishes whatever it is doing.
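In case anyone wants to track it without re-running the command by hand, a simple loop like this (pool name and interval are arbitrary) will log the size over time:

# Log the checkpoint size with a timestamp once a minute; Ctrl-C to stop
while true; do
    printf '%s  %s\n' "$(date '+%F %T')" "$(zpool get -H -o value checkpoint Pool1)"
    sleep 60
done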
Does it contain your “System Dataset”?
Checkpoints should only ever be temporary. How long “temporary” is depends on your personal situation: it could be an hour, a day, a week, a month or perhaps even longer, depending on your usage and available space.
Checkpoints growing in size is perfectly normal as data diverges from the checkpoint itself. Once the checkpoint is discarded you go back to zero consumption and start the process again with a new checkpoint.
@Johnny_Fartpants That is all well and good, and now that I have been exposed to the concept I understand what a checkpoint is, what it is for, and that it is completely temporary, designed only as a safety net. But if the checkpoint never completes and for some reason is constantly working to update itself, then it is not a checkpoint in time, is it?
@winnielinnie I never thought of that as a possible cause.
Yes. On the main system, the ix-applications dataset for the 3 apps I currently run is on Pool1, as are the APP_Configs datasets for those three apps. The main system only has 1 data pool.
On the older secondary server there are 2 data pools; the app-configs and ix-applications datasets are on Pool2, which is a different pool from the one I ran the command on (Pool1).
Would the ix-applications dataset and/or the application-configs account for the increasing space usage of the checkpoint on the main system?
As a checkpoint is supposed to be an additional layer of protection, is it safe to make and use a checkpoint if the checkpoint is always changing?
Never completes? It’s like a snapshot. It is created instantly. The “growing in size” is the same reason a snapshot consumes more space over time, as data is “destroyed”, yet remains held by the snapshot.
Same with checkpoints.
A checkpoint doesn’t change. It remains as it is, until you discard it.
I have a Cron Job that discards and creates a new checkpoint for my pools every night at 03:00.
For the pure storage pool, the checkpoint barely takes up any space between jobs.
For the pool that contains the System Dataset, it will sometimes climb to a couple hundred MiB, since TrueNAS’s System Dataset is constantly deleting data.
Because of the constant “discard then create checkpoint” every night at 03:00, it never holds onto a checkpoint for too long, nor does the checkpoint’s size consume any significant amount of space.
Of course, if I were to accidentally delete an entire dataset, then the checkpoint might “consume” a TiB of space, in which case I can rewind back to the state from 03:00 to reclaim my pool.
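For completeness, the rewind itself happens at import time, roughly like this (pool name is an example; everything written after the checkpoint is thrown away):

# The pool must be exported first (on TrueNAS you would normally export via the GUI)
zpool export mypool

# Re-import the pool at the state it had when the checkpoint was taken
zpool import --rewind-to-checkpoint mypool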
@winnielinnie I did just confirm on the secondary system’s Pool2 that the presence of the ix-applications dataset and the app-configs dataset (probably the ix-applications dataset) is the cause of the steadily increasing size of the checkpoint. In the Pool2 checkpoint test, the size went from 1.61M to 6.2M after 5 minutes.
This morning the checkpoint on the main system was still growing and was at 6.2GB. I have since deleted the checkpoint, but yes, the checkpoints never complete if the ix-applications dataset is on a data pool (I don’t see how the app config files would cause this).
I don’t understand what this means. There is no “completing” a checkpoint. Once the command exits, you’re done.
It’s like saying “My dataset’s snapshot never completes, because it keeps reporting an increase in used space over time.”
This means that your pool “deleted / destroyed” over 6 GiB worth of data since you created your checkpoint.
However, you shouldn’t keep a checkpoint more than a day or two. Maybe three days at most.
This is my Cron Job, which runs every night at 03:00:
- Name: Nightly checkpoint
- Command (single line):
zpool checkpoint -d -w mypool; zpool checkpoint mypool
- Run As User: root
- Schedule: 0 3 * * * (03:00 every night)
I intentionally chose an hour in which I am unlikely to use the NAS, and hence make a “mistake”. This also assures me that a new checkpoint will not be created until the next day, which gives me time to disable the Cron Job while I figure out what to do in an emergency situation. (If I don’t disable the Cron Job, and decide to wait a day to figure out how to deal with the emergency, then it means I will lose the “good” checkpoint, since the Cron Job will replace it the next day!)
EDIT: I would rather there be a fully supported and integrated feature in the GUI, but this is a crude way to do it for now. I might consider spacing it out to “every 2 nights”, to prevent a new checkpoint being issued if/when I need to resort to rewinding in an emergency situation.
EDIT 2: TrueNAS seems “write heavy”, even for idle systems. I blame their concept of a “System Dataset” and whatever their “ix-apps” datasets seem to do on SCALE. I hope this trend doesn’t continue when they bring “Linux jails” to SCALE. I don’t want my idle NAS to keep deleting, writing, and modifying files because “reasons”.
I would only plan to make a checkpoint if I thought I might need it, and get in the habit of doing it. You make a good point for automating the task. It is easy enough to accidentally destroy something outside of a filesystem; I have seen it happen to someone, and it took a long time to rebuild their system from backups.
Maybe I’m not using quite the right terminology. When I say the checkpoint never completes, I mean the following. On the secondary system I make a checkpoint. Its creation is nearly instant and it is time/date stamped, and at this point it is complete and is a fixed point created at that time and date, much like a snapshot would be. I agree with everyone that this checkpoint is complete, valid at the time of creation, and can be used as a safety net in an emergency and then removed once the task the checkpoint was created for has completed successfully.
As we have identified on a different system: if I create a new checkpoint on a data pool that ALSO has an ix-applications dataset within it, the checkpoint again takes mere seconds to be created and gets a time/date stamp, but then the reported checkpoint size increases over time. You can watch the increase using either of:
zpool get checkpoint mypool
zpool status mypool | grep checkpoint
The creation of a checkpoint will initially show a value for the amount of space taken, and that value, if checked over time, keeps increasing even on an idle or lightly used system. So while this checkpoint initially is like a snapshot and a fixed point, because of whatever the ix-applications dataset is reading and writing, the checkpoint is not a fixed point but rather a live, moving point that is continuously recording everything the ix-apps dataset is doing within the pool.
I am questioning the wisdom of having a checkpoint that is busy recording stuff while it is supposed to be a point in time, and that could then also fail or be corrupted during the very failure we expect it to protect us from, leaving it unusable for recovery.
My backup systems are pretty standard as far as the setup of pools, vdevs, datasets, SMB, etc., with the exception that I went with the recommendation back around Bluefin to put the ix-applications dataset on the data pool.
I did notice in top that the only thing using a significant amount of processor (15.1%) on a basically idle system (otherwise 0.3% to <2%) is the k3s-server process, and by casual viewing this seems to correspond to the increase in size of the checkpoint.
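One way to sanity-check that (just a suggestion; the interval is arbitrary) is to watch the pool’s write activity directly:

# Show per-vdev I/O every 5 seconds; steady writes on an otherwise idle
# system would line up with the checkpoint slowly growing
zpool iostat -v Pool1 5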
This is one use case, but what about the time you didn’t think you’d need it?
Let’s say you accidentally deleted a dataset or somehow someone maliciously did. Wouldn’t it be nice to be able to recover everything instantly to how it was either the day or the week before?
Most people on these forums who would have benefited from a checkpoint didn’t have one: 1. because they didn’t know checkpoints existed, and 2. because they didn’t think they needed one.
It’s a bit like wearing a seatbelt in a car. If you only ever put it on when you thought you were going to have a crash you’d never use it.