Apps fail to start after restoring pool from backup

stk · April 26, 2024, 12:07am

Got this after I backed up my pool and restored to a reconfigured pool with the same name Apps failed to start:
FAILED:[EFAULT] Command mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet failed (code 1):

Full message:
[EFAULT] Command mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet failed (code 1): filesystem ‘main/ix-applications/k3s/kubelet’ cannot be mounted using ‘mount’. Use ‘zfs set mountpoint=legacy’ or ‘zfs mount main/ix-applications/k3s/kubelet’. See zfs(8) for more information.

Apps won’t start.

If you type
k3s server:
you get

E0425 16:55:59.436310 1473438 kubelet.go:1466] "Failed to start ContainerManager" err="invalid kernel flag: vm/overcommit_memory, expected value: 1, actual value: 0"

If I type this into CLI:

mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet

it returns:

root@truenas[~stk]# mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet
filesystem 'main/ix-applications/k3s/kubelet' cannot be mounted using 'mount'.
Use 'zfs set mountpoint=legacy' or 'zfs mount main/ix-applications/k3s/kubelet'.
See zfs(8) for more information.
root@truenas[~stk]#

How do I fix this?

NickF1227 · April 26, 2024, 12:11am

What did you do?

Can you take a look here:

Make sure a pool is set:

stk · April 26, 2024, 1:18am

Yup. Tried that. No go.

I filed a bug just now: NAS-128555 which has screenshots of various things I tried.

Here’s the full description of how I got there and what I’ve tried. Sorry it’s long.

You can just start at step 34.

I decided to use dRAID1 instead of RAIDZ1 due to higher performance and faster rebuild. My disks are huge so a Z1 rebuild would take a long time. The price I pay is giving up 50% of the capacity, so I get 24 GB from my 4 12 TB disks.
2. add nvme card to slot near 4 eth ports and chelsio 10G eth card to supermicro opposite the HBA. Put 1G eth cable into IPMI slot. Route the 10G SPI+ DAC cable to the UDM Pro 10G port. Note: SuperMicro documentation is terrible. To remove the riser, you have to lift the two black tabs in the back, and then you can pull straight up on the riser. There are NO screws you need to remove. So now 10G is the only way in/out.
3. Config the NVME disk in TrueNAS as a new pool SSD
4. insert the 2 more disks into the supermicro, but don’t create a pool yet. Make sure they are recognized. They were. Flawless. To pre-test the newly added disks, see: Hard Drive Burn-in Testing | TrueNAS Community. Just make sure you are NOT specifying the old disks… the system assigns different drive letters on each boot.
6. save TrueNAS config (just to be safe)
7. Make sure there is a periodic snapshot task that snaps Main recursively. You can make the frequency once a week for example. You need this for the next step. I used the custom setting so that the first one would be 5 minutes after I created the task. This meakes the next step easier since I will have an up to date snapshot. You must check “Allow taking empty snapshots.” The reason you need to check this is that the full filesystem replication makes sure every snapshot is there. If you recently snapshotted a dataset which didn’t change, this check will fail to create a snapshot and the sanity check will fail so you won’t be able to replicate the filesystem.
8. Use the GUI to copy everything in the main pool to SSD pool by using the ADVANCED REPLICATION button in Data Protection>Replication Tasks. Transport is LOCAL. “Full Filesystem Replication” is checked. Source and dest are the pool names: Main and SSD. Read only policy is set to IGNORE. You do NOT want to set this READ only. Replication schedule: leave at automatically. Pick the replication task you just configured for the replication task. Set retention to be same as source. Leave everything else at the default. Automatically will start the replication job right after the snapshot job above finishes.
9. While it is doing a full filesystem replica, it will unmount the SSD pool so if you ssh to the system, /mnt/SSD will be gone. So if you look at the Datasets page and refresh, it will give you an error about not being able to load quotas from SSD.
10. If you click on the Running button, you can see the progress. You can also go to the Datasets page and expand the SSD pool and see that it is populating and that the sizes match up with the originals. You can also click on the Jobs icon in the upper right corner of the screen (left of the bells); that screen updates every 30 seconds to show you the total amount transferred.
12. Migrate system dataset from main pool to SSD pool (so I have a system when I disconnect) using System >Advanced> System Dataset, then select the SSD pool and save.
13. Use GUI (Storage>Export/Disconnect) to disconnect MAIN pool (do not disconnect the SSD pool). If you get a warning like “This pool contains the system dataset that stores critical data like debugging core files, encryption keys for pools…” then you picked the wrong pool. Select to delete the data but do NOT delete the config information. The config info is associated with the pool name, not the pool. You will get a warning with a list of all the services that will be “disrupted.” That’s to be expected.
14. ![[Pasted image 20240424231154.png]]

You will need to type “main” below the checkboxes to confirm the destroy and disconnect… Make sure the middle selection is NOT checked. You want it just like above.
The export will take a little while.
When I first tried it, I got: "[EFAULT] cannot unmount '/mnt/main/user': pool or dataset is busy" and libzfs.ZFSException: cannot unmount '/mnt/main/user': pool or dataset is busy"
I tried it again and it said Kubernetes is still referencing main. So I did the same process a second time and it failed: [EFAULT] cannot unmount '/mnt/main/user': pool or dataset is busy. This is because I was ssh’ed into the truenas system and in my home directory.
So i killed my ssh and tried the export a third time. Remember to adjust the options as above each time!! The system will NOT default to this (it defaults to the last 2 options)
It worked! “Successfully exported/disconnected main. All data on that pool was destroyed.”
Storage now shows I have 4 unassigned disks!!! Perfecto!
Now click that “Add to pool” button! ![[Pasted image 20240424232451.png]]
Use GUI (“New Pool”) to create a brand new main pool with all four disks with same name as my original main pool. Select Data>Layout of dRAID1 since I have large 12 TB disk drives and I want performance. Use the exact same name as before (including correct case). Data devices should be 2 (which uses 3 disks) and hot spare=1 and so children=4, vdev=1. t will show in the summary as: “Data: 1 × DRAID1 | 4 × 10.91 TiB (HDD)”. Then it says: “In order for dRAID to overweight its benefits over RaidZ the minimum recommended number of disks per dRAID vdev is 10.” I’m not buying it. The final config looks like this which was really surprising to as I thought it would reserve one physical spare disk which would only be provisioned if one of the drives died. Can someone comment on this?
![[Pasted image 20240425002245.png]]
I just check the upgrade thread and the advice was “dRAID only make sense if you have multiple raidz# vdevs with hot spares, for more than 20 drives in total. And flexibility is even worse than raidz#.” OK, so I can redo the restore since I have the original backup, no problem.
At this point, look at your SMB shares. They are preserved!!! So are the snapshot tasks, replication tasks, etc. in the Data Replication page. So looking great!!! It’s all downhill from here!
Just for safety, let’s export the config at this point. (at 12:25am on 4/25)
This is the time to periodically snapshot the SSD. Do the snapshot process we did above; a periodic snapshot of the SSD pool just like we did before. This sets you up for the transfer back to the main pool of the data that was there originally. So recursive, custom start time (weekly starting in a 15 minutes) to give us time to create the Replication Task that will start automatically after the SSD full recursive snapshot.
Use GUI again to copy data over from SSD pool to the new main pool. So same steps as before. We need to use the SSD snapshot and reverse the source and destination and use the SSD snapshots. So I just loaded the previous replication task and started with that, but when I went to advanced, it blanked it out! So I have to re enter everything from scratch!
It failed this time. I got the message: middlewared.service_exception.ValidationErrors: [EINVAL] replication_create.source_datasets: Item#0 is not valid per list types: [dataset] Empty value not allowed which sound like it thinks I’m trying to replicate from an empty source. I’m going to bring up the menu again, but this time I’m not going to try to load from the OLD job, but enter everything from scratch. I think this is a big in retrieving the old replication task job. So bring up the dialog and go straight to advanced and fill it out from scratch. Do not try to edit the old job.
I was right!!! Whew, what a relief.
Hit reload to see your SSD to main Replication Task. It’s status will be pending. It will run right after the SSD recursive snapshot is done. Hit reload and the periodic snapshot tasks will be “Finished” and the “SSD to main” replication task will be running.
You can monitor progress as before.
After the replication finishes, it’s time to move the system dataset to the new main pool using GUI using the process used above.
It says “Apps Service Stopped”. restart apps Service. Not sure how to do that. it isn’t in System>Services. No clue. Hopefully a reboot will fix it.
It wasn’t serving any SMB shares (the switch was flipped to off for all the shares) so I had to flip all the shares to enabled, and that did the trick!
Done.
Now disable that replication task (from SSD → main). And possibly from main → SSD if you want to use the SSD for a different use.
Reboot just to make sure there are no issues. There shouldn’t be, but I’d rather find out now than later.
Reboot didn’t fix k3s not starting. E0425 03:13:06.366461 52694 kubelet.go:1466] “Failed to start ContainerManager” err=“invalid kernel flag: vm/overcommit_memory, expected value: 1, actual value: 0” that was the final line in a CLI k3s start command. Ugh.
Tried switching apps to use SSD for the app pool. Got message: “mount -t zfs SSD/ix-applications/k3s/kubelet /var/lib/kubelet failed (code 1): filesystem ‘SSD/ix-applications/k3s/kubelet’ cannot be mounted using ‘mount’. Use ‘zfs set mountpoint=legacy’ or ‘zfs mount SSD/ix-applications/k3s/kubelet’.” And You can’t select the migrate apps to a new pool checkbox because the destination has to be empty.
Now switch to main again. Now I get the error: [EFAULT] Command mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet failed (code 1): filesystem 'main/ix-applications/k3s/kubelet' cannot be mounted using 'mount'. Use 'zfs set mountpoint=legacy' or 'zfs mount main/ix-applications/k3s/kubelet'. See zfs(8) for more information. Reason: already mounted! So lets umount and try again from GUI to set pool to main. No error this time but app service still isn’t running “Error in Apps Service”
It appears the error message about the overcommit has happened before. Now reading: “[RESOLVED] Can't start k3s after replacing disk | TrueNAS Community” and Installed Apps and Docker Container gone after reboot; Kubernetes is not running | TrueNAS Community. so I unset pool, then I set pool, but got error on mount of kubelet (since already mounted). So I manually unmounted kubelet and tried again. No error from set pool but still getting error in apps service.
Try this: Move system to SSD. Try export of main (just check the final checkbox, not the other) and import pool (there will be one choice which is what you just exported) and see if that changes anything (suggested in the first link). Nope.
Error says “Application(s) have failed to start: Command ‘(‘mount’, ‘-t’, ‘zfs’, ‘main/ix-applications/k3s/kubelet’, ‘/var/lib/kubelet’)’ returned non-zero exit status 1.”
So I tried it and it failed: mount -t zfs main/ix-applications/k3s/kubelet /var/lib/kubelet
The key error is: [EFAULT] Command mount -t zfs SSD/ix-applications/k3s/kubelet /var/lib/kubelet failed (code 1): filesystem ‘SSD/ix-applications/k3s/kubelet’ cannot be mounted using ‘mount’. Use ‘zfs set mountpoint=legacy’ or ‘zfs mount SSD/ix-applications/k3s/kubelet’. See zfs(8) for more information. Same with setting app pool to main