Linux Jails (containers/VMs) with Incus

The biggest thing you’re going to run into is adjusting to the new userns_idmap settings. If you follow the OP, you should be fine moving forward. I will also keep updating the OP as I upgrade versions.

Those configs in the OP are designed to get you going with Docker inside Incus. Let me know if you have any questions.
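
For reference, the settings involved are along these lines. This is only a rough sketch with a placeholder instance name (docker1), not the exact values from the OP:

incus config set docker1 security.nesting=true
incus config set docker1 security.syscalls.intercept.mknod=true
incus config set docker1 security.syscalls.intercept.setxattr=true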

Thanks for your efforts in keeping the thread updated and current. 🙂

I’m not in a hurry, but will test out one of my docker stacks properly and then migrate as needed. I’ve migrated my overall stack a lot of times, so it’s not a big deal for me anyway, plus my setup is not complicated.

I have to figure out what changed in 25.04.1, but none of my Nvidia containers will start.

Have you filed a bug report? This looks like what iX wants to know about in order to “bake” the release.

According to this, it’s an upstream bug with the Nvidia container toolkit.
I had the same problem and found that post after two hours of googling.

I rolled back because it was late and didn’t have time to diagnose it yet. I’ll try again later.

I will check and see what version I’m on. I’ll update as I have time. I got sick, so I’m not going to be doing much today. Thanks for the pointer.

I doubt that is my issue, since I’m on the previous version…

incus exec apps1 -- apt list --installed |grep nvidia-

libnvidia-container-tools/unknown,now 1.17.6-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.17.6-1 amd64 [installed,automatic]
nvidia-container-toolkit-base/unknown,now 1.17.6-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.17.6-1 amd64 [installed]

I will attempt to roll forward again tomorrow.

New version is out:

nvidia-container-toolkit-base/unknown 1.17.8-1 amd64 [upgradable from: 1.17.6-1]
nvidia-container-toolkit/unknown 1.17.8-1 amd64 [upgradable from: 1.17.6-1]

They start fine on that version.

Going to test upgrading to 25.04.1 again shortly and see what I get…
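
If anyone needs it, upgrading the toolkit inside the container is the usual apt routine; a sketch using my instance name (apps1):

incus exec apps1 -- apt update
incus exec apps1 -- apt install --only-upgrade -y nvidia-container-toolkit nvidia-container-toolkit-base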

I currently see the following pitfalls with the current GUI implementation of Incus (which can do much more; it would be very beneficial if more of its features, such as clustering, were exposed in the GUI):

  1. The hidden .ix-virt dataset makes it impossible to take snapshots in a reasonable way, which in turn rules out replication for VMs (and containers).
  2. Root volumes are added automatically and cannot be deselected, even when importing existing zvols; this creates confusing leftover entries that cannot be deleted even after the VM is created.
  3. At the same time, the imported zvols cannot be resized via the GUI. So you must choose: use a root volume and be able to increase its size via the GUI, or use the imported zvol and lose the ability to resize it (see the CLI sketch below).

As the current state is experimental, this might, and hopefully will, change for the better. Right now it is a massive regression compared to the “old” libvirt-based approach.
Is there a roadmap anywhere where we could see which features will be available and where the whole VM/container stack is headed?
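
For what it’s worth, until the GUI catches up, resizing can still be done from the shell. A rough sketch with placeholder pool, zvol, and instance names: grow an imported zvol with zfs, or set the size on the root disk device of an Incus-managed volume.

zfs set volsize=100G tank/vm-disks/myvm-disk0
incus config device set myvm root size=100GiB   # use "incus config device override" instead if root comes from a profile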

This is appropriate for a “Feature Request”; item 2 could be treated as a bug.
It looks like there is already a bug report to simplify the removal of drives:
https://ixsystems.atlassian.net/browse/NAS-135379

There is ongoing work on the Goldeye improvements… we should be in a position to clarify in July.

Nvidia containers still won’t start. Seeing the following in journalctl:

Jun 01 18:54:26 jupiter incusd[7089]: time="2025-06-01T18:54:26-04:00" level=error msg="Failed starting instance" action=start created="2025-04-17 18:49:01.443738179 +0000 UTC" ephemeral=false instance=apps1 instanceType=container project=default stateful=false used="2025-05-29 04:22:06.538419005 +0000 UTC"
Jun 01 18:54:26 jupiter incusd[7089]: time="2025-06-01T18:54:26-04:00" level=warning msg="Failed auto start instance attempt" attempt=1 err="Failed to run: /usr/libexec/incus/incusd forkstart apps1 /var/lib/incus/containers /run/incus/apps1/lxc.conf: exit status 1" instance=apps1 maxAttempts=3 project=default
Jun 01 18:54:26 jupiter audit: ANOM_PROMISCUOUS dev=veth07fb3a94 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
Jun 01 18:54:26 jupiter audit[7422]: SYSCALL arch=c000003e syscall=46 success=yes exit=40 a0=3 a1=7ffe300368b0 a2=0 a3=7ffe300362d4 items=0 ppid=7089 pid=7422 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip" exe="/usr/bin/ip" subj=unconfined key=(null)
Jun 01 18:54:26 jupiter audit: SOCKADDR saddr=100000000000000000000000
Jun 01 18:54:26 jupiter audit: PROCTITLE proctitle=6970006C696E6B007365740064657600766574683037666233613934006E6F6D6173746572
Jun 01 18:54:26 jupiter kernel: veth07fb3a94: left allmulticast mode
Jun 01 18:54:26 jupiter kernel: veth07fb3a94: left promiscuous mode
Jun 01 18:54:26 jupiter kernel: br5: port 2(veth07fb3a94) entered disabled state
Jun 01 18:54:27 jupiter systemd[1]: var-lib-incus-devices-apps1-disk.stacks.opt\x2dstacks.mount: Deactivated successfully.
Jun 01 18:54:31 jupiter incusd[7089]: time="2025-06-01T18:54:31-04:00" level=warning msg="Failed auto start instance attempt" attempt=2 err="The instance is already running" instance=apps1 maxAttempts=3 project=default
Jun 01 18:54:34 jupiter audit[7481]: USER_LOGIN pid=7481 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=login acct="user" exe="/usr/sbin/sshd" hostname=? addr=192.168.2.91 terminal=sshd res=failed'
Jun 01 18:54:36 jupiter incusd[7089]: time="2025-06-01T18:54:36-04:00" level=warning msg="Failed auto start instance attempt" attempt=3 err="The instance is already running" instance=apps1 maxAttempts=3 project=default
Jun 01 18:54:36 jupiter incusd[7089]: time="2025-06-01T18:54:36-04:00" level=error msg="Failed to auto start instance" err="The instance is already running" instance=apps1 project=default
Jun 01 18:54:37 jupiter audit: ANOM_PROMISCUOUS dev=vethe18496eb prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
Jun 01 18:54:37 jupiter audit[7518]: SYSCALL arch=c000003e syscall=46 success=yes exit=40 a0=3 a1=7ffd53d5bf00 a2=0 a3=0 items=0 ppid=7089 pid=7518 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip" exe="/usr/bin/ip" subj=unconfined key=(null)

Attempting to start results in:

 Error: lxc apps1 20250601230649.755 ERROR    conf - ../src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc apps1 20250601230649.755 ERROR    conf - ../src/lxc/conf.c:lxc_setup:4444 - Failed to run mount hooks
lxc apps1 20250601230649.755 ERROR    start - ../src/lxc/start.c:do_start:1272 - Failed to setup container "apps1"
lxc apps1 20250601230649.755 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc apps1 20250601230649.765 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from "eth0" to its initial name "veth14294c35"
lxc apps1 20250601230649.765 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state "ABORTING" instead of "RUNNING"
lxc apps1 20250601230649.765 ERROR    start - ../src/lxc/start.c:__lxc_start:2107 - Failed to spawn container "apps1"
lxc apps1 20250601230649.765 WARN     start - ../src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 17 for process 41643 

Going to have to roll back again… Something is amiss.
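
For anyone digging into the same failure, the LXC log usually has more detail than journalctl; a minimal sketch, assuming the instance is named apps1:

incus info --show-log apps1             # dumps the instance's lxc.log
incus monitor --pretty --type=logging   # watch incusd logging while reproducing the start
incus config show apps1 --expanded      # confirm the effective config, including the nvidia.* keys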

OK, I took some time to do more debugging on my Nvidia containers not starting on 25.04.1.

lxc apps1 20250608194133.158 ERROR    conf - ../src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc apps1 20250608194133.159 ERROR    conf - ../src/lxc/conf.c:lxc_setup:4444 - Failed to run mount hooks
lxc apps1 20250608194133.159 ERROR    start - ../src/lxc/start.c:do_start:1272 - Failed to setup container "apps1"
lxc apps1 20250608194133.160 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc apps1 20250608194133.269 WARN     network - ../src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from "eth0" to its initial name "veth234e63de"
lxc apps1 20250608194133.270 ERROR    lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state "ABORTING" instead of "RUNNING"
lxc apps1 20250608194133.270 ERROR    start - ../src/lxc/start.c:__lxc_start:2107 - Failed to spawn container "apps1"
lxc apps1 20250608194133.271 WARN     start - ../src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 17 for process 28854

This stuck out to me in the logs:

lxc apps1 20250608194133.160 ERROR    sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)

I can confirm that when the following Nvidia Incus config options are disabled, both containers start:

nvidia.driver.capabilities: compute,graphics,utility,video
nvidia.runtime: "true"
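
Toggling those from the CLI looks roughly like this (a sketch; apps1 is my instance, and the container needs a restart afterwards):

incus config unset apps1 nvidia.runtime
incus config unset apps1 nvidia.driver.capabilities
incus restart apps1
# to re-enable:
incus config set apps1 nvidia.runtime=true nvidia.driver.capabilities=compute,graphics,utility,video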

For giggles, I tried removing my GPU from the instance and now I get the following when attempting to add the GPU back:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 323, in process_method_call
    result = await method.call(app, params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 40, in call
    result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 906, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 715, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 91, in wrapped
    args = list(args[:args_index]) + accept_params(accepts, args[args_index:])
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 23, in accept_params
    dump = validate_model(model, args_as_dict, exclude_unset=exclude_unset, expose_secrets=expose_secrets)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 80, in validate_model
    raise verrors from None
middlewared.service_exception.ValidationErrors: [EINVAL] device.GPU.gpu_type: Field required

I’m going to roll back again, as it’s definitely GPU related. Given the error above, though, I’m not sure it is Nvidia specific…

Any ideas, @awalkerix? Should I open a bug report with this info? It seems like a bug to me. Thanks. 🙂

I just tested this on 25.04.0: same exact issue…

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 323, in process_method_call
    result = await method.call(app, params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 40, in call
    result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 883, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 692, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 86, in wrapped
    args = list(args[:args_index]) + accept_params(accepts, args[args_index:])
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 23, in accept_params
    dump = validate_model(model, args_as_dict, exclude_unset=exclude_unset, expose_secrets=expose_secrets)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 80, in validate_model
    raise verrors from None
middlewared.service_exception.ValidationErrors: [EINVAL] device.GPU.gpu_type: Field required

Doing some more testing…

It doesn’t work with either cloud-init-created or UI-created instances.
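
In case it helps, I think the GPU can also be attached directly with the incus CLI, sidestepping the middleware validation that throws the gpu_type error. A sketch only: the device name and PCI address are placeholders, and changes made behind the middleware’s back may not survive.

incus config device add apps1 gpu0 gpu gputype=physical pci=0000:01:00.0
incus config device show apps1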

Did you check the nvidia-container-toolkit version on the host, or just in the LXC? My guess is that the toolkit on the host got updated to 1.17.7 with 25.04.1, and that is what errors out when it gets passed through to the LXC. I didn’t have the toolkit installed in my LXC at all and it still errors out.
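
Checking on the host is quick if you have shell access to the TrueNAS box; a sketch:

dpkg -l | grep nvidia-container      # toolkit packages shipped with the release
nvidia-container-cli --version       # from libnvidia-container-tools, if installed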

It’s not the toolkit; this seems to be a TNCE issue. My containers start without issue on 25.04.0 but fail on 25.04.1.

Adding the GPU fails on either version, but at least the containers run on 25.04.0.

EDIT: Sorry, I missed the host part… I am not on 25.04.1 currently, so I can’t verify the toolkit version on the host.

EDIT2: The toolkit on the host shouldn’t matter UNLESS you’re using the TNCE apps (Docker) functionality. I’m not using that; I’m using a container with toolkit 1.17.8 installed in it, working fine on 25.04.0.

So you can create a new container and attach the GPU, but adding a GPU to an existing container fails with these errors. Likewise, removing the GPU from an existing container and attempting to add it back fails with the same error.

Nvidia container instances still fail to start on 25.04.1.

But to answer your question, the host toolkit is 1.17.7 on 25.04.1.