Ok, I took some time to do more debugging on my nvidia containers not starting on 25.04.1.
lxc apps1 20250608194133.158 ERROR conf - ../src/lxc/conf.c:run_buffer:322 - Script exited with status 1
lxc apps1 20250608194133.159 ERROR conf - ../src/lxc/conf.c:lxc_setup:4444 - Failed to run mount hooks
lxc apps1 20250608194133.159 ERROR start - ../src/lxc/start.c:do_start:1272 - Failed to setup container "apps1"
lxc apps1 20250608194133.160 ERROR sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc apps1 20250608194133.269 WARN network - ../src/lxc/network.c:lxc_delete_network_priv:3631 - Failed to rename interface with index 0 from "eth0" to its initial name "veth234e63de"
lxc apps1 20250608194133.270 ERROR lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:878 - Received container state "ABORTING" instead of "RUNNING"
lxc apps1 20250608194133.270 ERROR start - ../src/lxc/start.c:__lxc_start:2107 - Failed to spawn container "apps1"
lxc apps1 20250608194133.271 WARN start - ../src/lxc/start.c:lxc_abort:1036 - No such process - Failed to send SIGKILL via pidfd 17 for process 28854
This stuck out to me in the logs:
lxc apps1 20250608194133.160 ERROR sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
I can confirm that both containers start when I disable the following nvidia incus config options (a quick host-side sanity check is sketched below):
nvidia.driver.capabilities: compute,graphics,utility,video
nvidia.runtime: "true"
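For reference, here is the quick host-side sanity check I would run before blaming the container config, since the failing mount hook is the one that nvidia.runtime adds. This is only a sketch and assumes the hook relies on the host's nvidia-smi and nvidia-container-cli tooling (I have not confirmed that against the incus source):

# Host-side sanity check for the NVIDIA pieces the nvidia.runtime hook likely depends on.
# Sketch only: assumes nvidia-smi and nvidia-container-cli are what the hook uses.
import shutil
import subprocess

def check(cmd):
    """Run a diagnostic command and report whether it succeeds."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]}: not found on host")
        return
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else f"FAILED ({result.returncode})"
    print(f"{' '.join(cmd)}: {status}")
    if result.returncode != 0:
        print(result.stderr.strip())

check(["nvidia-smi"])                    # is the driver loaded and reachable?
check(["nvidia-container-cli", "info"])  # does the container toolkit see the GPU?
check(["nvidia-container-cli", "list"])  # components the hook would map into the container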
For giggles, I tried removing my GPU from the instance, and now I get the following error when attempting to add the GPU back:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/ws_handler/rpc.py", line 323, in process_method_call
result = await method.call(app, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/server/method.py", line 40, in call
result = await self.middleware.call_with_audit(self.name, self.serviceobj, methodobj, params, app)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 906, in call_with_audit
result = await self._call(method, serviceobj, methodobj, params, app=app,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 715, in _call
return await methodobj(*prepared_call.args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/decorator.py", line 91, in wrapped
args = list(args[:args_index]) + accept_params(accepts, args[args_index:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 23, in accept_params
dump = validate_model(model, args_as_dict, exclude_unset=exclude_unset, expose_secrets=expose_secrets)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/api/base/handler/accept.py", line 80, in validate_model
raise verrors from None
middlewared.service_exception.ValidationErrors: [EINVAL] device.GPU.gpu_type: Field required
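For what it's worth, that last line looks like a plain missing-required-field validation error rather than anything nvidia-specific: whatever payload gets sent when re-adding the GPU apparently no longer includes a gpu_type value, and the device model requires one. Here is a generic sketch of the same failure mode (this is not the actual middlewared model; every field name except gpu_type is made up):

# Generic sketch of the failure mode, not the real middlewared model.
# A required Pydantic field with no default rejects any payload that omits it.
from pydantic import BaseModel, ValidationError

class GPUDevice(BaseModel):
    dev_type: str = "GPU"    # hypothetical field
    gpu_type: str            # required: no default, like device.GPU.gpu_type
    pci: str | None = None   # hypothetical field

try:
    GPUDevice(pci="0000:01:00.0")  # payload that omits gpu_type
except ValidationError as e:
    print(e)  # with pydantic v2 this reports "Field required" for gpu_type, matching the message above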
I’m going to roll back again, as it’s definitely GPU related. I’m not sure that it’s nvidia-specific, though, given the error above…
Any ideas, @awalkerix? Should I open a bug report with this info? It seems like a bug to me. Thanks.