Electric Eel 24.10.002 GPU Passthrough Troubleshooting

Maximilious · December 10, 2024, 12:39am

Hi All,

I installed a new RTX 3050 in my server today and wanted to pass it through to my Plex app (official IX channel), but the option for a GPU is missing. I understand this is currently not working in 24.10, but I wanted to help troubleshoot in case my situation helps the devs at all. I did not have a GPU installed in my system under the previous Dragonfish release.

Here’s what I’ve uncovered so far:

I verified the GPU can be found in the Isolated GPU devices section in Advanced Settings (but did not isolate for VM use).
nvidia-smi shows no output at all on the TrueNAS console, and I cannot select the card in my Plex application settings.
I enabled the Nvidia Drivers checkbox under Application Settings, but I get an error that it cannot install. The log file is very large, so I will include the end portion and can provide more if needed.
I found this article on the issue ("Docker Apps and UUID issue with NVIDIA GPU after upgrade to 24.10"), but my server does not show a slot occupied by the GPU, perhaps because I did enable isolation but then disabled it after reading that the GPU should not be isolated. I have not rebooted since removing it from isolation, and I cannot try this workaround because the midclt output does not include PCI slot info for the GPU.
Output of midclt:

admin@dj-truenas[~]$ midclt call app.gpu_choices | jq
{
  "0000:03:00.0": {
    "vendor": null,
    "description": "ASPEED Technology, Inc. ASPEED Graphics Family",
    "vendor_specific_config": {},
    "pci_slot": "0000:03:00.0"
  }
}

End of Nvidia log file:

make[1]: Leaving directory '/usr/src/linux-headers-6.6.44-production+truenas'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version >

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[   86.666076] [6182]: iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 262144,
[   86.666079] [6182]: iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 65536, DefaultTime2Wait 0, DefaultTime2Retain 0,
[   86.666081] [6182]: iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
[   86.666083] [6182]: iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048, RDMAExtensions No
[   86.666086] [6182]: iscsi-scst: Target parameters set for session 30100003d0200: QueuedCommands 32, Response timeout 90, Nop-In interval 30, Nop-In timeout 30
[   86.666125] [6182]: iscsi-scst: Creating connection 000000005fec937b for sid 0x30100003d0200, cid 0 (initiator iqn.1993-08.org.debian:01:286ad6c18d4f#172.16.15.2)
[   87.493376] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   87.494540] Bridge firewalling registered
[   87.687818] Initializing XFRM netlink socket
[ 1495.195961] loop0: detected capacity change from 0 to 2349400
[ 1495.203630] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 1495.520656] loop0: detected capacity change from 0 to 2349400
[ 1767.279169] VFIO - User Level meta-driver version: 0.3
[ 1767.284080] vfio-pci 0000:65:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 3001.103520] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 3001.103527] NVRM: GPU 0000:65:00.0 is already bound to vfio-pci.
[ 3001.105773] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[ 3001.105774] NVRM: This can occur when another driver was loaded and
               NVRM: obtained ownership of the NVIDIA device(s).
[ 3001.105775] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[ 3001.105776] NVRM: No NVIDIA devices probed.
[ 3001.105981] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linu>

If more information is needed, please let me know so I can provide it.
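
The kernel messages at the end of the installer log name the conflict directly: the NVRM line reports which driver already owns the card. As a minimal sketch (the function name and regular expression are illustrative assumptions, written only against the message format shown in this log), the relevant line can be pulled out of a large log like this:

```python
import re

# Matches the NVRM message format seen in this thread's log, e.g.
# "NVRM: GPU 0000:65:00.0 is already bound to vfio-pci."
BOUND_RE = re.compile(
    r"NVRM: GPU (?P<slot>[0-9a-fA-F:.]+) is already bound to (?P<driver>[\w-]+)\."
)


def find_bound_gpus(log_text: str) -> dict:
    """Return {pci_slot: driver} for each GPU the log reports as already claimed."""
    return {m.group("slot"): m.group("driver") for m in BOUND_RE.finditer(log_text)}


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = (
        "[ 3001.103527] NVRM: GPU 0000:65:00.0 is already bound to vfio-pci.\n"
        "[ 3001.105776] NVRM: No NVIDIA devices probed.\n"
    )
    print(find_bound_gpus(sample))
```

A non-empty result naming vfio-pci suggests the card is still held for VM passthrough, which would explain why nvidia.ko fails to load with "No such device".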

HoneyBadger · December 11, 2024, 2:50pm

Hey @Maximilious

This seems to indicate that the GPU is isolated. Can you confirm that it isn’t listed as isolated under the System → Advanced screen, so that the pane looks like the one below?

If that’s correct then let’s gather a debug.
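
Independently of the UI, the driver that currently owns a PCI device can be read from sysfs, where each device directory carries a `driver` symlink while a driver is bound. The following is a small sketch under that assumption; the helper name `bound_driver` and the `sysfs_root` parameter are hypothetical, and the slot address is the one from the log in this thread:

```python
import os
from typing import Optional


def bound_driver(pci_slot: str, sysfs_root: str = "/sys/bus/pci/devices") -> Optional[str]:
    """Return the kernel driver bound to pci_slot.

    Returns None when no driver symlink exists, which covers both an
    unbound device and a slot that is not present at all.
    """
    link = os.path.join(sysfs_root, pci_slot, "driver")
    if not os.path.islink(link):
        return None
    # The symlink target ends in the driver name, e.g. ".../drivers/vfio-pci".
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    slot = "0000:65:00.0"  # PCI address taken from the log in this thread
    print(f"{slot}: {bound_driver(slot) or 'no driver bound'}")
```

If this prints vfio-pci, the card is still reserved for VM use; after a successful driver install it would be expected to show nvidia instead.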

Maximilious · December 11, 2024, 9:37pm

Thanks - that’s correct. I removed the GPU from isolation after previously adding it. I have not rebooted after that operation, so let me try that tonight as a first step.

If I still have the same issue, how can I gather a debug for you?

DjP-iX · December 11, 2024, 9:49pm

Gather one before rebooting, in case there is anything useful in the logs that might be lost on reboot. And then a second one after reboot, if necessary.

Maximilious · December 11, 2024, 11:03pm

Ah, sorry, I ended up rebooting and now the card is showing up properly. I can likely recreate the issue if you want by putting the card in isolation and then removing it again.

EDIT: I may not be able to recreate it, as the Nvidia driver checkbox was already checked in the App Settings section and nvidia-smi now runs on the local console. I also see the GPU as available for selection under the Plex container, but I am still getting an error when trying to enable it. It looks like the error expected from the previous post I linked to.

Outputs below:

admin@dj-truenas[~]$ midclt call app.gpu_choices | jq
{
  "0000:03:00.0": {
    "vendor": null,
    "description": "ASPEED Technology, Inc. ASPEED Graphics Family",
    "vendor_specific_config": {},
    "pci_slot": "0000:03:00.0"
  },
  "0000:65:00.0": {
    "vendor": "NVIDIA",
    "description": "NVIDIA GeForce RTX 3050",
    "vendor_specific_config": {
      "uuid": "GPU-76232524-a038-ca19-8589-2526db8ab1f2"
    },
    "pci_slot": "0000:65:00.0"
  }
}

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 535, in __run_body
    rv = await self.middleware.run_in_thread(self.method, *args)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1361, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 268, in nf
    rv = func(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 55, in nf
    res = f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 183, in nf
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/crud.py", line 287, in do_update
    app = self.update_internal(job, app, data, trigger_compose=app['state'] != 'STOPPED')
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/crud.py", line 317, in update_internal
    update_app_config(app_name, app['version'], new_values, custom_app=app['custom_app'])
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/ix_apps/lifecycle.py", line 59, in update_app_config
    render_compose_templates(
  File "/usr/lib/python3/dist-packages/middlewared/plugins/apps/ix_apps/lifecycle.py", line 50, in render_compose_templates
    raise CallError(f'Failed to render compose templates: {cp.stderr}')
middlewared.service_exception.CallError: [EFAULT] Failed to render compose templates: Traceback (most recent call last):
  File "/usr/bin/apps_render_app", line 33, in <module>
    sys.exit(load_entry_point('apps-validation==0.1', 'console_scripts', 'apps_render_app')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/catalog_templating/scripts/render_compose.py", line 47, in main
    render_templates_from_path(args.path, args.values)
  File "/usr/lib/python3/dist-packages/catalog_templating/scripts/render_compose.py", line 19, in render_templates_from_path
    rendered_data = render_templates(
                    ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/catalog_templating/render.py", line 36, in render_templates
    ).render({'ix_lib': template_libs, 'values': test_values})
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/docker-compose.yaml", line 3, in top-level template code
    {% set c1 = tpl.add_container(values.consts.plex_container_name, values.plex.image_selector) %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/library/base_v2_1_0/render.py", line 53, in add_container
    container = Container(self, name, image)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/library/base_v2_1_0/container.py", line 68, in __init__
    self.deploy: Deploy = Deploy(self._render_instance)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/library/base_v2_1_0/deploy.py", line 15, in __init__
    self.resources: Resources = Resources(self._render_instance)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/library/base_v2_1_0/resources.py", line 24, in __init__
    self._auto_add_gpus_from_values()
  File "/mnt/.ix-apps/app_configs/plex/versions/1.1.5/templates/library/base_v2_1_0/resources.py", line 55, in _auto_add_gpus_from_values
    raise RenderError(f"Expected [uuid] to be set for GPU in slot [{pci}] in [nvidia_gpu_selection]")
base_v2_1_0.error.RenderError: Expected [uuid] to be set for GPU in slot [0000:65:00.0] in [nvidia_gpu_selection]
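
The RenderError complains that no uuid is set for slot 0000:65:00.0 in the app's `nvidia_gpu_selection`, even though `app.gpu_choices` now reports one. As a rough sketch (the function name is an assumption, and it relies only on the JSON shape shown in the output above), the slot-to-uuid mapping can be extracted so it is easy to compare against, or copy into, the app configuration:

```python
import json


def nvidia_uuids(gpu_choices: dict) -> dict:
    """Map each NVIDIA GPU's PCI slot to its reported uuid (None when absent)."""
    result = {}
    for slot, info in gpu_choices.items():
        if info.get("vendor") != "NVIDIA":
            continue  # skip non-NVIDIA devices such as the ASPEED BMC graphics
        result[slot] = (info.get("vendor_specific_config") or {}).get("uuid")
    return result


if __name__ == "__main__":
    # Sample copied from the post-reboot output in this thread.
    sample = json.loads("""
    {
      "0000:03:00.0": {"vendor": null,
                       "description": "ASPEED Technology, Inc. ASPEED Graphics Family",
                       "vendor_specific_config": {},
                       "pci_slot": "0000:03:00.0"},
      "0000:65:00.0": {"vendor": "NVIDIA",
                       "description": "NVIDIA GeForce RTX 3050",
                       "vendor_specific_config": {"uuid": "GPU-76232524-a038-ca19-8589-2526db8ab1f2"},
                       "pci_slot": "0000:65:00.0"}
    }
    """)
    for slot, uuid in nvidia_uuids(sample).items():
        print(slot, uuid or "uuid missing")
```

An empty result corresponds to the pre-reboot output earlier in the thread, where only the ASPEED device was listed.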

Maximilious · December 12, 2024, 1:25am

Since the drivers were automatically installed, I followed the steps from the other article and the GPU is now tied to the container, but hardware transcoding seems to leave me in a worse spot than software transcoding on my Xeon did. All of my transcoded streams now end up at 400x200 resolution.

Edit: I was passing the variables incorrectly. After passing them correctly, I’m getting better results. Unless you want me to try to reproduce the issue and gather logs, we can close this one out.

I̶ ̶c̶h̶e̶c̶k̶e̶d̶ ̶t̶h̶e̶ ̶w̶e̶b̶ ̶a̶n̶d̶ ̶a̶m̶ ̶n̶o̶w̶ ̶p̶a̶s̶s̶i̶n̶g̶ ̶t̶h̶e̶ ̶f̶o̶l̶l̶o̶w̶i̶n̶g̶ ̶v̶a̶r̶i̶a̶b̶l̶e̶s̶ ̶t̶o̶ ̶t̶h̶e̶ ̶c̶o̶n̶t̶a̶i̶n̶e̶r̶,̶ ̶b̶u̶t̶ ̶I̶’̶m̶ ̶s̶t̶i̶l̶l̶ ̶h̶a̶v̶i̶n̶g̶ ̶t̶r̶a̶n̶s̶c̶o̶d̶i̶n̶g̶ ̶i̶s̶s̶u̶e̶s̶ ̶a̶f̶t̶e̶r̶ ̶s̶e̶l̶e̶c̶t̶i̶n̶g̶ ̶t̶h̶e̶ ̶3̶0̶5̶0̶ ̶G̶P̶U̶ ̶i̶n̶ ̶T̶r̶u̶e̶n̶a̶s̶ ̶c̶o̶n̶f̶i̶g̶u̶r̶a̶t̶i̶o̶n̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶P̶l̶e̶x̶ ̶G̶U̶I̶:̶

Passed Env Variables:
NVIDIA_GPU_CAPABILITIES = ALL
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility

TrueNAS GPU Option:

Transcoding settings in Plex GUI:

This may be something for the Plex team to handle at this point if you feel everything is set correctly.
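
For anyone checking the same variables, here is a hedged sketch of a sanity check on the `NVIDIA_DRIVER_CAPABILITIES` value. The capability list reflects the names I believe the NVIDIA Container Toolkit documents, and the idea that hardware transcoding needs the `video` capability is an assumption of this sketch, not something confirmed in this thread. Both function names are illustrative.

```python
# Capability names the NVIDIA Container Toolkit documents, to the best of my
# knowledge; treat this list as an assumption and check the official docs.
KNOWN_CAPABILITIES = {"compute", "compat32", "graphics", "utility", "video", "display"}


def parse_capabilities(value: str) -> set:
    """Split a comma-separated capability string; 'all' expands to every known name.

    Values are compared literally; whether the toolkit accepts other
    casing (such as 'ALL') is not checked here.
    """
    caps = {c.strip() for c in value.split(",") if c.strip()}
    if "all" in caps:
        return set(KNOWN_CAPABILITIES)
    return caps


def supports_hw_transcoding(value: str) -> bool:
    """Assumption: NVENC/NVDEC access requires the 'video' capability."""
    return "video" in parse_capabilities(value)


if __name__ == "__main__":
    value = "compute,video,utility"  # value quoted in the post above
    print(sorted(parse_capabilities(value)))
    print("video capability present:", supports_hw_transcoding(value))
```

With the value from the post, the `video` capability is present, so under this sketch's assumption the remaining transcoding behavior would come down to the Plex settings themselves.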

DjP-iX · December 12, 2024, 2:09pm

I’m glad to hear it’s working better. The Plex team would know better than I would whether there are any improvements you can make to those settings or environment variables.

I don’t think we still need to gather a debug at this point, since I suspect needing to reboot after un-isolating the GPU is working as expected.