Setting Up NVIDIA GRID Drivers for vGPU Passthrough on TrueNAS SCALE

Hello, after searching for guides on how to get vGPU working on TrueNAS without finding much, I started experimenting and eventually managed to make it work. Here’s how I did it:

WARNING: What I did is not officially supported, and I’m far from an expert. You risk breaking your system by following these steps. Proceed with caution. If you plan to try this yourself, make sure to back up your configuration first, and consider testing everything in a temporary VM before applying it to your main system.

After setting up vGPU on Proxmox and allocating a portion of my Tesla P4 to my TrueNAS SCALE VM, I noticed that the GPU wasn’t recognized. Running nvidia-smi returned the following output: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. I also saw the following messages during boot:
[ 5365.534389] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1bb3)
[ 5365.534389] NVRM: installed in this system is not supported by the
[ 5365.534389] NVRM: NVIDIA 550.127.05 driver release.
[ 5365.534389] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
[ 5365.534389] NVRM: in this release's README, available on the operating system
[ 5365.534389] NVRM: specific graphics driver download page at www.nvidia.com.
[ 5365.537097] nvidia: probe of 0000:02:00.0 failed with error -1
[ 5365.537355] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5365.537535] NVRM: None of the NVIDIA devices were initialized.
[ 5366.332187]
[ 5366.335732] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1bb3)
[ 5366.335732] NVRM: installed in this system is not supported by the
[ 5366.335732] NVRM: NVIDIA 550.127.05 driver release.
[ 5366.335732] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
[ 5366.335732] NVRM: in this release's README, available on the operating system
[ 5366.335732] NVRM: specific graphics driver download page at www.nvidia.com.
[ 5366.338884] nvidia: probe of 0000:02:00.0 failed with error -1
[ 5366.339198] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5366.339435] NVRM: None of the NVIDIA devices were initialized.

So, I decided to take my chances and manually install the required drivers by following these steps:

  1. I first disabled the default NVIDIA drivers from the app settings (Configuration -> Settings -> uncheck “Install NVIDIA Drivers”) and rebooted the system.

  2. I enabled developer mode by running install-dev-tools (see: Developer Mode (Unsupported) | TrueNAS Documentation Hub)

  3. I downloaded and installed the NVIDIA 535 GRID drivers following these instructions: Proxmox vGPU - v3 - wvthoog.nl (important: don’t download the driver files to your home directory, since they won’t be runnable from there; I saved them in my main pool instead).
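
For reference, a minimal sketch of what launching the installer from a data pool looks like (the pool path and driver filename are only examples, and the linked guide has the exact options to use):

# example only: the .run file saved on a data pool, not in the home directory
cd /mnt/main/nvidia
chmod +x NVIDIA-Linux-x86_64-535.161.07-grid.run
./NVIDIA-Linux-x86_64-535.161.07-grid.run --dkms    # run from a root shell

Once the installer finishes, nvidia-smi should start reporting the vGPU.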

After that, I encountered an error about an unset UUID, which I fixed thanks to this forum post: Docker Apps and UUID issue with NVIDIA GPU after upgrade to 24.10. Once that was resolved, I got another error when starting the container: [EFAULT] Failed 'up' action for 'jellyfin' app. Please check /var/log/app_lifecycle.log for more details. In the app_lifecycle.log file I found the following error: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown. I fixed that by installing the NVIDIA Container Toolkit, following this guide: Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit. After completing all these steps and restarting Docker, everything started working perfectly.
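
In case that page moves, these are roughly the commands the toolkit guide walks through for a Debian-based system (written from memory, so double-check them against NVIDIA's page before running anything):

# add NVIDIA's apt repository and signing key (check the official guide for the current URLs)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# install the toolkit, register the NVIDIA runtime with Docker, and restart Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker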

The following day, after a vGPU profile change and a restart, I started getting the same Failed ‘up’ error, but with the following error in the logs: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: GPU-59a463e2-2821-11f0-a2f6-d880bd3f90c1: unknown device: unknown. The solution was the same as for the unset UUID error.
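
A quick way to see what happened is to list the UUID the driver currently reports and compare it with the one in the error (nvidia-smi -L is a standard driver command, nothing TrueNAS-specific):

# list GPUs with their current UUIDs; if the UUID differs from the one in the error,
# the app is still pointing at the old vGPU instance
nvidia-smi -L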

Final disclaimer: Forgive me for any possible grammatical errors, and don’t blame me if your system implodes. Have fun!

If you know a better and especially more supported way of achieving this, please let me know.

I’ve noticed that when you reboot your TrueNAS instance, the GPU UUID changes, and as a result the GPU gets deselected from your container after you fix the UUID issue. To avoid wasting time, remember to re-check your GPU selection at the bottom of the container settings. I’m considering writing a script that resets the UUID and reselects the GPU automatically after every reboot; if I do end up writing it, I’ll post it here.
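
Until then, here is a minimal sketch of the kind of check I have in mind: it only compares the current UUID against the last one it saw and prints a warning, it doesn't touch any app settings (the state-file path is just an example):

#!/bin/sh
# warn after a reboot if the vGPU UUID has changed (adjust the state-file path to your pool)
STATE=/mnt/main/scripts/last-gpu-uuid
CURRENT=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)
if [ -f "$STATE" ] && [ "$CURRENT" != "$(cat "$STATE")" ]; then
    echo "GPU UUID changed to $CURRENT - re-select the GPU in your app settings"
fi
echo "$CURRENT" > "$STATE"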

I’ve also noticed that transcoding performance occasionally drops significantly, often down to just 2-3 fps. Rebooting the system resolves it for now, but I plan to investigate further. If the problem persists, I’ll try passing through the entire GPU to determine whether it’s related to the vGPU setup or the GPU itself.
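
When it happens again, I'll keep something simple like this running during a transcode to see whether the vGPU is actually being used (just plain nvidia-smi in a watch loop):

# refresh nvidia-smi once per second while a transcode is running
watch -n 1 nvidia-smi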

well hell yeah, I tried installing NVIDIA GRID drivers so many times over the last 6 months and failed; now I see why: I was always running them from the ~ dir

I will have to give this a go with my 2080 Ti patched drivers…

thanks man!

Have fun!


Same issues as I have always had:

truenas_admin@truenas:~$ sudo install-dev-tools
[sudo] password for truenas_admin: 
+ FORCE_ARG=
+ [[ '' == \-\-\f\o\r\c\e ]]
+ [[ ! -S /var/run/middleware/middlewared.sock ]]
+ PACKAGES=(make open-iscsi python3-cryptography python3-pip python3-pyfakefs python3-pyotp python3-pytest python3-pytest-asyncio python3-pytest-dependency python3-pytest-rerunfailures python3-pytest-timeout snmp sshpass zstd)
+ PIP_PACKAGES=()
+ '[' -f /usr/local/libexec/disable-rootfs-protection ']'
+ /usr/local/libexec/disable-rootfs-protection
Flagging root dataset as developer mode
Setting readonly=off on dataset boot-pool/ROOT/25.04.0/opt
Setting readonly=off on dataset boot-pool/ROOT/25.04.0/usr
/usr/bin/apt-sortpkgs: setting 0o755 on file.
Traceback (most recent call last):
  File "/usr/local/libexec/disable-rootfs-protection", line 151, in <module>
    chmod_files()
  File "/usr/local/libexec/disable-rootfs-protection", line 82, in chmod_files
    os.chmod(entry.path, new_mode)
OSError: [Errno 30] Read-only file system: '/usr/bin/apt-sortpkgs'

and running the driver install:

truenas_admin@truenas:/mnt/fast/nvidia/NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83/Guest_Drivers$ sudo ./NVIDIA-Linux-x86_64-570.133.20-grid.run --dkms
sudo: process 9137 unexpected status 0x57f
ERROR: Temporary directory /tmp is not executable - use the  --tmpdir option to specify a different one.
truenas_admin@truenas:/mnt/fast/nvidia/NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83/Guest_Drivers$ 

even if I make the dir in the pool owned by root and executable, I get this:

truenas_admin@truenas:/mnt/fast/nvidia/NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83/Guest_Drivers$ sudo ./NVIDIA-Linux-x86_64-570.133.20-grid.run --dkms --tmpdir  /mnt/fast/nvidia/tmp/
sudo: process 9700 unexpected status 0x57f
ERROR: Temporary directory /mnt/fast/nvidia/tmp/ is not executable - use the  --tmpdir option to specify a different one.
truenas_admin@truenas:/mnt/fast/nvidia/NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83/Guest_Drivers$ ls -la /mnt/fast/nvidia/
total 50
drwxrwxrwx 4 root root  4 May 10 13:11 .
drwxr-xr-x 3 root root  3 May 10 12:55 ..
drwxrwxrwx 5 root root 12 May 10 12:56 NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83
drwxrwxrwx 2 root root  2 May 10 13:16 tmp
truenas_admin@truenas:/mnt/fast/nvidia/NVIDIA-GRID-Linux-KVM-570.133.10-570.133.20-572.83/Guest_Drivers$ 

I even set the tmpdir like this and it doesn’t help. I can get past this error by not using sudo, but then the installer UI tells me I need to be root.

OK, I got further. Some of my issues might be because I am using a cloned boot pool (I never want to modify the base while experimenting).

Here is how to enable the dev mode tools:

  1. clone and activate a new boot pool, give it a name you won’t forget
  2. reboot
  3. ssh in and set the boot environment datasets writable:
sudo zfs set readonly=off boot-pool/ROOT/<clone-boot-pool-name>
sudo zfs set readonly=off boot-pool/ROOT/<clone-boot-pool-name>/conf
sudo zfs set readonly=off boot-pool/ROOT/<clone-boot-pool-name>/opt
sudo zfs set readonly=off boot-pool/ROOT/<clone-boot-pool-name>/usr
  4. now it is possible to sudo install-dev-tools
  5. I still have issues with running the NVIDIA .run file as sudo from one of my pools; no matter what I do, it says that the tmp dir is not executable
  6. I extracted the .run successfully and then ran the NVIDIA installer by hand with --dkms, but it fails compilation

I have no idea how OP was able to run any of this on SCALE… could you maybe post explicit and detailed instructions of what you did?

I forgot to mention that you need to specify a temporary directory. I don’t remember the full command exactly, but I think I created a tmp directory inside my main pool and used it with the --tmpdir /mnt/path/to/my/tmp option. Do you know of any services that would let me create a guide I can edit later? I can’t edit my original post anymore.
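
From memory, it was something along these lines, run from a root shell (the pool name and driver version are just what I had, so don't take the exact paths as gospel):

# create a tmp directory on a data pool and point the installer at it
mkdir /mnt/main/tmp
./NVIDIA-Linux-x86_64-535.161.07-grid.run --dkms --tmpdir /mnt/main/tmp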

I create all my guides on GitHub Gists.

Also, I get caught in a catch-22 with the --tmpdir command (I already created a pool in a location with plenty of perms): if I run it as truenas_admin, the installer extracts and runs, but then it tells me I need to be root.

When I sudo the command or run it as actual root, it gives me the other errors.

Are you on 25.04, or some older install / an old install upgraded to 25.04? Because this really doesn’t work…

Like so, you can see the --tmpdir option works as truenas_admin but not as sudo:

truenas_admin@truenas[/mnt/fast/nvidia]$ sudo ./NVIDIA-Linux-x86_64-535.230.02-grid.run --dkms --tmpdir /mnt/fast/nvidia/tmp/
sudo: process 9470 unexpected status 0x57f

ERROR: Temporary directory /mnt/fast/nvidia/tmp/ is not executable - use the  --tmpdir option to specify a different one.


truenas_admin@truenas[/mnt/fast/nvidia]$  ./NVIDIA-Linux-x86_64-535.230.02-grid.run --dkms --tmpdir /mnt/fast/nvidia/tmp/ 

Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.230.02....................................................................................................
nvidia-installer: Error opening log file '/var/log/nvidia-installer.log' for writing (Permission denied); disabling logging.
truenas_admin@truenas[/mnt/fast/nvidia]$ 

I’m running version 24.10, but I don’t think that’s the issue. When I run

root@truenas[/mnt/main]# ./NVIDIA-Linux-x86_64-535.161.07-grid.run --dkms --tmpdir /mnt/main/test

the drivers install without any issues. I suspect the problem might be related to the /var/log/ permissions. In my case, the permissions are as follows:

root@truenas[~]# ls -lh /var/log/nvidia-installer.log
-rw-r--r-- 1 root root 731 May 12 18:03 /var/log/nvidia-installer.log

nice suggestion, no cigar

truenas_admin@truenas[~]$ ls -lh /var/log/nvidia-installer.log
-rw-r--r-- 1 root root 9.8K May 10 13:59 /var/log/nvidia-installer.log

FYI I had the same issues on fresh 24.x series installs before; I have been at this for over 6 months…

When you enabled dev mode, what else did you have configured vs not configured (apps service, domain join, etc.)? I know, for example, that one can’t enable dev mode while the app service is configured, but there it fails completely. That’s not my issue (just using it as an example).

Also, I assure you that root / a user with sudo is actively blocked from executing the .run package by the kernel; I used strace to prove that:

truenas_admin@truenas[/mnt/fast/nvidia]$ sudo strace ./NVIDIA-Linux-x86_64-535.230.02-grid.run
strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Operation not permitted
strace: ptrace(PTRACE_TRACEME, ...): Operation not permitted
+++ exited with 1 +++

What happens if you install strace and do the same? (i.e. sudo and root get blocked; they are not a true superuser)

Lastly, are you sure you didn’t run any other commands after installing dev mode? Did you modify the groups truenas_admin is in? Are you using a different account (not root or the default admin)?

I’m pretty sure I didn’t run any other commands. This weekend I’ll try to repeat the process in a test VM and write a detailed guide with every step I took.
