In line with the existing ticket NAS-131709, NVIDIA datacenter graphics cards such as the Tesla A2 do not appear to be compatible with the driver set downloaded by SCALE EE 24.10.0.2.
This is the only entry shown by “midclt call app.gpu_choices | jq”:
{
"0000:01:00.1": {
"vendor": null,
"description": "Matrox Electronics Systems Ltd. MGA G200EH",
"vendor_specific_config": {},
"pci_slot": "0000:01:00.1"
}
}
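For comparison, filtering the raw lspci listing for display-class devices shows the A2 is visible at the PCI level even though app.gpu_choices omits it. A small sketch (the here-doc stands in for live `lspci -nn` output, using the two devices from this system):

```shell
# Sketch: filter display-class devices the way one might cross-check
# app.gpu_choices against the PCI bus. On a live system, run:
#   lspci -nn | grep -Ei 'vga compatible|3d controller'
filter_gpus() { grep -Ei 'vga compatible|3d controller'; }

filter_gpus <<'EOF'
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
22:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
EOF
# Both devices match the filter, yet only the Matrox BMC VGA is reported
# by the middleware.
```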
However, “lspci -vv” gives:
22:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
Subsystem: NVIDIA Corporation GA107GL [A2 / A16]
Physical Slot: 5
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 144
NUMA node: 1
IOMMU group: 19
Region 3: Memory at f6000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Null
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [bb0 v1] Physical Resizable BAR
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
BAR 3: current size: 32MB, supported: 32MB
Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration- 10BitTagReq+ Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
IOVSta: Migration-
Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
VF offset: 4, stride: 1, Device ID: 25b6
Supported Page Size: 00000573, System Page Size: 00000001
Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
Capabilities: [d00 v1] Lane Margining at the Receiver <?>
Capabilities: [e00 v1] Data Link Feature <?>
Kernel modules: nouveau
This appears in the kernel log during the failure:
Nov 20 21:05:43 servername kernel: VFIO - User Level meta-driver version: 0.3
Nov 20 21:05:44 servername kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Nov 20 21:05:44 servername kernel:
Nov 20 21:05:44 servername kernel: nvidia 0000:22:00.0: enabling device (0040 -> 0042)
Nov 20 21:05:44 servername kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)
Nov 20 21:05:44 servername kernel: nvidia: probe of 0000:22:00.0 failed with error -1
Nov 20 21:05:44 servername kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 21:05:44 servername kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 21:05:44 servername kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 237
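The "BAR0 is 0M @ 0x0" line means the kernel never assigned an address window to the card's BAR0. That can be confirmed without the NVIDIA driver by reading the device's sysfs `resource` file, whose lines are "start end flags" triples (a sketch; the path follows the standard sysfs layout for this slot):

```shell
# Sketch: compute a BAR's size from a sysfs "start end flags" line the way
# the kernel does (end - start + 1, or 0 for an unassigned all-zero entry).
# On a live system: head -n1 /sys/bus/pci/devices/0000:22:00.0/resource | bar_size
bar_size() {
  read -r start end _flags
  echo $(( end > start ? end - start + 1 : 0 ))
}

# An all-zero first line reproduces NVRM's "BAR0 is 0M @ 0x0" complaint:
printf '0x0000000000000000 0x0000000000000000 0x0000000000000000\n' | bar_size
# → 0
```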
The following is shown when the installation fails via the GUI:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
await self.future
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 535, in __run_body
rv = await self.middleware.run_in_thread(self.method, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_thread
return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1361, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 65, in install
self._install_driver(job, td, path)
File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 133, in _install_driver
subprocess.run([path, "--tmpdir", td, "-s"], capture_output=True, check=True, text=True)
File "/usr/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/tmp7jdwg7bo/NVIDIA-Linux-x86_64-550.135-no-compat32.run', '--tmpdir', '/root/tmp7jdwg7bo', '-s']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
await self.future
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 533, in __run_body
rv = await self.method(*args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
res = await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/middlewared/plugins/docker/update.py", line 106, in do_update
await (
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 436, in wait
raise self.exc_info[1]
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 492, in run
raise handled
middlewared.service_exception.CallError: [EFAULT] Command /root/tmp7jdwg7bo/NVIDIA-Linux-x86_64-550.135-no-compat32.run --tmpdir /root/tmp7jdwg7bo -s failed (code 1):
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.135.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
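Since the installer's error text suggests nouveau may be holding the device, it's worth confirming what is actually bound. The lspci dump above lists nouveau only under "Kernel modules:" (candidate drivers); a "Kernel driver in use:" line would show an actual binding. A sketch of the check:

```shell
# Sketch: extract the "Kernel driver in use:" line from `lspci -k` output.
# The here-doc stands in for `lspci -k -s 22:00.0` on a live system.
driver_in_use() { sed -n 's/^[[:space:]]*Kernel driver in use: //p'; }

driver_in_use <<'EOF'
22:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
	Kernel modules: nouveau
EOF
# Prints nothing for this output: no driver is bound, which suggests the
# invalid BAR0 rather than a nouveau conflict is blocking the probe.
```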
Additionally, the end of /var/log/nvidia-installer.log gives us:
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[ 725.799302] ? free_unref_page_prepare+0xbd/0x360
[ 725.799314] ? __count_memcg_events+0x4d/0x90
[ 725.799320] ? count_memcg_events.constprop.0+0x1a/0x30
[ 725.799329] ? handle_mm_fault+0xa2/0x370
[ 725.799336] ? do_user_addr_fault+0x21d/0x630
[ 725.799343] ? exc_page_fault+0x77/0x170
[ 725.799350] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 725.799361] RIP: 0033:0x7ff7f4a90c5b
[ 725.799368] RSP: 002b:00007ffc52052ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 725.799374] RAX: ffffffffffffffda RBX: 0000000000005a23 RCX: 00007ff7f4a90c5b
[ 725.799378] RDX: 00007ffc52052f60 RSI: 0000000000005a23 RDI: 0000000000000004
[ 725.799381] RBP: 00007ffc52056550 R08: 00007ff7f4b66460 R09: 00007ff7f4b66460
[ 725.799385] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc52052f60
[ 725.799389] R13: 0000000000005a23 R14: 00007ffc52056501 R15: 00007ffc520566c8
[ 725.799399] </TASK>
[ 846.236017] VFIO - User Level meta-driver version: 0.3
[ 847.363343] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 847.365925] nvidia 0000:22:00.0: enabling device (0040 -> 0042)
[ 847.366030] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)
[ 847.366043] nvidia: probe of 0000:22:00.0 failed with error -1
[ 847.366108] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 847.366111] NVRM: None of the NVIDIA devices were initialized.
[ 847.366597] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Looking at the driver NVIDIA recommends for this card, they list a different release than the one SCALE downloads, possibly for more stabilized support/testing:
URL(ish): nvidia dot com /en-us/drivers/details/236265/
Driver Version: 550.127.08
Release Date: Tue Nov 19, 2024
versus Linux-x86_64-550.135, the version that was attempting to install; its page (URL(ish): nvidia dot com /en-us/drivers/details/236036/) appears to list only more consumer-level cards, not the datacenter line.
Is it possible to load the “datacenter” line driver sets? Or possibly have the system select a driver set based on the model information from the card? This driver set also appears to be the supported release for the P4, which others have had issues with recently.
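One way to verify the incompatibility directly would be to extract the .run package SCALE downloaded and look for the A2's PCI device ID (25b6, per the SR-IOV capability in the lspci dump above) in the installer's supported-GPU list. Recent NVIDIA installers ship a supported-gpus.json; the extraction step and JSON field name below are assumptions based on recent driver packages:

```shell
# Sketch: check whether a PCI device ID appears in a driver package's
# supported-gpus.json. On a live system, first extract the installer:
#   sh NVIDIA-Linux-x86_64-550.135-no-compat32.run --extract-only
# (JSON path and field names are assumptions; adjust to the actual package.)
id_supported() {  # usage: id_supported <hex-devid> <json-file>
  grep -qi "\"devid\"[[:space:]]*:[[:space:]]*\"0x$1\"" "$2"
}

# Example against a stand-in JSON fragment:
cat > /tmp/supported-gpus.json <<'EOF'
{"chips": [{"devid": "0x25B6", "name": "NVIDIA A2"}]}
EOF
id_supported 25b6 /tmp/supported-gpus.json && echo "25b6 listed" || echo "25b6 missing"
# → 25b6 listed
```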
Thanks! Happy to help with any diags (I have both the A2 I am having issues with and 2x P4s).
Chassis: HP DL380 Gen8
CPU: 2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 96GB
Storage: RAIDZ2, 10-wide, 2.73TiB drives
Boot: Mirror 128GB SSDs