Scale EE 24.10.0.2 - Tesla A2 driver installation failure

trvldsismyspiritanml · November 21, 2024, 5:50am

In line with the existing NAS-131709, NVIDIA datacenter graphics cards such as the Tesla A2 do not appear to be compatible with the driver set downloaded by SCALE EE 24.10.0.2.

This is the only entry returned by “midclt call app.gpu_choices | jq”:


{
  "0000:01:00.1": {
    "vendor": null,
    "description": "Matrox Electronics Systems Ltd. MGA G200EH",
    "vendor_specific_config": {},
    "pci_slot": "0000:01:00.1"
  }
}

However, “lspci -vv” does show the card:

22:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
        Subsystem: NVIDIA Corporation GA107GL [A2 / A16]
        Physical Slot: 5
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 144
        NUMA node: 1
        IOMMU group: 19
        Region 3: Memory at f6000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Null
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (downgraded), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
                Vector table: BAR=0 offset=00b90000
                PBA: BAR=0 offset=00ba0000
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration- 10BitTagReq+ Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
                IOVSta: Migration-
                Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 4, stride: 1, Device ID: 25b6
                Supported Page Size: 00000573, System Page Size: 00000001
                Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
                Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver <?>
        Capabilities: [e00 v1] Data Link Feature <?>
        Kernel modules: nouveau
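
One way to read the output above is to compare the device-level "Region" lines with the BARs listed under "Physical Resizable BAR". Below is a minimal Python sketch of that comparison. The names `LSPCI_EXCERPT` and `unassigned_bars` are hypothetical, the excerpt is a handful of lines copied from the output above, and the idea that a BAR without a non-zero "Region" address is unassigned is an assumption, not something lspci states outright.

```python
import re

# A few lines copied from the `lspci -vv` output above (not the full dump).
LSPCI_EXCERPT = """\
22:00.0 3D controller: NVIDIA Corporation GA107GL [A2 / A16] (rev a1)
        Region 3: Memory at f6000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
                BAR 3: current size: 32MB, supported: 32MB
"""

# "Region N: Memory at <hex address> ..." lines (device BARs and SR-IOV VF BARs).
REGION_RE = re.compile(r"^\s*Region (\d+): Memory at ([0-9a-f]+) ", re.M)
# "BAR N: current size: <size>, ..." lines from the Resizable BAR capability.
REBAR_RE = re.compile(r"^\s*BAR (\d+): current size: (\S+),", re.M)


def unassigned_bars(lspci_text):
    """Return {bar_index: size} for advertised BARs with no non-zero address."""
    assigned = {
        int(index)
        for index, address in REGION_RE.findall(lspci_text)
        if int(address, 16) != 0  # all-zero addresses are treated as unassigned
    }
    advertised = {int(index): size for index, size in REBAR_RE.findall(lspci_text)}
    return {index: size for index, size in advertised.items() if index not in assigned}


if __name__ == "__main__":
    print(unassigned_bars(LSPCI_EXCERPT))  # {0: '16MB', 1: '16GB'}
```

On this excerpt the sketch reports BAR 0 and BAR 1 as lacking an address, which would be consistent with the "BAR0 is 0M @ 0x0" message in the kernel log.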

This shows up in the system log:

Nov 20 21:05:43 servername kernel: VFIO - User Level meta-driver version: 0.3
Nov 20 21:05:44 servername kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Nov 20 21:05:44 servername kernel:
Nov 20 21:05:44 servername kernel: nvidia 0000:22:00.0: enabling device (0040 -> 0042)
Nov 20 21:05:44 servername kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                              NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)
Nov 20 21:05:44 servername kernel: nvidia: probe of 0000:22:00.0 failed with error -1
Nov 20 21:05:44 servername kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 21:05:44 servername kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 21:05:44 servername kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 237
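
For anyone collecting diagnostics from several machines, the NVRM line that names the bad BAR can be pulled out of the log automatically. This is a rough Python sketch; `KERNEL_LOG`, `BAR_RE` and `invalid_bars` are hypothetical names, and the regular expression is written only against the message format seen in the log above, so other driver versions may word it differently.

```python
import re

# Lines copied from the kernel log above.
KERNEL_LOG = """\
Nov 20 21:05:44 servername kernel: nvidia 0000:22:00.0: enabling device (0040 -> 0042)
Nov 20 21:05:44 servername kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                              NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)
Nov 20 21:05:44 servername kernel: nvidia: probe of 0000:22:00.0 failed with error -1
"""

# Matches e.g. "NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)".
BAR_RE = re.compile(
    r"NVRM: BAR(\d+) is (\d+)M @ (0x[0-9a-fA-F]+) \(PCI:([0-9a-fA-F:.]+)\)"
)


def invalid_bars(log_text):
    """Return one dict per BAR that the NVIDIA driver reported as invalid."""
    found = []
    for bar, size_mb, address, pci in BAR_RE.findall(log_text):
        found.append(
            {
                "pci": pci,
                "bar": int(bar),
                "size_mb": int(size_mb),
                "address": int(address, 16),
            }
        )
    return found


if __name__ == "__main__":
    for entry in invalid_bars(KERNEL_LOG):
        print(entry)
```

A BAR reported with size 0 at address 0 suggests the firmware never assigned it, which points at BIOS/PCIe settings rather than the driver version.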

And this is shown when the installation fails via the GUI:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 535, in __run_body
    rv = await self.middleware.run_in_thread(self.method, *args)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1364, in run_in_thread
    return await self.run_in_executor(io_thread_pool_executor, method, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1361, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 65, in install
    self._install_driver(job, td, path)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/nvidia.py", line 133, in _install_driver
    subprocess.run([path, "--tmpdir", td, "-s"], capture_output=True, check=True, text=True)
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/tmp7jdwg7bo/NVIDIA-Linux-x86_64-550.135-no-compat32.run', '--tmpdir', '/root/tmp7jdwg7bo', '-s']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 533, in __run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/docker/update.py", line 106, in do_update
    await (
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 436, in wait
    raise self.exc_info[1]
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 492, in run
    raise handled
middlewared.service_exception.CallError: [EFAULT] Command /root/tmp7jdwg7bo/NVIDIA-Linux-x86_64-550.135-no-compat32.run --tmpdir /root/tmp7jdwg7bo -s failed (code 1):
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.135.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Additionally, the end of the nvidia-installer log shows:

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[  725.799302]  ? free_unref_page_prepare+0xbd/0x360
[  725.799314]  ? __count_memcg_events+0x4d/0x90
[  725.799320]  ? count_memcg_events.constprop.0+0x1a/0x30
[  725.799329]  ? handle_mm_fault+0xa2/0x370
[  725.799336]  ? do_user_addr_fault+0x21d/0x630
[  725.799343]  ? exc_page_fault+0x77/0x170
[  725.799350]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[  725.799361] RIP: 0033:0x7ff7f4a90c5b
[  725.799368] RSP: 002b:00007ffc52052ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  725.799374] RAX: ffffffffffffffda RBX: 0000000000005a23 RCX: 00007ff7f4a90c5b
[  725.799378] RDX: 00007ffc52052f60 RSI: 0000000000005a23 RDI: 0000000000000004
[  725.799381] RBP: 00007ffc52056550 R08: 00007ff7f4b66460 R09: 00007ff7f4b66460
[  725.799385] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc52052f60
[  725.799389] R13: 0000000000005a23 R14: 00007ffc52056501 R15: 00007ffc520566c8
[  725.799399]  </TASK>
[  846.236017] VFIO - User Level meta-driver version: 0.3
[  847.363343] nvidia-nvlink: Nvlink Core is being initialized, major device number 237

[  847.365925] nvidia 0000:22:00.0: enabling device (0040 -> 0042)
[  847.366030] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)
[  847.366043] nvidia: probe of 0000:22:00.0 failed with error -1
[  847.366108] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  847.366111] NVRM: None of the NVIDIA devices were initialized.
[  847.366597] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Looking at the version NVIDIA recommends for this card, they list a different version than the one downloaded, possibly because it has had more stabilization and testing:
URL(ish): nvidia dot com /en-us/drivers/details/236265/
Driver Version: 550.127.08
Release Date: Tue Nov 19, 2024

versus Linux-x86_64-550.135, which is the version the installer attempted. Its page, URL(ish): nvidia dot com /en-us/drivers/details/236036/, seems to list compatibility only with consumer-level cards rather than the datacenter line.

Is it possible to load the “datacenter” line driver sets? Or possibly have the system load the right one based on the model information from the card? It appears this driver set may also be the supported release for the P4, which others have had issues with recently.

Thanks! Happy to help with any diagnostics (I have both the A2 I am having issues with and 2x P4s).

Chassis: HP DL380 gen8
CPU: 2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 96GB
Storage: RAIDZ2 10wide 2.73TiBs
Boot: Mirror 128GB SSDs

HoneyBadger · November 21, 2024, 5:26pm

Hey @trvldsismyspiritanml

I’ve had mixed experiences with HP DL380 Gen8 systems behaving properly with PCIe/NVMe devices.

Nov 20 21:05:44 servername kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                              NVRM: BAR0 is 0M @ 0x0 (PCI:0000:22:00.0)

Have you enabled Above 4G Decoding, 64-bit BAR, and other advanced features for PCIe cards? The Tesla based cards often require these to be set up.

trvldsismyspiritanml · November 23, 2024, 5:43am

Have you enabled … advanced features for PCIe cards? The Tesla based cards often require these to be set up.

64-bit BAR
Above 4G Decoding
and other

Thanks to other internet strangers, it appears there is a “CTRL+A” “secret” menu in the BIOS for tinkering with some additional settings. I was able to get BAR enabled; however, it didn’t seem to reserve any RAM for itself. The driver installation did then complete successfully. Testing some additional settings permutations now.

I wasn’t able to find a setting for “Above 4G Decoding”. As for “other”, are there any specific settings you know of, or should I just give some tests a go?

trvldsismyspiritanml · November 23, 2024, 5:46am

NVM. Looks like the latest round of testing might’ve gotten it to be “happy”, or it just needed 2x reboots to fully “seat” itself…

I can now see:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A2                      Off |   00000000:24:00.0 Off |                    0 |
|  0%   35C    P0             19W /   60W |       1MiB /  15356MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

trvldsismyspiritanml · November 27, 2024, 10:18pm

In case anyone else needs this, here is a very quick and dirty way to enable BAR support on a DL380 Gen8 (with the latest available firmware).

Go into BIOS (F9 during boot)
Press “CTRL+A” at the main BIOS menu
Select “PCI Express 64-bit BAR support” and change from “Disabled” to “Enabled”.
Press ESC a couple of times to get back to the main menu, then F10 to save.
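
After rebooting with the setting enabled, one way to confirm the BARs actually got addresses is to read the device's `resource` file in sysfs, where each line holds a start address, end address and flags in hex. A minimal Python sketch follows, under some assumptions: `parse_resource` and `bar_size` are hypothetical helper names, the default PCI address is only a placeholder (the thread shows the card at 22:00.0 in lspci and 24:00.0 in nvidia-smi), and line N of the file is assumed to correspond to "Region N" in lspci.

```python
import sys
from pathlib import Path


def parse_resource(text):
    """Parse sysfs `resource` content into (start, end, flags) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            # Fields look like 0x00000000f6000000; int(..., 16) accepts the 0x prefix.
            rows.append(tuple(int(field, 16) for field in fields))
    return rows


def bar_size(rows, index):
    """Size in bytes of BAR `index`, or 0 if it is missing or unassigned."""
    if index >= len(rows):
        return 0
    start, end, _flags = rows[index]
    if start == 0 and end == 0:
        return 0
    return end - start + 1


if __name__ == "__main__":
    # Assumption: pass your card's PCI address as the first argument.
    address = sys.argv[1] if len(sys.argv) > 1 else "0000:24:00.0"
    path = Path("/sys/bus/pci/devices") / address / "resource"
    if path.exists():
        rows = parse_resource(path.read_text())
        for index in (0, 1, 3):
            print(f"BAR{index}: {bar_size(rows, index)} bytes")
    else:
        print(f"{path} not found; check the PCI address with lspci")
```

If BAR0 still reports 0 bytes after the BIOS change, the driver would be expected to fail with the same NVRM message as before.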