ZFS_ARC_MAX issue - out-of-memory errors in kernel with Scale 24.04.1.1

NickF1227 · December 23, 2024, 11:39pm

Specifically, the issue appears to be RAM. I would start by adding up all of the memory you have allocated to each of your Apps plus VMs. What is that sum? That should represent the worst case scenario.

Subtract from 64 and how much is left?
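The subtraction above can be sketched as a small shell helper. The function name `headroom_gib` is made up for illustration, and the sample figures (60 and 52 GiB) are the totals reported later in this thread. It only subtracts the sum of app and VM allocations from installed RAM and does not account for the OS, ARC, or GPU.

```shell
# Hypothetical helper: installed RAM minus the sum of all app/VM allocations (GiB).
headroom_gib() {
  total=$1
  shift
  allocated=0
  for a in "$@"; do
    allocated=$(( allocated + a ))
  done
  echo $(( total - allocated ))
}

headroom_gib 64 60   # worst case with 60 GiB allocated
headroom_gib 64 52   # after disabling a few apps
```

Passing each app or VM allocation as its own argument (for example `headroom_gib 64 16 8 4`) gives the same kind of worst-case figure.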

edit: Can you also share the output of lspci -vv specifically for the section of your iGPU? Also cat /var/log/messages| grep GTT

I don’t have an Intel iGPU, so your logs may look a little different. In my case, it appears the Linux kernel can dynamically allocate as much as 16GiB for my AMD iGPU… This doesn’t mean it WILL use that much, just that it CAN. If you’re using it for Frigate… yeah… that certainly can be part of your problem.

root@prod[~]# cat /var/log/messages| grep GTT
Nov  4 10:41:57 truenas kernel: [drm] amdgpu: 15741M of GTT memory ready.
Nov 11 12:08:07 prod kernel: [drm] amdgpu: 15741M of GTT memory ready.
Nov 23 21:49:12 prod kernel: [drm] amdgpu: 15741M of GTT memory ready.
Dec 20 22:01:47 prod kernel: [drm] amdgpu: 15743M of GTT memory ready.
Dec 20 22:13:12 prod kernel: [drm] amdgpu: 15743M of GTT memory ready.
root@prod[~]#

10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c9) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 168
	IOMMU group: 4
	Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at e0000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at e000 [size=256]
	Region 5: Memory at fce00000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
		Vector table: BAR=5 offset=00042000
		PBA: BAR=5 offset=00043000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable+, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable+ Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000100, Page Request Allocation: 00000020
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [400 v1] Data Link Feature <?>
	Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [440 v1] Lane Margining at the Receiver <?>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

I’m not an expert here, but I read this in my example as the GPU using a minimum of 256M+2M+512K with a maximum available of 16G.
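A rough way to add up the memory regions from the lspci output is sketched below. The helper name `sum_mem_regions_kib` is hypothetical. It only counts lines of the form "Region N: Memory at ... [size=...]" and skips I/O ports and the expansion ROM, which is one possible reading of the output rather than an authoritative one.

```shell
# Hypothetical helper: sum the "Region N: Memory at ..." sizes from lspci -vv, in KiB.
sum_mem_regions_kib() {
  awk '
    /Region [0-9]+: Memory at/ {
      if (match($0, /\[size=[0-9]+[KMG]\]/)) {
        s = substr($0, RSTART + 6, RLENGTH - 7)   # e.g. "256M"
        n = substr(s, 1, length(s) - 1) + 0       # numeric part
        u = substr(s, length(s), 1)               # unit suffix
        if (u == "M") n *= 1024
        else if (u == "G") n *= 1024 * 1024
        total += n
      }
    }
    END { print total + 0 }
  '
}

# Region lines copied from the AMD iGPU output above.
sum_mem_regions_kib <<'EOF'
Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at e0000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at e000 [size=256]
Region 5: Memory at fce00000 (32-bit, non-prefetchable) [size=512K]
EOF
```

On a live system the same function could be fed from `lspci -vv -s 10:00.0`, with the slot address adjusted to your device.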

NickF1227 · December 24, 2024, 1:06am

Doing some more research. This is pulled from the latest kernel docs and may be different depending on kernel version.

AMD defaults:

gttsize (int)

Restrict the size of GTT domain (for userspace use) in MiB for testing. The default is -1 (Use 1/2 RAM, minimum value is 3GB).

https://www.kernel.org/doc/html/v6.13-rc3/gpu/amdgpu/module-parameters.html
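Read literally, the quoted default ("Use 1/2 RAM, minimum value is 3GB") could be sketched like this. The function name `default_gtt_mib` is hypothetical, and treating 3GB as 3072 MiB is an assumption; the actual driver logic may differ by kernel version.

```shell
# Hypothetical sketch of the documented default: half of RAM, but never below 3 GB.
# Assumption: "3GB" is taken as 3072 MiB.
default_gtt_mib() {
  ram_mib=$1
  half=$(( ram_mib / 2 ))
  min=3072
  if [ "$half" -lt "$min" ]; then
    echo "$min"
  else
    echo "$half"
  fi
}

default_gtt_mib 32768   # a system with 32 GiB of RAM
default_gtt_mib 4096    # a small system where the 3 GB floor applies
```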

Intel apparently works a bit differently than AMD and doesn’t appear to have a direct tunable for this documented
https://www.kernel.org/doc/html/latest/gpu/i915.html

Instead it may be tunable in your BIOS. Still I would expect the kernel to let you know on boot how much is allocated?
https://bwidawsk.net/blog/2014/6/the-global-gtt-part-1/

Momos · December 24, 2024, 5:23am

So,

The iGPU shows like:

00:02.0 VGA compatible controller: Intel Corporation RocketLake-S GT1 [UHD Graphics 750] (rev 04) (prog-if 00 [VGA controller])
        Subsystem: ASRock Incorporation RocketLake-S GT1 [UHD Graphics 750]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 176
        IOMMU group: 0
        Region 0: Memory at 6001000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 4000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+ FLReset+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
                Address: fee00018  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Process Address Space ID (PASID)
                PASIDCap: Exec- Priv-, Max PASID Width: 14
                PASIDCtl: Enable- Exec- Priv-
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [300 v1] Page Request Interface (PRI)
                PRICtl: Enable- Reset-
                PRISta: RF- UPRGI- Stopped+
                Page Request Capacity: 00008000, Page Request Allocation: 00000000
        Kernel driver in use: i915
        Kernel modules: i915

The cat /var/log/messages | grep GTT doesn’t show any result:

root@truenas[~]# cat /var/log/messages| grep GTT
root@truenas[~]#

Total memory allocated to VMs and Apps is 60 GiB, but not much of it is used when those apps aren’t doing anything. I posted pics from the system earlier.

Modifying the memory allocated to the Frigate App does nothing, but modifying the cores has a huge impact. With 4 cores it is unusable, 6 cores works okay-ish, 8 cores is OK. Anyhow, I disabled a few apps just to be on the safe side, bringing the total allocated down to 52 GiB.
Which leads me to believe again that all HW decoding is done on the CPU,
even though it looks like it’s working.

I’m at a complete loss.

NickF1227 · December 24, 2024, 2:41pm

If you look at the blue line on your graph from Dec 12 to today, it shows a stark contrast with the period before Dec 12, where the orange line shows you were caching almost nothing at all in ARC.

Can you try cat /var/log/messages | grep i915 instead?

The performance of Frigate is dependent on CPU cores, but that does not indicate anything about the system running out of RAM, which is why it crashes.

Frigate is using both CPU and GPU resources. If you didn’t have HW Accel on, you’d need even more cores.

Momos · December 24, 2024, 2:45pm

cat /var/log/messages | grep i915 result.

root@truenas[~]# cat /var/log/messages | grep i915
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 17 16:40:58 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 17 16:40:58 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 17 16:40:58 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: vgaarb: deactivate vga console
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Can't resize LMEM BAR - platform support is missing
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x0000000010000000
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Using a reduced BAR size of 256MiB. Consider enabling 'Resizable BAR' or similar, if available in the BIOS.
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 17 16:40:58 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 17 16:40:58 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 17 16:40:59 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 17 16:40:59 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Dec 18 08:18:57 truenas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 18 08:42:37 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 18 08:42:37 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 18 08:42:37 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 0: releasing [mem 0xb0000000-0xb0ffffff 64bit]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 2: releasing [mem 0xa0000000-0xafffffff 64bit pref]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 2: assigned [mem 0x4200000000-0x43ffffffff 64bit pref]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 0: assigned [mem 0xb0000000-0xb0ffffff 64bit]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 18 08:42:37 truenas kernel: fbcon: i915drmfb (fb0) is primary device
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 18 08:42:37 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 18 08:42:37 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 18 08:42:37 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Dec 24 01:15:37 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:42!
Dec 24 01:15:39 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:44!
Dec 24 01:15:42 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:4a!
Dec 24 01:15:43 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:4c!
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 24 10:20:15 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 24 10:20:15 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 24 10:20:15 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: vgaarb: deactivate vga console
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 0: releasing [mem 0xb0000000-0xb0ffffff 64bit]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 2: releasing [mem 0xa0000000-0xafffffff 64bit pref]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 2: assigned [mem 0x4200000000-0x43ffffffff 64bit pref]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 0: assigned [mem 0xb0000000-0xb0ffffff 64bit]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 24 10:20:15 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 24 10:20:15 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 24 10:20:15 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
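The "Local memory" values in the log above are byte counts in hex. A minimal sketch for converting them to MiB, using a hypothetical helper name `hex_bytes_to_mib` and the two values logged for 0000:03:00.0:

```shell
# Hypothetical helper: convert a hex byte count (as printed by i915) to MiB.
hex_bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

hex_bytes_to_mib 0x000000017c800000   # "Local memory available"
hex_bytes_to_mib 0x0000000010000000   # "Local memory IO size" from the Dec 17 boot
```

The second value lines up with the "reduced BAR size of 256MiB" message logged on the same boot.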

NickF1227 · December 24, 2024, 2:49pm

Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M

I read that your iGPU can use 256M+16M+8192M of RAM in your system, so there’s a little less than 9GiB of RAM that is shared between the CPU and GPU that you would have to account for.
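The sum quoted above can be checked with a short helper. The name `mib_sum_to_gib` is made up for illustration; it adds MiB values and prints the total in GiB to two decimals.

```shell
# Hypothetical helper: add up MiB values and print the total in GiB.
mib_sum_to_gib() {
  total=0
  for m in "$@"; do
    total=$(( total + m ))
  done
  awk -v t="$total" 'BEGIN { printf "%.2f\n", t / 1024 }'
}

mib_sum_to_gib 256 16 8192
```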

Momos · December 24, 2024, 2:57pm

OK, so I need to add up all the memory allocated to apps and VMs and still have at least 10GiB available.
But don’t apps that aren’t doing anything use no memory? From the history it doesn’t look like my system is actually using more than 30 GiB. There should be plenty available.

By the way:
intel_gpu_top result is:

root@truenas[~]# intel_gpu_top -L                   
card1                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A5,card=0
└─renderD129            
card0                    Intel Rocketlake (Gen12)          pci:vendor=8086,device=4C8A,card=0
└─renderD128

and intel_gpu_top -d drm:/dev/dri/card0

and intel_gpu_top -d drm:/dev/dri/card1

Does this mean that I actually managed to activate the Intel Arc A380 for Frigate?
This is also something I’ve been struggling with for a while: getting the Frigate app to pick and use the Arc A380 rather than the iGPU, which should help with the memory issue too.

NickF1227 · December 24, 2024, 3:10pm

More appropriately, you have 10 GiB available for the host operating system. TrueNAS needs at least 8GiB for itself, leaving nothing for the ARC cache at all (that is, the ZFS Adaptive Replacement Cache, not related to the Arc GPU).

Since your graph shows ARC is actually using as much as 30GiB, that is why you get an oom-kill for Frigate. I believe since Frigate was consuming the most RAM, and the kernel was trying to prevent itself from crashing, it chose to kill Frigate over some other process.
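To compare the graph against live numbers, something like the sketch below may help. It assumes OpenZFS on Linux exposes ARC counters at /proc/spl/kstat/zfs/arcstats as "name type value" rows, with `size` as current ARC size and `c_max` as the configured ceiling. Treat the path, the field names, and the helper name `arc_field_gib` as assumptions to verify on your own system.

```shell
# Hypothetical helper: print one arcstats field (bytes) converted to GiB.
# Assumes rows of the form "<name> <type> <value>" on stdin.
arc_field_gib() {
  awk -v f="$1" '$1 == f { printf "%.2f\n", $3 / (1024 * 1024 * 1024) }'
}

# Only read the live file if it exists (assumed path on OpenZFS/Linux).
stats=/proc/spl/kstat/zfs/arcstats
if [ -r "$stats" ]; then
  arc_field_gib size  < "$stats"
  arc_field_gib c_max < "$stats"
fi
```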

The way this works in SCALE is both will be available for the container, and it does look like the container is not using the iGPU in that output from intel_gpu_top. So that extra 8GiB of RAM may not actually be in use.

Really the “correct” way to fix your issues is to give your system more RAM or reduce the number of services being used. However, you can artificially reduce or even disable ARC caching as a cheeky workaround to gain some stability at the cost of disk performance.

This example will limit ARC to 2GiB.
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-arc-max

echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
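The byte value in the command above is just 2 GiB written out. A minimal sketch for computing it for other sizes follows; the helper names `gib_to_bytes` and `set_arc_max_gib` are hypothetical, and the optional second argument exists only so the write can be tried against a scratch file first. A value written to sysfs this way does not survive a reboot.

```shell
# Hypothetical helpers for picking a zfs_arc_max value.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Write the limit to the module parameter (needs root), or to another
# file given as the second argument for a dry run.
set_arc_max_gib() {
  gib=$1
  target=${2:-/sys/module/zfs/parameters/zfs_arc_max}
  gib_to_bytes "$gib" > "$target"
}

gib_to_bytes 2
# set_arc_max_gib 2    # uncomment to apply on the running system
```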

Momos · December 24, 2024, 3:23pm

Awesome explanations !! It’s starting to make more sense in my head
Thank you for the patience also !

“The way this works in SCALE is both will be available for the container” - it’s weird what the solution to force the Frigate App to pick the ARC card was:
I deselected the “Passthrough available (non-NVIDIA) GPUs” checkbox and
added the renderD129 device but pointing at the renderD128 in the container. I guess the Frigate App only knows about renderD128.
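For anyone repeating this, the render node that belongs to a given card can be picked out of the `intel_gpu_top -L` listing with a small filter. This is only a sketch: the helper name `render_node_for` is made up, and it assumes the listing keeps the two-line "cardN ..." then "renderDNNN" layout shown earlier in the thread.

```shell
# Hypothetical helper: given `intel_gpu_top -L` output on stdin, print the
# render node listed under the first card whose line contains the pattern.
render_node_for() {
  awk -v pat="$1" '
    /^card[0-9]+/ { want = (index($0, pat) > 0); next }
    want && match($0, /renderD[0-9]+/) {
      print substr($0, RSTART, RLENGTH)
      exit
    }
  '
}

# Example usage on a live system:
#   intel_gpu_top -L | render_node_for Dg2
```

The node it prints for the Arc card is the host device that was mapped to renderD128 inside the container in the workaround described above.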

Took a lot of trial/error.

Now, to solve all the memory issues I’ll pick up another 64GiB to install.

And there is only one more thing to solve: wait for the Scale EE release with the drivers for the Dual TPU Coral PCIe so I can put those to work also.

root@truenas[~]# lspci | grep TPU                   
09:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
0a:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU

NickF1227 · December 24, 2024, 3:25pm

The only thing I’m not clear on is what changed on December 12?

Momos · December 24, 2024, 3:36pm

I think that was the day I installed the Ollama and Openweb-ui Apps.
They didn’t work. I could not make them work with the A380, so I kinda abandoned them. But they were still active, not doing anything. Now I have disabled those apps.
I really thought an app that does not do anything doesn’t eat up any resources…

I was on TrueNAS Core until 4 weeks ago, upgraded to Scale 24.10, and obviously a new world opened.

sfatula · December 26, 2024, 11:36pm

Still wish Swap came back for reasons like this. No need to crash machines…

winnielinnie · December 26, 2024, 11:52pm

A wise man once said,

Momos · December 27, 2024, 10:17am

Happy Holidays !!!

Reporting back: no more OOM since my last post. Many thanks to all.

Meanwhile Santa came with some RAM also, so sitting on 128GiB now. Should cover all the needs, keeping apps, VMs and ZFS happy.

kris · December 27, 2024, 4:07pm

By eliminating swap we’ve seen a whole host of esoteric failure reports go away. The issue was that swap didn’t stop crashes; the system still crashed, just in a wide variety of harder-to-diagnose ways, i.e. services would “hang”, processes would crash, performance would stall, etc. Now when you get an OOM kill it’s very simple to troubleshoot and rectify: either a blatant memory leak (easy to find) or the system is just oversubscribed, which is also easy to fix.

winnielinnie · December 27, 2024, 8:08pm

Curious. Did all (or nearly all) of such cases involve TrueNAS hosting VMs?

If VMs are taken out of the picture, do the OOM problems disappear?