ZFS_ARC_MAX issue - out-of-memory errors in kernel with Scale 24.04.1.1

Specifically, the issue appears to be RAM. I would start by adding up all of the memory you have allocated to each of your Apps plus VMs. What is that sum? That should represent the worst case scenario.

Subtract from 64 and how much is left?
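
As a rough sketch of that arithmetic (the function name and the per-app figures in the example are placeholders, not values from this system), something like this adds up the limits and shows what is left:

```shell
# Sum per-app/VM memory limits (in GiB) and print what remains of total RAM.
# Usage: headroom_gib TOTAL_GIB LIMIT_GIB [LIMIT_GIB ...]
headroom_gib() {
    total=$1
    shift
    allocated=0
    for limit in "$@"; do
        allocated=$((allocated + limit))
    done
    echo $((total - allocated))
}
```

For example, `headroom_gib 64 16 8 4` prints `36`. Whatever is left over has to cover TrueNAS itself plus ARC.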

edit: Can you also share the output of lspci -vv specifically for the section of your iGPU? Also cat /var/log/messages| grep GTT

I don't have an Intel iGPU, so your logs may look a little different. In my case, it appears the Linux kernel is dynamically allocating as much as 16GiB for my AMD iGPU... This doesn't mean it WILL use that much, just that it CAN. If you're using it for Frigate... yeah... that certainly can be part of your problem.

root@prod[~]# cat /var/log/messages| grep GTT
Nov  4 10:41:57 truenas kernel: [drm] amdgpu: 15741M of GTT memory ready.
Nov 11 12:08:07 prod kernel: [drm] amdgpu: 15741M of GTT memory ready.
Nov 23 21:49:12 prod kernel: [drm] amdgpu: 15741M of GTT memory ready.
Dec 20 22:01:47 prod kernel: [drm] amdgpu: 15743M of GTT memory ready.
Dec 20 22:13:12 prod kernel: [drm] amdgpu: 15743M of GTT memory ready.
root@prod[~]# 

10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c9) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd Cezanne [Radeon Vega Series / Radeon Vega Mobile Series]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 168
	IOMMU group: 4
	Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at e0000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at e000 [size=256]
	Region 5: Memory at fce00000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
		Vector table: BAR=5 offset=00042000
		PBA: BAR=5 offset=00043000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable+, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable+ Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000100, Page Request Allocation: 00000020
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [400 v1] Data Link Feature <?>
	Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [440 v1] Lane Margining at the Receiver <?>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

I'm not an expert here, but I read this in my example as the GPU using a minimum of 256M+2M+512K with a maximum available of 16G.


Doing some more research. This is pulled from the latest kernel docs, so it may be different depending on your kernel version.

AMD defaults:

gttsize (int)

Restrict the size of GTT domain (for userspace use) in MiB for testing. The default is -1 (Use 1/2 RAM, minimum value is 3GB).

https://www.kernel.org/doc/html/v6.13-rc3/gpu/amdgpu/module-parameters.html
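
Reading that doc text literally, the default could be sketched like this. It only illustrates the quoted sentence (`default_gtt_mib` is a made-up helper), and the actual driver logic may differ between kernel versions:

```shell
# Default GTT size in MiB as described by the quoted doc text:
# half of system RAM, but never below 3 GiB (3072 MiB).
# Usage: default_gtt_mib RAM_MIB
default_gtt_mib() {
    ram_mib=$1
    half=$((ram_mib / 2))
    if [ "$half" -lt 3072 ]; then
        echo 3072
    else
        echo "$half"
    fi
}
```

On an assumed 32 GiB machine, `default_gtt_mib 32768` prints `16384`, which is in the same ballpark as the "GTT memory ready" lines in my log.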

Intel apparently works a bit differently than AMD and doesn't appear to have a documented tunable for this.
https://www.kernel.org/doc/html/latest/gpu/i915.html

Instead it may be tunable in your BIOS. Still I would expect the kernel to let you know on boot how much is allocated?
https://bwidawsk.net/blog/2014/6/the-global-gtt-part-1/

So,

The iGPU shows like:

00:02.0 VGA compatible controller: Intel Corporation RocketLake-S GT1 [UHD Graphics 750] (rev 04) (prog-if 00 [VGA controller])
        Subsystem: ASRock Incorporation RocketLake-S GT1 [UHD Graphics 750]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 176
        IOMMU group: 0
        Region 0: Memory at 6001000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 4000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+ FLReset+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
                Address: fee00018  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Process Address Space ID (PASID)
                PASIDCap: Exec- Priv-, Max PASID Width: 14
                PASIDCtl: Enable- Exec- Priv-
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [300 v1] Page Request Interface (PRI)
                PRICtl: Enable- Reset-
                PRISta: RF- UPRGI- Stopped+
                Page Request Capacity: 00008000, Page Request Allocation: 00000000
        Kernel driver in use: i915
        Kernel modules: i915

The cat /var/log/messages | grep GTT doesn't show any result:

root@truenas[~]# cat /var/log/messages| grep GTT
root@truenas[~]# 

Total memory allocated to VMs and Apps is 60 GiB, but not much of it is used when those apps aren't doing anything - I posted pics from the system earlier.

Modifying the memory allocated to the Frigate App does nothing, but modifying the cores has a huge impact. With 4 cores it is unusable, 6 cores works okay-ish, 8 cores is OK. Anyhow, I disabled a few apps just to be on the safe side, making the total allocated 52 GiB.
Which leads me to believe again that all HW decoding is done on the CPU.
Even though it looks like it's working.

I'm at a complete loss :frowning:

If you look at the blue line on your graph from Dec 12 to today, it pretty clearly shows a stark contrast with the period before Dec 12, where the orange line shows you were caching almost nothing at all in ARC.

Can you try cat /var/log/messages | grep i915 instead?

The performance of Frigate is dependent on CPU cores, but that does not indicate anything about the system running out of RAM, which is why it crashes.

Frigate is using both CPU and GPU resources. If you didn't have HW Accel on, you'd need even more cores.

cat /var/log/messages | grep i915 result.

root@truenas[~]# cat /var/log/messages | grep i915
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 17 16:40:58 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 17 16:40:58 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 17 16:40:58 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 17 16:40:58 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: vgaarb: deactivate vga console
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Can't resize LMEM BAR - platform support is missing
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x0000000010000000
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Using a reduced BAR size of 256MiB. Consider enabling 'Resizable BAR' or similar, if available in the BIOS.
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 17 16:40:58 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 17 16:40:58 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 17 16:40:58 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 17 16:40:59 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 17 16:40:59 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Dec 18 08:18:57 truenas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: vgaarb: deactivate vga console
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 18 08:42:37 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 18 08:42:37 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 18 08:42:37 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 0: releasing [mem 0xb0000000-0xb0ffffff 64bit]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 2: releasing [mem 0xa0000000-0xafffffff 64bit pref]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 2: assigned [mem 0x4200000000-0x43ffffffff 64bit pref]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: BAR 0: assigned [mem 0xb0000000-0xb0ffffff 64bit]
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 18 08:42:37 truenas kernel: fbcon: i915drmfb (fb0) is primary device
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 18 08:42:37 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 18 08:42:37 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 18 08:42:37 truenas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 18 08:42:37 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 18 08:42:37 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 18 08:42:37 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Dec 24 01:15:37 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:42!
Dec 24 01:15:39 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:44!
Dec 24 01:15:42 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:4a!
Dec 24 01:15:43 truenas kernel: Fence expiration time out i915-0000:03:00.0:ffmpeg[2761156]:4c!
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
Dec 24 10:20:15 truenas kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Dec 24 10:20:15 truenas kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 24 10:20:15 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/rkl_dmc_ver2_03.bin (v2.3)
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:00:02.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] VT-d active for gfx access
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: vgaarb: deactivate vga console
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 0: releasing [mem 0xb0000000-0xb0ffffff 64bit]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 2: releasing [mem 0xa0000000-0xafffffff 64bit pref]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 2: assigned [mem 0x4200000000-0x43ffffffff 64bit pref]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: BAR 0: assigned [mem 0xb0000000-0xb0ffffff 64bit]
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
Dec 24 10:20:15 truenas kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 1
Dec 24 10:20:15 truenas kernel: snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
Dec 24 10:20:15 truenas kernel: mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670008 00000000 00000000 E0020002 00000000
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
Dec 24 10:20:15 truenas kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Dec 24 10:20:15 truenas kernel: i915 0000:03:00.0: [drm] BAR2 resized to 8192M

I read that your iGPU can use 256M+16M+8192M of RAM in your system, so there's a little less than 9GiB of RAM that is shared between the CPU and GPU that you would have to account for.
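
For reference, the sum works out like this (a small sketch; `sum_mib` and `hex_bytes_to_mib` are just illustrative helper names):

```shell
# Add up sizes given in MiB and print the total in MiB.
sum_mib() {
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    echo "$total"
}

# Convert a hex byte count (as printed in the i915 log lines) to MiB.
hex_bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}
```

`sum_mib 256 16 8192` prints `8464` (about 8.3 GiB), and `hex_bytes_to_mib 0x000000017c800000` prints `6088` for the "Local memory available" value. Worth noting from the log above: the 8192M BAR2 resize and the local memory lines are reported for 0000:03:00.0, while the 256M and 16M regions come from the lspci output for 00:02.0.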

OK. So I need to add up all the memory allocated to apps and VMs and still have at least 10GiB available.
But aren't apps that are not doing anything not using any memory? From the history it doesn't look like my system is actually using more than 30 GiB. There should be plenty available.

By the way:
intel_top result is:

root@truenas[~]# intel_gpu_top -L
card1                    Intel Dg2 (Gen12)                 pci:vendor=8086,device=56A5,card=0
└─renderD129
card0                    Intel Rocketlake (Gen12)          pci:vendor=8086,device=4C8A,card=0
└─renderD128

and intel_gpu_top -d drm:/dev/dri/card0

and intel_gpu_top -d drm:/dev/dri/card1

Does this mean that I actually managed to activate the Intel Arc A380 for Frigate?
This was also something that I've been struggling with for a while. Making the Frigate app pick and use the Arc A380, not the iGPU, should help with the memory issue also.

More accurately, you have 10 GiB available for the host operating system. TrueNAS needs at least 8GiB for itself, leaving next to nothing for ARC (that's the ZFS Adaptive Replacement Cache, not related to the Arc GPU).

Since your graph shows ARC is actually using as much as 30GiB, that is why you get an oom-kill for Frigate. I believe since Frigate was consuming the most RAM, and the kernel was trying to prevent itself from crashing, it chose to kill Frigate over some other process.
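
To confirm which process the kernel picked, the OOM lines can be pulled out of the same log file used earlier. A minimal sketch (the `oom_events` name is arbitrary, and the exact wording of kernel OOM messages can vary by kernel version, so the pattern is deliberately loose):

```shell
# Filter kernel OOM-killer lines out of a log stream read from stdin.
# Matches case-insensitively on the usual "Out of memory" / "oom-kill" wording.
oom_events() {
    grep -Ei 'out of memory|oom-kill'
}
```

Usage would be something like `oom_events < /var/log/messages`. If it prints nothing, no matching lines were found in that file.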

The way this works in SCALE is both will be available for the container, and it does look like the container is not using the iGPU in that output from intel_gpu_top. So that extra 8GiB of RAM may not be being utilized.

Really, the "correct" way to fix your issues is to give your system more RAM or reduce the number of services being used. However, you can artificially reduce or even disable ARC caching as a cheeky workaround to gain some stability at the cost of disk performance.

This example will limit ARC to 2GiB.
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-arc-max

echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
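
The byte value is just 2 GiB written out. A small sketch for computing other limits (`gib_to_bytes` is an illustrative helper, not a ZFS tool):

```shell
# Convert a whole number of GiB to bytes, the unit zfs_arc_max expects.
# Usage: gib_to_bytes GIB
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

`gib_to_bytes 2` prints `2147483648`. As root, `echo "$(gib_to_bytes 4)" > /sys/module/zfs/parameters/zfs_arc_max` would set a 4 GiB cap instead, and `cat /sys/module/zfs/parameters/zfs_arc_max` reads the current value back. A value written this way is a runtime setting and does not survive a reboot.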

Awesome explanations!! It's starting to make more sense in my head :slight_smile:
Thank you for the patience also!

"The way this works in SCALE is both will be available for the container" - the solution to force the Frigate App to pick the Arc card turned out to be a weird one:
I deselected the "Passthrough available (non-NVIDIA) GPUs" checkbox and
added the renderD129 device, but mapped it to renderD128 inside the container. I guess the Frigate App only knows about renderD128.


Took a lot of trial/error.

Now, to solve all the memory issues I'll pick up another 64GiB to install :slight_smile:

And there is only one more thing to solve: wait for the Scale EE release with the drivers for the Dual TPU Coral PCIe so I can put those to work also.

root@truenas[~]# lspci | grep TPU                   
09:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
0a:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU

The only thing I'm not clear on is what changed on December 12?

I think that was the day I installed the Ollama and Openweb-ui Apps :upside_down_face:
They didn't work - I could not make them work with the A380, so I kinda abandoned them. But they were still active, not doing anything. Now I have disabled those apps.
I really thought an app that does not do anything doesn't eat up any resources...

I was on Truenas Core until 4 weeks ago, upgraded to Scale 24.10 and obviously a new world opened :yum:

Still wish Swap came back for reasons like this. No need to crash machines...

A wise man once said,


Happy Holidays !!!

Reporting back: no more oom since my last post :slight_smile: Many thanks to all :star_struck:

Meanwhile Santa came with some RAM also, so I'm sitting on 128GiB now. That should cover all the needs, keeping apps, VMs and ZFS happy.


By eliminating swap we've seen a whole host of esoteric failure reports go away. The issue was that swap didn't actually stop crashes; the system still crashed, just in a wide variety of harder-to-diagnose ways, e.g. services would "hang", processes would crash, performance would stall, etc. Now when you get an OOM kill it's very simple to troubleshoot and rectify: either a blatant memory leak (easy to find) or the system is just oversubscribed, which is also easy to fix.


Curious. Did all (or nearly all) of such cases involve TrueNAS hosting VMs?

If VMs are taken out of the picture, do the OOM problems disappear?