Repeated DMAR: errors since upgrade to 25.04.1

I just updated from 24.10.2.2 to 25.04.1 and immediately noticed a constant stream of messages in the console:

May 29 12:20:04 eurybia kernel: dmar_fault: 13029 callbacks suppressed
May 29 12:20:09 eurybia kernel: dmar_fault: 13068 callbacks suppressed
May 29 12:20:14 eurybia kernel: dmar_fault: 13038 callbacks suppressed

Looking at dmesg output gives a little more detail:

[ 3640.688203] dmar_fault: 13206 callbacks suppressed
[ 3640.688213] DMAR: DRHD: handling fault status reg 702
[ 3640.709473] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.732456] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.756838] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.783072] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.810959] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.840416] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.871848] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.905073] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 3640.939412] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear

I’m also seeing an occasional burst of DRHD messages:

[ 4301.214471] DMAR: DRHD: handling fault status reg 102
[ 4301.214476] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 4301.215562] DMAR: DRHD: handling fault status reg 202
[ 4301.215571] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 4301.216639] DMAR: DRHD: handling fault status reg 302
[ 4301.216647] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear
[ 4301.217716] DMAR: DRHD: handling fault status reg 402

Looking at lspci tells me that device [00:1e.0] is the PCI bridge:

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
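To check that every fault really comes from the same device and address, the log lines can be tallied with a short script. This is only a rough sketch: the regular expression is inferred from the lines quoted in this post, not from any kernel documentation, and `summarize_faults` is a name made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the DMAR fault lines quoted in this thread (an assumption,
# not an official format description).
FAULT_RE = re.compile(
    r"DMAR: \[(?P<op>[^\]]+)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-f]+) \[fault reason (?P<reason>0x[0-9a-f]+)\]"
)

def summarize_faults(lines):
    """Count fault lines per (device, fault address, fault reason)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match["dev"], match["addr"], match["reason"])] += 1
    return counts

# Lines copied from the dmesg output above.
sample = [
    "[ 3640.688213] DMAR: DRHD: handling fault status reg 702",
    "[ 3640.709473] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear",
    "[ 3640.732456] DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x6048000 [fault reason 0x02] Present bit in context entry is clear",
]

for key, count in summarize_faults(sample).items():
    print(key, count)
```

Feeding it the full dmesg output (for example via `sys.stdin`) would show whether any device other than 00:1e.0 is involved.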

This behaviour is definitely new as of today’s upgrade to 25.04.1. The server feels a bit sluggish, and I can’t edit the properties of any Apps: clicking the Edit button results in a permanently spinning “Please wait” message. The sheer volume of errors is also concerning.

Hardware details are in my sig, but it’s basically an old HP DL380 G6, with the latest available BIOS applied (P62 05/21/2018). There is no GPU installed, and only a Dell PERC H310 controller and an HP SAS expander in the PCI cage. I am running no VMs/Instances.

I’m guessing this was introduced by the new kernel version in this release, and I have found a few references online that link these messages to Linux IOMMU support. I suspect the inability to edit Apps is due to it trying to poll the status of GPU passthrough? :thinking:

I’ve found a few threads suggesting that adding “intel_iommu=off” to the kernel parameters solves this, but I don’t want to try a solution I don’t understand before running it past the forum here.

Any comments/suggestions/dire warnings?

UPDATE:

I decided to try rebooting the server, and it’s been shutting down for about 15 minutes now, still spitting out the DMAR error messages on the console every 5 seconds.

Every now and again it tells me that it’s failed to unmount a load of file systems, and that watchdog failed to stop. I think it’s quite poorly :fearful:

Edit: I power cycled it and added the ‘intel_iommu=off’ kernel parameter. The DMAR error messages are no longer being generated, and I can edit Apps again.
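After rebooting, one way to confirm the parameter actually took effect is to look at the kernel command line, which Linux exposes in /proc/cmdline. A minimal sketch, assuming a Linux host; the function names are made up for illustration:

```python
def intel_iommu_setting(cmdline):
    """Return the last intel_iommu= value on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            # Keep the last occurrence if the parameter is repeated.
            value = token.split("=", 1)[1]
    return value

def current_setting(path="/proc/cmdline"):
    """Read the running kernel's command line and extract intel_iommu."""
    with open(path) as f:
        return intel_iommu_setting(f.read())
```

Calling `current_setting()` on the server should return `'off'` once the parameter is in place, and `None` if it was never applied.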

The immediate problems of the error messages, system sluggishness and inability to edit Apps seem to be resolved, but I’m unclear as to what the larger impact of disabling IOMMU may be. My research suggests it’s to do with Intel VT-d tech and virtualisation, so in my case I shouldn’t be affected, but it could bite me in the future.

I’m a little concerned that this may come back with another upgrade, so I am tempted to disable it in the BIOS now.

This has been a huge pain in the butt for me as well; it might affect all HP G6 hardware.

I disabled Intel Virtualization in the BIOS, which made the system boot cleanly. No more DMAR messages.

yup, definitely the virtualization, disabled mine, messages are gone

+1 for the same issue on an old HP G6 server.

In the BIOS (press F9) under “System configuration” >> “Processor” you need to disable:

  • Intel Virtualisation Technology
  • Intel VT-d

Normal bootup after this change. My installed apps seem to work just fine.

Thanks, it works for me with the same HP G6 server.

Ug! We just had the exact same problem. HP ML350 G6. In our case, after applying the 25.04.2.6 update, it would just keep looping batches of these:

dmar_fault: DRHD: Handing fault status reg 302
DMAR: [DMA Read NO_PASID] Request device [00:1e.0] fault addr 0x5081000 [fault reason 0x02] Present bit in context entry is clear

and then some errors about nfs-mountd.service, etc. Machine would ping. Even after 20 minutes, it would not finish boot, no web services, no NFS, no ssh, no multi-console. So I could do nothing and could not login anywhere. Searched and found this thread (THANK YOU).

Could not properly shut down the machine either, so I gave up and used the hard power button. Made the changes in the BIOS and it booted just fine the next time.

Does this mean this machine will never work with Virtualization again?

Firstly, I’m really pleased this thread continues to be of use to people with older hardware!

Secondly, no, your machine can be used for virtualisation just fine with those extensions disabled, you’ll just lose some of the fancy features designed to make it a bit more efficient on those CPUs.

We are more glad that people take the time to post like this. I was in a real panic!

With those two processor extensions/features disabled, under Virtual Machines there are no options anymore. It just says “Virtualization is not supported.”
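For anyone comparing notes, a rough way to see what the kernel itself reports is to check the CPU flags and whether /dev/kvm exists. This sketch assumes the VM feature relies on KVM, which nobody in this thread has confirmed; the function names are made up for illustration. Depending on the kernel version, the `vmx` flag may or may not still be listed when the BIOS has the feature disabled, so treat the result as a hint only.

```python
import os

def virt_flags(cpuinfo_text):
    """Return the hardware virtualization flags (vmx/svm) found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # Only the plain "flags" lines; newer kernels also print a separate "vmx flags" line.
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            found |= {"vmx", "svm"} & set(flags.split())
    return found

def kvm_device_present(path="/dev/kvm"):
    """True if the KVM device node exists (assumes a KVM-based hypervisor)."""
    return os.path.exists(path)

def report(cpuinfo_path="/proc/cpuinfo"):
    """Print what the running system reports; intended for a Linux host."""
    with open(cpuinfo_path) as f:
        print("virtualization flags:", sorted(virt_flags(f.read())))
    print("/dev/kvm present:", kvm_device_present())
```

Running `report()` on both servers might help explain why one shows “Virtualization is not supported” and the other does not.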

(Tried before and now to post screenshots and forum refuses, oh well).

That’s odd. I’m on 25.04.2.5, and (with the BIOS features disabled) I don’t see that message on my server. I have no virtual machines set up, but I see “No records have been added yet”.

Clicking “Add” presents me with a panel to enter machine settings.

(By the way: I think you have to reach a certain user level on the forums before you can post images)

This is on 25.04.2.6. But no way to add a virtual machine now. :frowning: