On Monday, 23 June 2025 09:55:46 CEST Jan Beulich wrote:
> On 21.06.2025 16:39, J. Roeleveld wrote:
> > I managed to get past the kernel panic (sort of) by doing the following:
> > 
> > 1) Ensure system is fully OFF before booting. A reset/reboot will cause
> > these errors.
> > 
> > 2) Fix the BIOS config to ensure the PCI-ports are split correctly. If
> > anyone has a Supermicro board and gets errors about PCI-slots not getting
> > full speed let me know.
> > 
> > Not entirely convinced the 2nd was part of the cause, but that's ok.
> > 
> > I now, however, get a new error message in the Domain0 dmesg:
> > pciback <pci-address>: xen_map irq failed -28 for <domid> domain
> > pciback <pci-address>: error enabling MSI-X for guest <domid>: err -28!
> > 
> > For the NVMe devices, I get these twice, with the 2nd time complaining
> > about MSI (without the -X)
> > 
> > I feel there is something missing in my kernel-config and/or domain
> > config.
> > If anyone can point me at what needs to be enabled/disabled or suggestions
> > on what I can try?
> 
> The default number of extra IRQs the guest may (have) set up may be too
> small. You may need to make use of Xen's extra_guest_irqs= command line
> option.

I spent the entire weekend searching for possible causes/hints/things to try.
That setting was one I had found some time ago (I think for MSI/MSI-X issues) 
and it's currently set to:
extra_guest_irqs=768,1024

Does it make sense to increase this further?
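(For reference: the -28 in the pciback messages is -ENOSPC, i.e. no room left for the guest's interrupts, which is why extra_guest_irqs is the usual suspect. A rough, purely illustrative back-of-envelope of how many MSI-X vectors the passed-through devices might ask for; the per-device counts below are assumptions, not values read from this hardware:)

```python
# Back-of-envelope: worst-case MSI-X vectors the passed-through devices
# could request. Per-device counts are ASSUMPTIONS for illustration;
# the real maximum is shown by `lspci -vv` ("MSI-X: ... Count=N").
devices = {
    "83:00.0 (nvme)": 65,  # e.g. 64 I/O queues + 1 admin queue
    "84:00.0 (nvme)": 65,
    "85:00.0 (nvme)": 65,
    "86:00.0 (nvme)": 65,
}
total = sum(devices.values())
print(f"worst-case vectors needed by one guest: {total}")
```

If the sum across all running guests gets anywhere near the extra_guest_irqs values, raising them further would be plausible.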

# For completeness, the Xen commandline is:
dom0_mem=24576M,max:24576M dom0_max_vcpus=4 dom0_vcpus_pin 
gnttab_max_frames=512 sched=credit console=vga extra_guest_irqs=768,1024 
iommu=verbose

# The kernel commandline is:
kernel=gentoo-6.12.21.efi dozfs root=ZFS=zhost/host/root by=id elevator=noop 
logo.nologo triggers=zfs quiet refresh softlevel=prexen nomodeset 
nfs.callback_tcpport=32764 lockd.nlm_udpport=32768 lockd.nlm_tcpport=32768 
xen-pciback.hide=(83:00.0)(84:00.0)(85:00.0)(86:00.0)
xen-pciback.passthrough=1
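To rule out a binding problem, it may be worth confirming the hide list actually took effect: `xl pci-assignable-list` in dom0 should print all four BDFs. A small sketch that checks the driver binding directly via sysfs (the driver_of helper and the SYSFS override are mine, added so the snippet can be exercised outside dom0):

```shell
# Report which driver each hidden PCI device is bound to; after a
# successful xen-pciback.hide they should all show "pciback".
# SYSFS defaults to the real sysfs tree but can be overridden.
SYSFS="${SYSFS:-/sys}"

driver_of() {
    # Print the driver name a PCI device is bound to (empty if none).
    link=$(readlink "$SYSFS/bus/pci/devices/$1/driver" 2>/dev/null)
    printf '%s\n' "${link##*/}"
}

for dev in 0000:83:00.0 0000:84:00.0 0000:85:00.0 0000:86:00.0; do
    echo "$dev -> $(driver_of "$dev")"
done
```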

If there is anything I am missing or should be doing differently, please let me 
know.

As said, I spent the entire weekend searching with Google and DuckDuckGo (DDG 
seems to return more relevant results, but the few hits with similar error 
messages are all more than 6 years old). Here is what I found out so far:

== 1) NVMe errors in dmesg:

I noticed that, even when working, the NVMe devices show MSI-X issues 
(only showing 1 of the 2; the other has the same messages):
[    7.742006] nvme nvme0: pci function 0000:84:00.0
[    7.742158] nvme 0000:84:00.0: Xen PCI mapped GSI56 to IRQ59
[    7.752907] nvme nvme0: D3 entry latency set to 8 seconds
[    8.003806] nvme nvme0: allocated 64 MiB host memory buffer.
[    8.038746] nvme 0000:84:00.0: enable msix get err ffffff8e
[    8.038756] nvme 0000:84:00.0: Xen PCI frontend error: -114!
[    8.048849] nvme nvme0: 1/0/0 default/read/poll queues
[    8.106017]  nvme0n1: p1 p2 p3 p4

I have been unable to find what "enable msix get err ffffff8e" and "Xen PCI 
frontend error: -114!" actually mean.
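(In case it helps anyone who lands here: the hex value is just a negative errno printed as an unsigned 32-bit number. 0xffffff8e is -114, which on Linux is EALREADY, "operation already in progress"; likewise the -28 above is ENOSPC. A quick way to decode such values; the helper name is mine:)

```python
import ctypes
import errno
import os

def decode_kernel_err(raw_hex: str) -> str:
    """Interpret a kernel-log hex value as a signed 32-bit errno."""
    val = ctypes.c_int32(int(raw_hex, 16)).value
    name = errno.errorcode.get(-val, "?")
    return f"{val} ({name}: {os.strerror(-val)})"

print(decode_kernel_err("ffffff8e"))  # -114 (EALREADY: ...)
print(decode_kernel_err("ffffffe4"))  # -28 (ENOSPC: ...)
```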

These messages show up on a "working" environment. This is with kernel 6.12.21 
and Xen 4.18.4_pre1

== 2) A BIOS setting whose default differs from Supermicro's general 
recommendation:
- The setting "MMIO High Size" is set to 256GB; the recommendation I see is to 
set it to 1024GB.
Note: The server has 368 GB of RAM, so that makes sense. I will change this 
setting at my next chance to do further testing.

== 3) IOMMU seems enabled, apart from 2 items:
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d Snoop Control enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled. <---
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.  <---
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled

The 2 marked lines say "not enabled". If I understand the various 
documentation correctly, this is not an issue; please let me know if I am 
mistaken.

== 4) "nr_irqs" (and this is making me wonder if the "extra_guest_irqs" is 
actually used

In the dmesg on the host I see:
[    2.328651] NR_IRQS: 8448, nr_irqs: 1024, preallocated irqs: 16

On the VM/Domain I see:
[    3.673555] NR_IRQS: 4352, nr_irqs: 80, preallocated irqs: 0

The number on the host matches.
The number in the Domain does not.

The specific domain is always the 2nd that is started.


