On 2025-02-28 04:36, Roger Pau Monné wrote:
On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
On 2025-02-27 05:23, Roger Pau Monné wrote:
On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
To work around this, we can, for per-device IRTs, program the hardware
to use the guest data & associated IRTE. The address doesn't matter
since the IRTE handles that, and the Xen address & vector can be used as
expected.
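Roughly, the idea looks like this (illustrative sketch only; the helper names and the 0xfee00000 constant are placeholders, not taken from the actual patch):

static void quirk_program_device_msi(struct pci_dev *pdev,
                                     const struct msi_desc *msi_desc)
{
    /*
     * With a per-device IRT on AMD-Vi the data/address programmed into
     * the device only select an IRTE, so the guest's MSI data can be
     * written verbatim (guest vector == IRT index).
     */
    uint32_t dev_data = msi_desc->gvec;
    uint64_t dev_addr = 0xfee00000;      /* ignored: the IRTE supplies the
                                            real destination */

    write_device_msi_data(pdev, dev_data);   /* hypothetical helpers */
    write_device_msi_addr(pdev, dev_addr);

    /*
     * The IRTE indexed by the guest vector still holds Xen's vector and
     * destination, so the interrupt is delivered where Xen expects.
     */
}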
All this works on AMD because when interrupt remapping is enabled all
MSIs are handled by the remapping table, while on Intel there's still
a bit in the MSI address field to signal whether the MSI is using a
remapping entry, or is using the "compatibility" format (iow: no
remapping).
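For reference, a minimal sketch of that address-format check as I read the VT-d remappable MSI layout (not code from Xen):

/*
 * Bit 4 of the MSI address is the Interrupt Format bit on VT-d:
 * 1 = remappable format (bits 19:5 and bit 2 carry the IRTE handle),
 * 0 = "compatibility" format, i.e. no remapping (subject to how
 * compatibility-format interrupts are configured globally).
 */
#define MSI_ADDR_INT_FORMAT_REMAPPABLE (1u << 4)

static bool msi_addr_is_remappable(uint32_t addr_lo)
{
    return addr_lo & MSI_ADDR_INT_FORMAT_REMAPPABLE;
}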
So, on Intel, if the guest hands the device the MSI address, it can decide
to bypass remapping?
Thanks for providing insight into the Intel inner workings. That's why I am
asking.
Yes, sorry, I'm afraid I don't have any good solution for Intel, at
least not anything similar to what you propose to do on AMD-Vi. I
guess we could take a partial solution for AMD-Vi only, but it's
sub-optimal from Xen's perspective to have a piece of hardware working
fine on AMD and not on Intel.
I only need AMD to work ;)
But yeah, I thought I should make an effort to get both working.
For vPCI, the guest MSI data is available at the time of initial MSI
setup, but that is not the case for HVM. With HVM, the initial MSI
setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
only available later when XEN_DOMCTL_bind_pt_irq is called. In that
case, we need to tear down the old IRTE and create a new one. This later location
can also handle vPCI.
Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
call sites. Use msi_desc->gvec to pass through the desired value.
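A rough sketch of that plumbing (the wrapper body and the msi_desc lookup are approximations of the description above, not the actual patch):

int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq, int will_share,
                         uint8_t gvec)
{
    struct msi_desc *msi_desc = pirq_to_msi_desc(pirq);  /* hypothetical lookup */

    if ( msi_desc )
        msi_desc->gvec = gvec;   /* consumed later when the IRTE is set up */

    return pirq_guest_bind(v, pirq, will_share);
}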
So basically the solution is to use the guest-selected MSI vector as
the interrupt remapping table index, as then the guest can use the MSI
data and address fields without requiring Xen translation.
What about the guest using the same vector across multiple vCPUs? So
MSI entries having the same vector field, but different target
destination CPUs? That won't work correctly as all those MSIs will
attempt to use the same IRTE?
I think you will also need to add some extra checks to ensure that
when this quirk is active the guest will always set APIC ID 0 as the
interrupt destination for all MSI entries for the affected device, so
that there cannot be vector overlap between CPUs. Otherwise the quirk
won't work as expected.
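Something along these lines, I guess (sketch only; the per-device flag and the error value are made up):

/* Reject bindings that could alias an IRTE when gvec is the IRT index. */
static int quirk_check_msi_dest(const struct pci_dev *pdev,
                                uint8_t gvec, uint8_t dest_apic_id)
{
    if ( !pdev->gvec_as_irte_idx )          /* hypothetical quirk flag */
        return 0;

    /*
     * The IRT index is the guest vector, so the same vector on two
     * different destination CPUs would collide on one IRTE.  Restrict
     * the destination to APIC ID 0 to keep the vector space flat.
     */
    if ( dest_apic_id != 0 )
        return -EOPNOTSUPP;

    return 0;
}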
Ok.
e.g. Replace amd_iommu_perdev_intremap with something generic.
The ath11k device supports and tries to enable 32 MSIs. Linux in PVH
dom0 and HVM domU fails to enable 32 and falls back to just 1, so that is
all that has been tested.
DYK why it fails to enable 32?
Not exactly - someone else had the card. msi_capability_init() failed. If
it ends up in arch_setup_msi_irqs(), only 1 MSI is supported. But precisely
where the multiple-nvec request was denied was not tracked down.
Does it also fail on native? I'm mostly asking because it would be
good to get to the bottom of this, so that we don't come up with a
partial solution that will break if multi-msi is used later in Linux.
My understanding is native and PV dom0 work with 32, and it's Linux
deciding not to use multiple MSI.
It might be this:
static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
        int irq, pirq;
        struct msi_desc *msidesc;
        struct msi_msg msg;

        if (type == PCI_CAP_ID_MSI && nvec > 1)
                return 1;
I'll have to look into this more.
+        new_remap_index = msi_desc->gvec;
+    }
+
+    if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
+         msi_desc->remap_index != -1 )
+    {
+        /* Clear any existing entries */
+        update_intremap_entry_from_msi_msg(iommu, bdf, nr,
+                                           &msi_desc->remap_index,
+                                           NULL, NULL);
Why do you need to clear any entries? This will cause a window where
MSI entries targeting these IRTEs generate faults because the entries
are not set up.
Just re-use them: update_intremap_entry_from_msi_msg() will update the
IRTE atomically, so that there's no window where the entries would be
invalid, and thus no faults will be generated.
I see your point about the window. I was trying to keep it clean as
different indices get populated. Initially, IRT[0..n-1] is populated.
Hm, I see. For this specific use-case you are changing the IRT index
when the guest updates the MSI vector. Tearing down the old entries
is better done _after_ the MSI entry has been updated, so that the
IRTE pointed to by the MSI entry is valid at all times.
Later, when the gvec is available, we want IRT[gvec..gvec+n-1] populated. I
guess the new gvec ones could be added, and then 0...gvec-1 removed. Or
don't bother?
Indeed, that would be a better approach, as then the IRTE would always
be valid.
In fact you could possibly leave the old IRTE entries as-is; they
would be unhooked from any MSI entry, and if re-used they would be
set up correctly. For this specific quirk, where vector == IRT index,
there's never a need to search for a free IRTE, as the guest-set
vector dictates which IRTE to use.
I guess it would be nice to attempt to keep the inuse IRTE bitmap in
sync if possible.
I considered leaving IRTE[0] and adding IRTE[gvec]. I think that could
work, but would be more hacky.
I was trying to keep the irte accounting bitmap correct, but it doesn't
really matter for per-device IRT.
Yes, that's my thinking too. If you can move the call to tear down the
old IRTE after the new one has been set up and the MSI entry has been
updated, that would be the best approach IMO.
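i.e. something along these lines (sketch only, the wrapper name is made up; the update_intremap_entry_from_msi_msg() arguments follow the hunk quoted above):

static int switch_remap_index(struct amd_iommu *iommu, uint16_t bdf,
                              unsigned int nr, struct msi_desc *msi_desc,
                              struct msi_msg *msg, uint32_t *data)
{
    int old_idx = msi_desc->remap_index;
    int new_idx = msi_desc->gvec;
    int rc;

    /* 1. Populate IRTE[new_idx .. new_idx + nr - 1] first. */
    rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr, &new_idx,
                                            msg, data);
    if ( rc )
        return rc;

    /* 2. Switch the MSI entry over to the new index. */
    msi_desc->remap_index = new_idx;

    /*
     * 3. Only now release the old entries: the device never points at an
     *    invalid IRTE, so no faults are generated during the switch.
     */
    if ( old_idx >= 0 && old_idx != new_idx )
        update_intremap_entry_from_msi_msg(iommu, bdf, nr, &old_idx,
                                           NULL, NULL);

    return 0;
}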
Ok.
Thanks,
Jason