On 10/4/24 5:33 AM, Andrew Jones wrote:
On Thu, Oct 03, 2024 at 11:36:00AM GMT, Tomasz Jeznach wrote:
On Thu, Oct 3, 2024 at 6:06 AM Daniel Henrique Barboza
<dbarb...@ventanamicro.com> wrote:



On 10/3/24 6:26 AM, Andrew Jones wrote:
On Tue, Oct 01, 2024 at 10:02:58PM GMT, Daniel Henrique Barboza wrote:
...
+/*
+ * RISCV IOMMU Address Translation Lookup - Page Table Walk
+ *
+ * Note: Code is based on get_physical_address() from target/riscv/cpu_helper.c
+ * Both implementation can be merged into single helper function in future.
+ * Keeping them separate for now, as error reporting and flow specifics are
+ * sufficiently different for separate implementation.
+ *
+ * @s        : IOMMU Device State
+ * @ctx      : Translation context for device id and process address space id.
+ * @iotlb    : translation data: physical address and access mode.
+ * @return   : success or fault cause code.
+ */
+static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
+    IOMMUTLBEntry *iotlb)
+{
+    dma_addr_t addr, base;
+    uint64_t satp, gatp, pte;
+    bool en_s, en_g;
+    struct {
+        unsigned char step;
+        unsigned char levels;
+        unsigned char ptidxbits;
+        unsigned char ptesize;
+    } sc[2];
+    /* Translation stage phase */
+    enum {
+        S_STAGE = 0,
+        G_STAGE = 1,
+    } pass;
+    MemTxResult ret;
+
+    satp = get_field(ctx->satp, RISCV_IOMMU_ATP_MODE_FIELD);
+    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
+
+    en_s = satp != RISCV_IOMMU_DC_FSC_MODE_BARE;
+    en_g = gatp != RISCV_IOMMU_DC_IOHGATP_MODE_BARE;
+
+    /*
+     * Early check for MSI address match when IOVA == GPA. This check
+     * is required to ensure MSI translation is applied in case
+     * first-stage translation is set to BARE mode. In this case IOVA
+     * provided is a valid GPA. Running translation through page walk
+     * second stage translation will incorrectly try to translate GPA
+     * to host physical page, likely hitting IOPF.
+     */

Why was this comment expanded from the simple

"Early check for MSI address match when IOVA == GPA."

The comment is now incorrect since the check is required even when
first-stage translation is not BARE. I just skimmed the spec again trying
to figure out if the removal of '!en_s' is a hack or a fix, and I'm
inclined to say "fix", but it's an incomplete fix. I found a sentence that
says

"If the virtual memory scheme selected for first-stage is Bare but the
scheme for the second-stage is not Bare then the IOVA is a GPA."

which, in a way, defines a GPA to only be a GPA when second-stage is used
(and all MSI translation specifications refer to GPAs). However, maybe I
missed it, but I couldn't find any actual reason that the MSI table can't
be used when first-stage is not BARE and second-stage is (and, of course,
it makes no difference for single-stage translations to call IOVAs GPAs
or not).

Now, I also see

"If the virtual memory scheme selected for neither stage is Bare then the
IOVA is a VA. Two-stage address translation is in effect. The first-stage
translates the VA to a GPA and the second-stage translates the GPA to a
SPA."

in the spec, which means we should probably change the removal of '!en_s'
to '!(en_s && en_g)'. VFIO+irqbypass would still work with that and, when
Linux learns to support two-stage translation, we wouldn't incorrectly try
to check for a GVA in the MSI table.

Ok. It seems to me that we can't rely on the riscv-iommu spec alone to let
us know how to detect if IOVA == GPA, given that one of the main usages
we have ATM (VFIO irqbypass) will use GPAs with S-stage enabled.

(Note: shouldn't we open a bug against the riscv-iommu spec?)

I like the idea of ruling out the case where IOVA == VA since that is clear
in the spec. We also know that MSI entries will always contains GPAs.

This is the change I'm going to make in v9:



It might be worth it to quote spec definitions of IOVA, GPA, SPA.

IOVA: I/O Virtual Address: Virtual address for DMA by devices
GPA: Guest Physical Address: An address in the virtualized physical
memory space of a virtual machine.
SPA: Supervisor Physical Address: Physical address used to access
memory and memory-mapped resources.

 From IOMMU specification it's pretty clear, an untranslated address is
always an IOVA.

If you look at the introduction section and section 2.3. "Process to
translate an IOVA", translation (if enabled by the device/process
context) to GPA, SPA is expressed as:

step #17: GPA = FSC[IOVA]
step #19: SPA = IOHGATP[GPA]

With that, MSI address translations always use GPA as an input
(section 2.3.3. "Process to translate addresses of MSIs"), and is done
as step #18 of the IOVA translation.

And now:
If FSC.Mode == Bare, step #17 translation is simple GPA = IOVA
If IOHGATP.Mode == Bare, step #19 translation is simple SPA = GPA

If both FSC and IOHGATP modes are bare, no address translation is
performed, and SPA = GPA = IOVA. However if "MSI page tables are
enabled, then the MSI address translation process specified in Section
2.3.3 is invoked".  This was the reason to put the
riscv_iommu_msi_check() function before check for pass-through mode
check.

If FSC.Mode == Bare and untranslated address given to the function
riscv_iommu_spa_fetch() is an GPA = IOVA. It could also be SPA if
IOHGATP.Mode == Bare, but it is irrelevant for the MSI address
translation.  That is the reason to put only the check for !en_s in
before riscv_iommu_msi_check() call.

I'm aware that VFIO uses S-Stage translation only, but it does not
change the hardware logic. For such cases VFIO can provide IMSIC
register mappings in S-Stage PT to emulate MSI doorbell registers (if
needed). When S-Stage translation is used, an untranslated address
processed by the IOMMU hardware is still an IOVA, even if the guest OS
/ VFIO user treats it as PA.

This would work for guest interrupt file mappings, but MRIFs won't work
and the note of the "Device contexts at an IOMMU" section of the AIA spec
points out that VCPU migration may become more difficult to manage.


Modification of the check to include en_g ( !(en_s && en_g) && ,,, )
will incorrectly enable MSI address translation for an IOVA address
(as defined in IOMMU spec) if G-Stage translation is set to Bare mode.

I'd suggest to keep the s/g stage checks here as:
       if ((iotlb->perm & IOMMU_WO) && !en_s &&
           riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
           ...
       }

Hope that helps this discussion.

Yes, it certainly helps, thank you. It points out that step 17 of "Process
to translate an IOVA" produces a GPA which is consumed by step 18 for the
MSI translation (and there's no other way to get to step 18). OIOW, we
must do a first-stage translation before we check the MSI page table,
which means we can't do an early MSI check unless we have strictly !en_s.

This is unfortunate, considering the current state of VFIO+irqbypass and
the IOMMU driver. I also wonder if it's really necessary. It seems to me
that the MSI page table should always be checked first when either S-stage
or G-stage is Bare (so there's only a single-stage of translation) and
msiptp.MODE is not Off. I can't see what harm could come from that and it
would give system software the ability to leverage code that already knows
how to manage single-stage IOMMUs for device assignment.

I agree with all of that. But at the end of the day we need to adhere to the
HW spec as closely as we can, and it's now clear that the assumptions we were
making in this particular check, for the sake of VFIO+irqbypass support, is
sort of spec violation.

I still believe this is something that we would like to discuss in the spec
space, but for now we don't have much to do w.r.t the emulation. If we want
VFIO+irqbypass to work the Linux driver will need to do extra work.

This is what I'll do for v9:

   /*
     * Early check for MSI address match when IOVA == GPA.
     * Note that the (!en_s) condition means that the MSI
     * page table may only be used when guest pages are
     * mapped using the g-stage page table, whether single-
     * or two-stage paging is enabled. It's unavoidable though,
     * because the spec mandates that we do a first-stage
     * translation before we check the MSI page table, which
     * means we can't do an early MSI check unless we have
     * strictly !en_s.
     */
    if (!en_s && (iotlb->perm & IOMMU_WO) &&
        riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
        iotlb->target_as = &s->trap_as;
        iotlb->translated_addr = iotlb->iova;
        iotlb->addr_mask = ~TARGET_PAGE_MASK;
        return 0;
    }


It also came to my attention today that the kernel support is waiting for
the QEMU pieces to land due to the Red Hat PCI ID we're adding. I'll post
v9 later today or next Monday to see if we can speed things up.


Thanks,

Daniel






Thanks,
drew

Reply via email to