On Fri, Oct 4, 2024 at 6:00 AM Daniel Henrique Barboza
<dbarb...@ventanamicro.com> wrote:
>
>
>
> On 10/4/24 5:33 AM, Andrew Jones wrote:
> > On Thu, Oct 03, 2024 at 11:36:00AM GMT, Tomasz Jeznach wrote:
> >> On Thu, Oct 3, 2024 at 6:06 AM Daniel Henrique Barboza
> >> <dbarb...@ventanamicro.com> wrote:
> >>>
> >>>
> >>>
> >>> On 10/3/24 6:26 AM, Andrew Jones wrote:
> >>>> On Tue, Oct 01, 2024 at 10:02:58PM GMT, Daniel Henrique Barboza wrote:
> >>>> ...
> >>>>> +/*
> >>>>> + * RISCV IOMMU Address Translation Lookup - Page Table Walk
> >>>>> + *
> >>>>> + * Note: Code is based on get_physical_address() from 
> >>>>> target/riscv/cpu_helper.c
> >>>>> + * Both implementation can be merged into single helper function in 
> >>>>> future.
> >>>>> + * Keeping them separate for now, as error reporting and flow 
> >>>>> specifics are
> >>>>> + * sufficiently different for separate implementation.
> >>>>> + *
> >>>>> + * @s        : IOMMU Device State
> >>>>> + * @ctx      : Translation context for device id and process address 
> >>>>> space id.
> >>>>> + * @iotlb    : translation data: physical address and access mode.
> >>>>> + * @return   : success or fault cause code.
> >>>>> + */
> >>>>> +static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext 
> >>>>> *ctx,
> >>>>> +    IOMMUTLBEntry *iotlb)
> >>>>> +{
> >>>>> +    dma_addr_t addr, base;
> >>>>> +    uint64_t satp, gatp, pte;
> >>>>> +    bool en_s, en_g;
> >>>>> +    struct {
> >>>>> +        unsigned char step;
> >>>>> +        unsigned char levels;
> >>>>> +        unsigned char ptidxbits;
> >>>>> +        unsigned char ptesize;
> >>>>> +    } sc[2];
> >>>>> +    /* Translation stage phase */
> >>>>> +    enum {
> >>>>> +        S_STAGE = 0,
> >>>>> +        G_STAGE = 1,
> >>>>> +    } pass;
> >>>>> +    MemTxResult ret;
> >>>>> +
> >>>>> +    satp = get_field(ctx->satp, RISCV_IOMMU_ATP_MODE_FIELD);
> >>>>> +    gatp = get_field(ctx->gatp, RISCV_IOMMU_ATP_MODE_FIELD);
> >>>>> +
> >>>>> +    en_s = satp != RISCV_IOMMU_DC_FSC_MODE_BARE;
> >>>>> +    en_g = gatp != RISCV_IOMMU_DC_IOHGATP_MODE_BARE;
> >>>>> +
> >>>>> +    /*
> >>>>> +     * Early check for MSI address match when IOVA == GPA. This check
> >>>>> +     * is required to ensure MSI translation is applied in case
> >>>>> +     * first-stage translation is set to BARE mode. In this case IOVA
> >>>>> +     * provided is a valid GPA. Running translation through page walk
> >>>>> +     * second stage translation will incorrectly try to translate GPA
> >>>>> +     * to host physical page, likely hitting IOPF.
> >>>>> +     */
> >>>>
> >>>> Why was this comment expanded from the simple
> >>>>
> >>>> "Early check for MSI address match when IOVA == GPA."
> >>>>
> >>>> The comment is now incorrect since the check is required even when
> >>>> first-stage translation is not BARE. I just skimmed the spec again trying
> >>>> to figure out if the removal of '!en_s' is a hack or a fix, and I'm
> >>>> inclined to say "fix", but it's an incomplete fix. I found a sentence 
> >>>> that
> >>>> says
> >>>>
> >>>> "If the virtual memory scheme selected for first-stage is Bare but the
> >>>> scheme for the second-stage is not Bare then the IOVA is a GPA."
> >>>>
> >>>> which, in a way, defines a GPA to only be a GPA when second-stage is used
> >>>> (and all MSI translation specifications refer to GPAs). However, maybe I
> >>>> missed it, but I couldn't find any actual reason that the MSI table can't
> >>>> be used when first-stage is not BARE and second-stage is (and, of course,
> >>>> it makes no difference for single-stage translations to call IOVAs GPAs
> >>>> or not).
> >>>>
> >>>> Now, I also see
> >>>>
> >>>> "If the virtual memory scheme selected for neither stage is Bare then the
> >>>> IOVA is a VA. Two-stage address translation is in effect. The first-stage
> >>>> translates the VA to a GPA and the second-stage translates the GPA to a
> >>>> SPA."
> >>>>
> >>>> in the spec, which means we should probably change the removal of '!en_s'
> >>>> to '!(en_s && en_g)'. VFIO+irqbypass would still work with that and, when
> >>>> Linux learns to support two-stage translation, we wouldn't incorrectly 
> >>>> try
> >>>> to check for a GVA in the MSI table.
> >>>
> >>> Ok. It seems to me that we can't rely on the riscv-iommu spec alone to let
> >>> us know how to detect if IOVA == GPA, given that one of the main usages
> >>> we have ATM (VFIO irqbypass) will use GPAs with S-stage enabled.
> >>>
> >>> (Note: shouldn't we open a bug against the riscv-iommu spec?)
> >>>
> >>> I like the idea of ruling out the case where IOVA == VA since that is 
> >>> clear
> >>> in the spec. We also know that MSI entries will always contains GPAs.
> >>>
> >>> This is the change I'm going to make in v9:
> >>>
> >>>
> >>
> >> It might be worth it to quote spec definitions of IOVA, GPA, SPA.
> >>
> >> IOVA: I/O Virtual Address: Virtual address for DMA by devices
> >> GPA: Guest Physical Address: An address in the virtualized physical
> >> memory space of a virtual machine.
> >> SPA: Supervisor Physical Address: Physical address used to access
> >> memory and memory-mapped resources.
> >>
> >>  From IOMMU specification it's pretty clear, an untranslated address is
> >> always an IOVA.
> >>
> >> If you look at the introduction section and section 2.3. "Process to
> >> translate an IOVA", translation (if enabled by the device/process
> >> context) to GPA, SPA is expressed as:
> >>
> >> step #17: GPA = FSC[IOVA]
> >> step #19: SPA = IOHGATP[GPA]
> >>
> >> With that, MSI address translations always use GPA as an input
> >> (section 2.3.3. "Process to translate addresses of MSIs"), and is done
> >> as step #18 of the IOVA translation.
> >>
> >> And now:
> >> If FSC.Mode == Bare, step #17 translation is simple GPA = IOVA
> >> If IOHGATP.Mode == Bare, step #19 translation is simple SPA = GPA
> >>
> >> If both FSC and IOHGATP modes are bare, no address translation is
> >> performed, and SPA = GPA = IOVA. However if "MSI page tables are
> >> enabled, then the MSI address translation process specified in Section
> >> 2.3.3 is invoked".  This was the reason to put the
> >> riscv_iommu_msi_check() function before check for pass-through mode
> >> check.
> >>
> >> If FSC.Mode == Bare and untranslated address given to the function
> >> riscv_iommu_spa_fetch() is an GPA = IOVA. It could also be SPA if
> >> IOHGATP.Mode == Bare, but it is irrelevant for the MSI address
> >> translation.  That is the reason to put only the check for !en_s in
> >> before riscv_iommu_msi_check() call.
> >>
> >> I'm aware that VFIO uses S-Stage translation only, but it does not
> >> change the hardware logic. For such cases VFIO can provide IMSIC
> >> register mappings in S-Stage PT to emulate MSI doorbell registers (if
> >> needed). When S-Stage translation is used, an untranslated address
> >> processed by the IOMMU hardware is still an IOVA, even if the guest OS
> >> / VFIO user treats it as PA.
> >
> > This would work for guest interrupt file mappings, but MRIFs won't work
> > and the note of the "Device contexts at an IOMMU" section of the AIA spec
> > points out that VCPU migration may become more difficult to manage.
> >
> >>
> >> Modification of the check to include en_g ( !(en_s && en_g) && ,,, )
> >> will incorrectly enable MSI address translation for an IOVA address
> >> (as defined in IOMMU spec) if G-Stage translation is set to Bare mode.
> >>
> >> I'd suggest to keep the s/g stage checks here as:
> >>        if ((iotlb->perm & IOMMU_WO) && !en_s &&
> >>            riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
> >>            ...
> >>        }
> >>
> >> Hope that helps this discussion.
> >
> > Yes, it certainly helps, thank you. It points out that step 17 of "Process
> > to translate an IOVA" produces a GPA which is consumed by step 18 for the
> > MSI translation (and there's no other way to get to step 18). OIOW, we
> > must do a first-stage translation before we check the MSI page table,
> > which means we can't do an early MSI check unless we have strictly !en_s.
> >
> > This is unfortunate, considering the current state of VFIO+irqbypass and
> > the IOMMU driver. I also wonder if it's really necessary. It seems to me
> > that the MSI page table should always be checked first when either S-stage
> > or G-stage is Bare (so there's only a single-stage of translation) and
> > msiptp.MODE is not Off. I can't see what harm could come from that and it
> > would give system software the ability to leverage code that already knows
> > how to manage single-stage IOMMUs for device assignment.
>
> I agree with all of that. But at the end of the day we need to adhere to the
> HW spec as closely as we can, and it's now clear that the assumptions we were
> making in this particular check, for the sake of VFIO+irqbypass support, is
> sort of spec violation.
>
> I still believe this is something that we would like to discuss in the spec
> space, but for now we don't have much to do w.r.t the emulation. If we want
> VFIO+irqbypass to work the Linux driver will need to do extra work.
>
> This is what I'll do for v9:
>
>     /*
>       * Early check for MSI address match when IOVA == GPA.
>       * Note that the (!en_s) condition means that the MSI
>       * page table may only be used when guest pages are
>       * mapped using the g-stage page table, whether single-
>       * or two-stage paging is enabled. It's unavoidable though,
>       * because the spec mandates that we do a first-stage
>       * translation before we check the MSI page table, which
>       * means we can't do an early MSI check unless we have
>       * strictly !en_s.
>       */
>      if (!en_s && (iotlb->perm & IOMMU_WO) &&
>          riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
>          iotlb->target_as = &s->trap_as;
>          iotlb->translated_addr = iotlb->iova;
>          iotlb->addr_mask = ~TARGET_PAGE_MASK;
>          return 0;
>      }
>
>
> It also came to my attention today that the kernel support is waiting for
> the QEMU pieces to land due to the Red Hat PCI ID we're adding. I'll post
> v9 later today or next Monday to see if we can speed things up.
>

Thanks. I'll be sending out kernel iommu/riscv v9 patches later today.
Updated PCI IDs and few minor bugs fixed,
All re-tested with QEMU riscv/iommu-v9 (fetched from github).

Best,
- Tomasz

>
> Thanks,
>
> Daniel
>
>
>
>
>
> >
> > Thanks,
> > drew

Reply via email to