On 17/04/2025 5:27 pm, Alejandro Jimenez wrote:
> On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:
>>
>>
>> On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
>>> The current amdvi_page_walk() is designed to be called by the replay()
>>> method. Rather than drastically altering it, introduce helpers to fetch
>>> guest PTEs that will be used by a page walker implementation.
>>>
>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jime...@oracle.com>
>>> ---
>>>  hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>>>  hw/i386/amd_iommu.h |  42 +++++++++++++++
>>>  2 files changed, 167 insertions(+)
>>>
>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>> index 0af873b66a31..d089fdc28ef1 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>>      }
>>>  };
>>>
>>> +/*
>>> + * For a PTE encoding a large page, return the page size it encodes as
>>> + * described by the AMD IOMMU Specification Table 14: Example Page Size
>>> + * Encodings.
>>> + * No need to adjust the value of the PTE to point to the first PTE in
>>> + * the large page since the encoding guarantees all "base" PTEs in the
>>> + * large page are the same.
>>> + */
>>> +static uint64_t large_pte_page_size(uint64_t pte)
>>> +{
>>> +    assert(PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> +    /* Determine size of the large/contiguous page encoded in the PTE */
>>> +    return PTE_LARGE_PAGE_SIZE(pte);
>>> +}
>>> +
>>> +/*
>>> + * Helper function to fetch a PTE using AMD v1 pgtable format.
>>> + * Returns:
>>> + *  -2: The Page Table Root could not be read from the DTE, or the IOVA
>>> + *      is larger than supported by the current page table level encoded
>>> + *      in DTE[Mode].
>>> + *  -1: PTE could not be read from guest memory during a page table walk.
>>> + *      This means that the DTE has valid data, and one of the lower
>>> + *      level entries in the Page Table could not be read.
>>> + *   0: PTE is marked not present, or entry is 0.
>>> + *  >0: Leaf PTE value resolved from walking the Guest IO Page Table.
>>> + */
>>
>> This seems to be a bit error prone as any statement like "if (pte < 0)"
>> might be completely removed by the compiler during optimization phases.
>
> Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to
> check for error values. As is the case in many of the patches, I am
> following the example of the VT-d implementation, which does the same
> thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().
Yes, I know :) Note that VT-d only has one potential error code (-1),
which seems easier to handle at the call site.

>
> When using fetch_pte() again in patch [17/18] I considered adding a
> helper to check if fetch_pte() returned a valid PTE, but it seemed
> unnecessary given that there are only two errors to be checked.
>
> Another choice was to make fetch_pte() return an int so the error check
> could be done simply via (pte < 0), since bit 63 is either Reserved
> (DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion
> since you'd expect a PTE to be a 64-bit long value. Though I am aware
> the way error return/checking is implemented essentially relies on that
> behavior.
>
>> If you want to reuse such "high" values, defines could help.
>
> Sorry, I don't follow. Do you mean using defines as in still returning a
> uint64_t but giving -1 and -2 special definitions? That might make the
> code somewhat more readable when checking the error values, but it still
> requires casting to uint64_t on the check, and doesn't solve the problem
> of a careless caller using (pte < 0) to check for errors...

Yes, I think that would be more readable. When using defines, the caller
no longer needs to be aware of the fact that the value has been cast from
a negative number, which reduces the risk of writing things like
(pte < 0). I prefer the out parameter solution, but let's see what other
reviews say.

Thanks for this patch set :)

>
>> Otherwise, pte could be an out parameter.
>
> In general, I think we have to accept the caveat that callers of
> fetch_pte() must have some implementation-specific knowledge to know
> they cannot check for errors using (pte < 0). Maybe with the aid of a
> strongly worded warning in the function header comment...
>
> But if that argument doesn't convince you, and none of the alternatives
> above seem better, then I am leaning towards using the out parameter
> approach.
>
> Thank you for the feedback.
> Alejandro
>
>>
>>> +static uint64_t __attribute__((unused))
>>> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>> +          hwaddr *page_size)
>>> +{
>>> +    IOMMUAccessFlags perms = amdvi_get_perms(dte);
>>> +
>>> +    uint8_t level, mode;
>>> +    uint64_t pte = dte, pte_addr;
>>> +
>>> +    *page_size = 0;
>>> +
>>> +    if (perms == IOMMU_NONE) {
>>> +        return (uint64_t)-2;
>>> +    }
>>> +
>>> +    /*
>>> +     * The Linux kernel driver initializes the default mode to 3,
>>> +     * corresponding to a 39-bit GPA space, where each entry in the
>>> +     * pagetable translates to a 1GB (2^30) page size.
>>> +     */
>>> +    level = mode = get_pte_translation_mode(dte);
>>> +    assert(mode > 0 && mode < 7);
>>> +
>>> +    /*
>>> +     * If IOVA is larger than the max supported by the current pgtable
>>> +     * level, there is nothing to do. This signals that the pagetable
>>> +     * level should be increased, or is an address meant to have special
>>> +     * behavior like invalidating the entire cache.
>>> +     */
>>> +    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
>>> +        /* IOVA too large for the current DTE */
>>> +        return (uint64_t)-2;
>>> +    }
>>> +
>>> +    do {
>>> +        level -= 1;
>>> +
>>> +        /* Update the page_size */
>>> +        *page_size = PTE_LEVEL_PAGE_SIZE(level);
>>> +
>>> +        /* Permission bits are ANDed at every level, including the DTE */
>>> +        perms &= amdvi_get_perms(pte);
>>> +        if (perms == IOMMU_NONE) {
>>> +            return pte;
>>> +        }
>>> +
>>> +        /* Not Present */
>>> +        if (!IOMMU_PTE_PRESENT(pte)) {
>>> +            return 0;
>>> +        }
>>> +
>>> +        /* Large or Leaf PTE found */
>>> +        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
>>> +            /* Leaf PTE found */
>>> +            break;
>>> +        }
>>> +
>>> +        /*
>>> +         * Index the pgtable using the IOVA bits corresponding to the
>>> +         * current level and walk down to the lower level.
>>> +         */
>>> +        pte_addr = NEXT_PTE_ADDR(pte, level, address);
>>> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>>> +
>>> +        if (pte == (uint64_t)-1) {
>>> +            /*
>>> +             * A returned PTE of -1 indicates a failure to read the page
>>> +             * table entry from guest memory.
>>> +             */
>>> +            if (level == mode - 1) {
>>> +                /* Failure to retrieve the Page Table from Root Pointer */
>>> +                *page_size = 0;
>>> +                return (uint64_t)-2;
>>> +            } else {
>>> +                /* Failure to read PTE. Page walk skips a page_size chunk */
>>> +                return pte;
>>> +            }
>>> +        }
>>> +    } while (level > 0);
>>> +
>>> +    /*
>>> +     * Page walk ends when the Next Level field on the PTE shows that
>>> +     * either a leaf PTE or a series of large PTEs has been reached. In
>>> +     * the latter case, return the pointer to the first PTE of the series.
>>> +     */
>>> +    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 ||
>>> +           PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> +    /*
>>> +     * In case the range starts in the middle of a contiguous page, need
>>> +     * to return the first PTE
>>> +     */
>>> +    if (PTE_NEXT_LEVEL(pte) == 7) {
>>> +        /* Update page_size with the large PTE page size */
>>> +        *page_size = large_pte_page_size(pte);
>>> +    }
>>> +
>>> +    return pte;
>>> +}
>>> +
>>>  /*
>>>   * Toggle between address translation and passthrough modes by enabling
>>>   * the corresponding memory regions.
>>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>>> index c89e7dc9947d..fc4d2f7a4575 100644
>>> --- a/hw/i386/amd_iommu.h
>>> +++ b/hw/i386/amd_iommu.h
>>> @@ -25,6 +25,8 @@
>>>  #include "hw/i386/x86-iommu.h"
>>>  #include "qom/object.h"
>>>
>>> +#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
>>> +
>>>  /* Capability registers */
>>>  #define AMDVI_CAPAB_BAR_LOW           0x04
>>>  #define AMDVI_CAPAB_BAR_HIGH          0x08
>>> @@ -174,6 +176,46 @@
>>>  #define AMDVI_GATS_MODE                 (2ULL << 12)
>>>  #define AMDVI_HATS_MODE                 (2ULL << 10)
>>>
>>> +/* Page Table format */
>>> +
>>> +#define AMDVI_PTE_PR                 (1ULL << 0)
>>> +#define AMDVI_PTE_NEXT_LEVEL_MASK    GENMASK64(11, 9)
>>> +
>>> +#define IOMMU_PTE_PRESENT(pte)       ((pte) & AMDVI_PTE_PR)
>>> +
>>> +/* Using level=0 for leaf PTE at 4K page size */
>>> +#define PT_LEVEL_SHIFT(level)        (12 + ((level) * 9))
>>> +
>>> +/* Return IOVA bit group used to index the Page Table at specific level */
>>> +#define PT_LEVEL_INDEX(level, iova)  (((iova) >> PT_LEVEL_SHIFT(level)) & \
>>> +                                      GENMASK64(8, 0))
>>> +
>>> +/* Return the max address for a specified level i.e. max_oaddr */
>>> +#define PT_LEVEL_MAX_ADDR(x)         (((x) < 5) ? \
>>> +                                      ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
>>> +                                      (~(0ULL)))
>>> +
>>> +/* Extract the NextLevel field from PTE/PDE */
>>> +#define PTE_NEXT_LEVEL(pte)          (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
>>> +
>>> +/* Take page table level and return default pagetable size for level */
>>> +#define PTE_LEVEL_PAGE_SIZE(level)   (1ULL << (PT_LEVEL_SHIFT(level)))
>>> +
>>> +/*
>>> + * Return address of lower level page table encoded in PTE and specified
>>> + * by current level and corresponding IOVA bit group at such level.
>>> + */
>>> +#define NEXT_PTE_ADDR(pte, level, iova)  (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
>>> +                                          (PT_LEVEL_INDEX(level, iova) * 8))
>>> +
>>> +/*
>>> + * Take a PTE value with mode=0x07 and return the page size it encodes.
>>> + */
>>> +#define PTE_LARGE_PAGE_SIZE(pte)     (1ULL << (1 + cto64(((pte) | 0xfffULL))))
>>> +
>>> +/* Return number of PTEs to use for a given page size (expected power of 2) */
>>> +#define PAGE_SIZE_PTE_COUNT(pgsz)    (1ULL << ((ctz64(pgsz) - 12) % 9))
>>> +
>>>  /* IOTLB */
>>>  #define AMDVI_IOTLB_MAX_SIZE 1024
>>>  #define AMDVI_DEVID_SHIFT    36
>>> --
>>> 2.43.5
>>>
>>>
>