On 4/18/2025 11:00 AM, CLEMENT MATHIEU--DRIF wrote:


On 17/04/2025 5:27 pm, Alejandro Jimenez wrote:


On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:


On 14/04/2025 4:02 am, Alejandro Jimenez wrote:


The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.

Signed-off-by: Alejandro Jimenez <alejandro.j.jime...@oracle.com>
---
    hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
    hw/i386/amd_iommu.h |  42 +++++++++++++++
    2 files changed, 167 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 0af873b66a31..d089fdc28ef1 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
        }
    };

+/*
+ * For a PTE encoding a large page, return the page size it encodes as
+ * described by the AMD IOMMU Specification Table 14: Example Page Size
+ * Encodings.
+ * No need to adjust the value of the PTE to point to the first PTE in the
+ * large page since the encoding guarantees all "base" PTEs in the large
+ * page are the same.
+ */
+static uint64_t large_pte_page_size(uint64_t pte)
+{
+    assert(PTE_NEXT_LEVEL(pte) == 7);
+
+    /* Determine size of the large/contiguous page encoded in the PTE */
+    return PTE_LARGE_PAGE_SIZE(pte);
+}
+
+/*
+ * Helper function to fetch a PTE using AMD v1 pgtable format.
+ * Returns:
+ * -2:  The Page Table Root could not be read from DTE, or IOVA is larger
+ *      than supported by the current page table level encoded in DTE[Mode].
+ * -1:  PTE could not be read from guest memory during a page table walk.
+ *      This means that the DTE has valid data, and one of the lower level
+ *      entries in the Page Table could not be read.
+ *  0:  PTE is marked not present, or entry is 0.
+ * >0:  Leaf PTE value resolved from walking the Guest IO Page Table.
+ */

This seems to be a bit error prone as any statement like "if (pte < 0)"
might be completely removed by the compiler during optimization phases.

Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to
check for error values. As is the case in many of the patches, I am
following the example of the VT-d implementation, which does the same
thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().

Yes, I know :)

Note that VT-d only has 1 potential error code (-1) which seems easier
to handle at call site.


When using fetch_pte() again in patch [17/18] I considered adding a
helper to check if fetch_pte() returned a valid PTE, but seemed
unnecessary given that there are only two errors to be checked.

Another choice was to make fetch_pte() return an int so the error check
could be done simply via (pte < 0), since bit 63 is either Reserved
(DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion
since you'd expect a PTE to be a 64-bit long value. Though I am aware
the way error return/checking is implemented essentially relies on that
behavior.

If you want to reuse such "high" values, defines could help.

Sorry, I don't follow. Do you mean using defines as in still returning a
uint64_t but giving -1 and -2 special definitions? That might make the
code somewhat more readable when checking the error values, but still
requires casting to uint64_t on the check, and doesn't solve the problem
of a careless caller using (pte < 0) to check for errors...

Yes, I think that it would be more readable.
When using defines, the caller no longer needs to be aware of the fact
that the value has been cast from a negative number, which reduces the
risk of writing things like (pte < 0).

I prefer the out parameter solution but let's see what other reviews say.


I think having pte as an out parameter is the better solution here.
Less error prone and more readable!

Regards
Sairaj Kodilkar

Thanks for this patch set :)


Otherwise, pte could be an out parameter.

In general, I think we have to accept the caveat that callers of
fetch_pte() must have some implementation specific knowledge to know
they cannot check for errors using (pte < 0). Maybe with the aid of a
strongly worded warning on the function header comment...

But if that argument doesn't convince you, and none of the alternatives
above seem better, then I am leaning towards using the out parameter
approach.

Thank you for the feedback.
Alejandro


+static uint64_t __attribute__((unused))
+fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
+          hwaddr *page_size)
+{
+    IOMMUAccessFlags perms = amdvi_get_perms(dte);
+
+    uint8_t level, mode;
+    uint64_t pte = dte, pte_addr;
+
+    *page_size = 0;
+
+    if (perms == IOMMU_NONE) {
+        return (uint64_t)-2;
+    }
+
+    /*
+     * The Linux kernel driver initializes the default mode to 3,
+     * corresponding to a 39-bit GPA space, where each entry in the
+     * pagetable translates to a 1GB (2^30) page size.
+     */
+    level = mode = get_pte_translation_mode(dte);
+    assert(mode > 0 && mode < 7);
+
+    /*
+     * If IOVA is larger than the max supported by the current pgtable
+     * level, there is nothing to do. This signals that the pagetable
+     * level should be increased, or that the address is meant to have
+     * special behavior like invalidating the entire cache.
+     */
+    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
+        /* IOVA too large for the current DTE */
+        return (uint64_t)-2;
+    }
+
+    do {
+        level -= 1;
+
+        /* Update the page_size */
+        *page_size = PTE_LEVEL_PAGE_SIZE(level);
+
+        /* Permission bits are ANDed at every level, including the DTE */
+        perms &= amdvi_get_perms(pte);
+        if (perms == IOMMU_NONE) {
+            return pte;
+        }
+
+        /* Not Present */
+        if (!IOMMU_PTE_PRESENT(pte)) {
+            return 0;
+        }
+
+        /* Large or Leaf PTE found */
+        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
+            /* Leaf PTE found */
+            break;
+        }
+
+        /*
+         * Index the pgtable using the IOVA bits corresponding to the
+         * current level and walk down to the lower level.
+         */
+        pte_addr = NEXT_PTE_ADDR(pte, level, address);
+        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+        if (pte == (uint64_t)-1) {
+            /*
+             * A returned PTE of -1 indicates a failure to read the page
+             * table entry from guest memory.
+             */
+            if (level == mode - 1) {
+                /* Failure to retrieve the Page Table from Root Pointer */
+                *page_size = 0;
+                return (uint64_t)-2;
+            } else {
+                /* Failure to read PTE. Page walk skips a page_size chunk */
+                return pte;
+            }
+        }
+    } while (level > 0);
+
+    /*
+     * Page walk ends when the Next Level field in the PTE shows that either
+     * a leaf PTE or a series of large PTEs has been reached. In the latter
+     * case, return the pointer to the first PTE of the series.
+     */
+    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || PTE_NEXT_LEVEL(pte) == 7);
+
+    /*
+     * In case the range starts in the middle of a contiguous page, need
+     * to return the first PTE
+     */
+    if (PTE_NEXT_LEVEL(pte) == 7) {
+        /* Update page_size with the large PTE page size */
+        *page_size = large_pte_page_size(pte);
+    }
+
+    return pte;
+}
+
    /*
     * Toggle between address translation and passthrough modes by enabling the
     * corresponding memory regions.
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index c89e7dc9947d..fc4d2f7a4575 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -25,6 +25,8 @@
    #include "hw/i386/x86-iommu.h"
    #include "qom/object.h"

+#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
+
    /* Capability registers */
    #define AMDVI_CAPAB_BAR_LOW           0x04
    #define AMDVI_CAPAB_BAR_HIGH          0x08
@@ -174,6 +176,46 @@
    #define AMDVI_GATS_MODE                 (2ULL <<  12)
    #define AMDVI_HATS_MODE                 (2ULL <<  10)

+/* Page Table format */
+
+#define AMDVI_PTE_PR                    (1ULL << 0)
+#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
+
+#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
+
+/* Using level=0 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
+
+/* Return IOVA bit group used to index the Page Table at specific level */
+#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
+                                        GENMASK64(8, 0))
+
+/* Return the max address for a specified level i.e. max_oaddr */
+#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
+                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
+                                (~(0ULL)))
+
+/* Extract the NextLevel field from PTE/PDE */
+#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
+
+/* Take page table level and return default pagetable size for level */
+#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
+
+/*
+ * Return address of lower level page table encoded in PTE and specified by
+ * current level and corresponding IOVA bit group at such level.
+ */
+#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
+                                        (PT_LEVEL_INDEX(level, iova) * 8))
+
+/*
+ * Take a PTE value with mode=0x07 and return the page size it encodes.
+ */
+#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
+
+/* Return number of PTEs to use for a given page size (expected power of 2) */
+#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
+
    /* IOTLB */
    #define AMDVI_IOTLB_MAX_SIZE 1024
    #define AMDVI_DEVID_SHIFT    36
--
2.43.5




