riscv: implement p2m_next_level()

Oleksii Kurochko Wed, 16 Jul 2025 04:32:09 -0700


On 7/2/25 10:35 AM, Jan Beulich wrote:

On 10.06.2025 15:05, Oleksii Kurochko wrote:

--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
      return p2m_type_radix_get(p2m, pte) != p2m_invalid;
  }

+/*
+ * pte_is_* helpers are checking the valid bit set in the
+ * PTE but we have to check p2m_type instead (look at the comment above
+ * p2me_is_valid())
+ * Provide our own overlay to check the valid bit.
+ */
+static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
+{
+    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
+}

Same question as on the earlier patch - does P2M type apply to intermediate
page tables at all? (Conceptually it shouldn't.)


It doesn't matter whether it is an intermediate page table or a leaf PTE
pointing to a page: the PTE should be valid. Considering that in the current
implementation it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check
P2M.v instead of PTE.v.
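
To make the distinction concrete, here is a small standalone model (not the
Xen code). It assumes, purely for illustration, that the P2M type lives in the
two software-reserved RSW bits of the PTE (bits 8-9); all sketch_* names and
the type encoding are hypothetical. An entry can then have V = 0 while still
carrying a non-invalid P2M type, so a type-based check accepts it while a
V-bit check does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* RISC-V PTE: bit 0 is V (valid); bits 8-9 are RSW, reserved for software. */
#define PTE_VALID      (UINT64_C(1) << 0)
#define PTE_RSW_SHIFT  8
#define PTE_RSW_MASK   (UINT64_C(3) << PTE_RSW_SHIFT)

/* Hypothetical type encoding, used by this sketch only. */
enum sketch_p2m_type { SKETCH_P2M_INVALID = 0, SKETCH_P2M_RAM_RW = 1 };

typedef struct { uint64_t pte; } sketch_pte_t;

/* Hardware view: only the V bit matters. */
static bool sketch_pte_is_valid(sketch_pte_t e)
{
    return (e.pte & PTE_VALID) != 0;
}

/* P2M view: read the type from the RSW bits (an assumption of this sketch). */
static enum sketch_p2m_type sketch_p2m_type_get(sketch_pte_t e)
{
    return (enum sketch_p2m_type)((e.pte & PTE_RSW_MASK) >> PTE_RSW_SHIFT);
}

/* An entry is P2M-valid when its type is anything other than "invalid". */
static bool sketch_p2me_is_valid(sketch_pte_t e)
{
    return sketch_p2m_type_get(e) != SKETCH_P2M_INVALID;
}
```

With an entry whose raw value is only the type shifted into the RSW bits,
sketch_pte_is_valid() is false while sketch_p2me_is_valid() is true, which is
the PTE.v = 0 / P2M.v = 1 case mentioned above.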

@@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
      return e;
  }

+/* Generate table entry with correct attributes. */
+static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
+{
+    /*
+     * Since this function generates a table entry, according to "Encoding
+     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
+     * to point to the next level of the page table.
+     * Therefore, to ensure that an entry is a page table entry,
+     * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
+     * which overrides whatever was passed as `p2m_type_t` and guarantees that
+     * the entry is a page table entry by setting r = w = x = 0.
+     */
+    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw,
+                              p2m_access_n2rwx);

Similarly P2M access shouldn't apply to intermediate page tables. (Moot
with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
read what it means, didn't you?)


p2m_access_n2rwx was chosen not really because of the description mentioned
near its declaration, but because it sets r=w=x=0, which RISC-V expects for a
PTE that points to the next-level page table.

Generally, I agree that P2M access shouldn't be applied to intermediate page
tables.

What I can suggest in this case is to use p2m_access_rwx instead of
p2m_access_n2rwx, which will ensure that the P2M access type isn't applied
when p2m_entry_from_mfn() is called, and then, after calling
p2m_entry_from_mfn(), simply set PTE.r,w,x=0.
So this function will look like:
    /* Generate table entry with correct attributes. */
    static pte_t page_to_p2m_table(struct p2m_domain *p2m,
                                   struct page_info *page)
    {
        /*
         * p2m_ram_rw is chosen for a table entry as p2m table should be valid
         * from both P2M and hardware point of view.
         *
         * p2m_access_rwx is chosen so that no access restrictions are applied
         * to a table entry.
         */
        pte_t pte = p2m_pte_from_mfn(p2m, page_to_mfn(page), _gfn(0),
                                     p2m_ram_rw, p2m_access_rwx);

        /*
         * Since this function generates a table entry, according to "Encoding
         * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
         * to point to the next level of the page table.
         */
        pte.pte &= ~PTE_ACCESS_MASK;

        return pte;
    }

Does this make sense? Or would it be better to keep the current version of
page_to_p2m_table() and just improve the comment explaining why
p2m_access_n2rwx is used for a table entry?

+}
+
+static struct page_info *p2m_alloc_page(struct domain *d)
+{
+    struct page_info *pg;
+
+    /*
+     * For hardware domain, there should be no limit in the number of pages that
+     * can be allocated, so that the kernel may take advantage of the extended
+     * regions. Hence, allocate p2m pages for hardware domains from heap.
+     */
+    if ( is_hardware_domain(d) )
+    {
+        pg = alloc_domheap_page(d, MEMF_no_owner);
+        if ( pg == NULL )
+            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
+    }

The comment looks to have been taken verbatim from Arm. Whatever "extended
regions" are, does the same concept even exist on RISC-V?


Initially, I missed that it’s used only for Arm. Since it was mentioned in
docs/misc/xen-command-line.pandoc, I assumed it applied to all architectures.
But now I see that it’s Arm-specific: ### ext_regions (Arm)


Also, special casing Dom0 like this has benefits, but also comes with a
pitfall: If the system's out of memory, allocations will fail. A pre-
populated pool would avoid that (until exhausted, of course). If special-
casing of Dom0 is needed, I wonder whether ...

+    else
+    {
+        spin_lock(&d->arch.paging.lock);
+        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
+        spin_unlock(&d->arch.paging.lock);
+    }

... going this path but with a Dom0-only fallback to general allocation
wouldn't be the better route.


IIUC, then it should be something like:
  static struct page_info *p2m_alloc_page(struct domain *d)
  {
      struct page_info *pg;

      spin_lock(&d->arch.paging.lock);
      pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
      spin_unlock(&d->arch.paging.lock);

      if ( !pg && is_hardware_domain(d) )
      {
          /* Need to allocate more memory from domheap */
          pg = alloc_domheap_page(d, MEMF_no_owner);
          if ( pg == NULL )
          {
              printk(XENLOG_ERR "Failed to allocate pages.\n");
              return pg;
          }
          ACCESS_ONCE(d->arch.paging.total_pages)++;
          page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
      }

      return pg;
  }

And basically use d->arch.paging.p2m_freelist for both dom0less and dom0
domains, with the only difference being that in the case of Dom0,
d->arch.paging.p2m_freelist could be extended.

Do I understand your idea correctly?

(Probably, this is the reply you’re referring to:
  https://lore.kernel.org/xen-devel/43e89225-5e69-49a6-a8c8-bda6d120d...@suse.com/
At the moment, I can't find a better one.)
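
The "pool first, general allocation only as a hardware-domain fallback" idea
can be modelled in plain C as below. This is only a sketch under several
assumptions: malloc() stands in for alloc_domheap_page(), locking is left out,
all sketch_* names are hypothetical, and the fallback page is handed straight
to the caller and merely accounted for (whether it should also be linked into
the pool is a separate question that this model does not try to answer).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_page {
    struct sketch_page *next;
};

struct sketch_domain {
    bool is_hardware_domain;
    struct sketch_page *freelist;   /* pre-populated pool */
    unsigned long total_pages;      /* pages accounted to the pool */
};

/* Pop the head of the pool; returns NULL when the pool is exhausted. */
static struct sketch_page *sketch_pool_pop(struct sketch_domain *d)
{
    struct sketch_page *pg = d->freelist;

    if ( pg )
        d->freelist = pg->next;

    return pg;
}

static struct sketch_page *sketch_p2m_alloc_page(struct sketch_domain *d)
{
    struct sketch_page *pg = sketch_pool_pop(d);

    /* Only the hardware domain may fall back to the general allocator. */
    if ( !pg && d->is_hardware_domain )
    {
        pg = malloc(sizeof(*pg));   /* stands in for alloc_domheap_page() */
        if ( pg )
        {
            pg->next = NULL;
            d->total_pages++;
        }
    }

    return pg;
}
```

A non-hardware domain with an empty pool gets NULL back, while the hardware
domain only fails when the general allocator fails as well.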

+    return pg;
+}
+
+/* Allocate a new page table page and hook it in via the given entry. */
+static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
+{
+    struct page_info *page;
+    pte_t *p;
+
+    ASSERT(!p2me_is_valid(p2m, *entry));
+
+    page = p2m_alloc_page(p2m->domain);
+    if ( page == NULL )
+        return -ENOMEM;
+
+    page_list_add(page, &p2m->pages);
+
+    p = __map_domain_page(page);
+    clear_page(p);
+
+    unmap_domain_page(p);

clear_domain_page()? Or actually clear_and_clean_page()?


Agree, clear_and_clean_page() would be better here.

@@ -516,9 +591,33 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
                            unsigned int level, pte_t **table,
                            unsigned int offset)
  {
-    panic("%s: hasn't been implemented yet\n", __func__);
+    pte_t *entry;
+    int ret;
+    mfn_t mfn;
+
+    entry = *table + offset;
+
+    if ( !p2me_is_valid(p2m, *entry) )
+    {
+        if ( !alloc_tbl )
+            return GUEST_TABLE_MAP_NONE;
+
+        ret = p2m_create_table(p2m, entry);
+        if ( ret )
+            return GUEST_TABLE_MAP_NOMEM;
+    }
+
+    /* The function p2m_next_level() is never called at the last level */
+    ASSERT(level != 0);

Logically you would perhaps better do this ahead of trying to allocate a
page table. Calls here with level == 0 are invalid in all cases aiui, not
just when you make it here.


It makes sense. I will move ASSERT() to the start of the function
p2m_next_level().
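
Roughly, the resulting control flow would look like the standalone model
below. It is not the Xen function: the sketch_* types, the result enum and
the allocator hook are hypothetical, and mapping/unmapping of pages is
reduced to following a pointer. The only point is the ordering, with the
level check done before any allocation is attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_map_result {
    SKETCH_MAP_NORMAL,
    SKETCH_MAP_NONE,
    SKETCH_MAP_NOMEM,
    SKETCH_MAP_SUPER_PAGE,
};

struct sketch_entry {
    bool valid;
    bool is_mapping;            /* leaf (superpage) rather than table pointer */
    struct sketch_entry *next;  /* next-level table, if any */
};

/* Hypothetical allocator hook: returns NULL on failure. */
typedef struct sketch_entry *(*sketch_alloc_fn)(void);

/* Trivial allocators for demonstration purposes. */
static struct sketch_entry sketch_spare_table[4];
static struct sketch_entry *sketch_alloc_one(void) { return sketch_spare_table; }
static struct sketch_entry *sketch_alloc_fail(void) { return NULL; }

static enum sketch_map_result
sketch_next_level(bool alloc_tbl, unsigned int level,
                  struct sketch_entry **table, unsigned int offset,
                  sketch_alloc_fn alloc)
{
    struct sketch_entry *entry;

    /* Calls at the last level are invalid in all cases: check this first. */
    assert(level != 0);

    entry = *table + offset;

    if ( !entry->valid )
    {
        if ( !alloc_tbl )
            return SKETCH_MAP_NONE;

        entry->next = alloc();
        if ( !entry->next )
            return SKETCH_MAP_NOMEM;

        entry->valid = true;
        entry->is_mapping = false;
    }

    if ( entry->is_mapping )
        return SKETCH_MAP_SUPER_PAGE;

    /* Descend: the caller's table pointer now refers to the next level. */
    *table = entry->next;

    return SKETCH_MAP_NORMAL;
}
```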

+    if ( p2me_is_mapping(p2m, *entry) )
+        return GUEST_TABLE_SUPER_PAGE;
+
+    mfn = mfn_from_pte(*entry);
+
+    unmap_domain_page(*table);
+    *table = map_domain_page(mfn);

Just to mention it (may not need taking care of right away), there's an
inefficiency here: In p2m_create_table() you map the page to clear it.
Then you tear down that mapping, just to re-establish it here.


I will add:
    /*
     * TODO: There's an inefficiency here:
     *       In p2m_create_table(), the page is mapped to clear it.
     *       Then that mapping is torn down in p2m_create_table(),
     *       only to be re-established here.
     */
    *table = map_domain_page(mfn);

Thanks.

~ Oleksii

Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
