On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> > > [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that we are dumping all levels
> > > even when levels are folded for simplicity.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently it
> can be useful in the real world.

Yeah... I mean, out of scope for here, but that sounds dubious. On the other
hand, we use typedefs for page table values etc., so we do make this possible.

>
> >
> > >
> > > Signed-off-by: David Hildenbrand <da...@redhat.com>
> > > ---
> > >   mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> > >   1 file changed, 94 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 173eb6267e0ac..08d16ed7b4cc7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct 
> > > *mm, int *rss)
> > >                           add_mm_counter(mm, i, rss[i]);
> > >   }
> > >
> > > -/*
> > > - * This function is called to print an error when a bad pte
> > > - * is found. For example, we might have a PFN-mapped pte in
> > > - * a region that doesn't allow it.
> > > - *
> > > - * The calling function must still handle the error.
> > > - */
> > > -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > -                   pte_t pte, struct page *page)
> > > +static bool is_bad_page_map_ratelimited(void)
> > >   {
> > > - pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > - p4d_t *p4d = p4d_offset(pgd, addr);
> > > - pud_t *pud = pud_offset(p4d, addr);
> > > - pmd_t *pmd = pmd_offset(pud, addr);
> > > - struct address_space *mapping;
> > > - pgoff_t index;
> > >           static unsigned long resume;
> > >           static unsigned long nr_shown;
> > >           static unsigned long nr_unshown;
> > > @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> > > unsigned long addr,
> > >           if (nr_shown == 60) {
> > >                   if (time_before(jiffies, resume)) {
> > >                           nr_unshown++;
> > > -                 return;
> > > +                 return true;
> > >                   }
> > >                   if (nr_unshown) {
> > >                           pr_alert("BUG: Bad page map: %lu messages 
> > > suppressed\n",
> > > @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct 
> > > *vma, unsigned long addr,
> > >           }
> > >           if (nr_shown++ == 0)
> > >                   resume = jiffies + 60 * HZ;
> > > + return false;
> > > +}
> > > +
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned 
> > > long addr)
> > > +{
> > > + unsigned long long pgdv, p4dv, pudv, pmdv;
> >
> > > + p4d_t p4d, *p4dp;
> > > + pud_t pud, *pudp;
> > > + pmd_t pmd, *pmdp;
> > > + pgd_t *pgdp;
> > > +
> > > + /*
> > > +  * This looks like a fully lockless walk, however, the caller is
> > > +  * expected to hold the leaf page table lock in addition to other
> > > +  * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > +  * content while any concurrent modifications should be completely
> > > +  * prevented.
> > > +  */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)

Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
but I guess the intent is that the caller _must_ hold this lock.

I know it's nitty and annoying (sorry!) but as asserting seems to not be a
possibility here, could we spell these out as a series of points like:

/*
 * The caller MUST hold the following locks:
 *
 * - Leaf page table lock
 * - Appropriate VMA lock to keep VMA stable
 */

I don't _actually_ think you need the rmap lock then, as none of the page tables
you access would be impacted by any rmap action afaict, with these locks held.
>
>
> >
> > We need to:
> >
> > - Keep VMA stable which prevents unmap page table teardown and khugepaged
> >    collapse.
> > - (not relevant as we don't traverse PTE table but) RCU lock for PTE
> >    entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to racing
> > faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map(), that it is
> expected to be called with the PTL (unwritten rule, not changing that).

Right, see above, just needs more clarity then.

>
> >
> > So perhaps this why you require such strict conditions...
> >
> > But can you truly be sure of these existing? And we should then assert them
> > here no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nastiness of a code is called in case our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not to interested in adding anything more here. If you run into this
> code you're in big trouble already.

Yes, I am aware :) my concern is a NULL ptr deref or UAF, but with the locks
held as stated those won't occur.

But if it's not sensible to do it then we don't have to :) I am a reasonable
man, or like to think I am ;)

But I think we need clarity as per the above.

>
> >
> > > + pgdp = pgd_offset(mm, addr);
> > > + pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> >     "Shouldn't we strictly speaking use pgdp_get()? I see you use this
> >      helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)

Pain.

>
> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table entry 
> > > (e.g.,
> > > + * corrupted page table entry) is found. For example, we might have a
> > > + * PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table traversal... 
> > but
> > no mention of them here?
>
> Yeah, I can add that.

Thanks!

>
> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > +         unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > + pgoff_t index;
> > > +
> > > + if (is_bad_page_map_ratelimited())
> > > +         return;
> > >
> > >           mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > >           index = linear_page_index(vma, addr);
> > >
> > > - pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> > > -          current->comm,
> > > -          (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > + pr_alert("BUG: Bad page map in process %s  entry:%08llx", 
> > > current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the 'entry'
> > is? But I guess the dump below will tell you.
>
> You probably missed in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future, we
> could pass the entry level type similar to "enum rmap_level". For now, let's
> keep it simple."

Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
is fine then.

>
> >
> > Though maybe actually useful to see flags etc. in case some horrid
> > corruption happened and maybe dump isn't valid? But then the dump assumes
> > strict conditions to work so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here will
> not save you.
>
> You'll probably get a flood of 60 ... if your system doesn't collapse before
> that.

I was musing wistfully about the value of outputting the entries. But I
guess _any_ information before a crash is useful. But your dump output will
be a lot more useful :)

>
> >
> > > + __dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > >           if (page)
> > > -         dump_page(page, "bad pte");
> > > +         dump_page(page, "bad page map");
> > >           pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px 
> > > index:%lx\n",
> > >                    (void *)addr, vma->vm_flags, vma->anon_vma, mapping, 
> > > index);
> > >           pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps 
> > > read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct 
> > > *vma, unsigned long addr,
> > >                   if (is_zero_pfn(pfn))
> > >                           return NULL;
> > >
> > > -         print_bad_pte(vma, addr, pte, NULL);
> > > +         print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > >                   return NULL;
> > >           }
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct 
> > > *vma, unsigned long addr,
> > >
> > >   check_pfn:
> > >           if (unlikely(pfn > highest_memmap_pfn)) {
> > > -         print_bad_pte(vma, addr, pte, NULL);
> > > +         print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but it's
> > a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch I add
> a comment.

Ohhh right. I was overthinking this haha

>
> >
> > >                   return NULL;
> > >           }
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct 
> > > vm_area_struct *vma, unsigned long addr,
> > >   {
> > >           unsigned long pfn = pmd_pfn(pmd);
> > >
> > > - if (unlikely(pmd_special(pmd)))
> > > + if (unlikely(pmd_special(pmd))) {
> > > +         if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +                 return NULL;
> >
> > I guess we'll bring this altogether in a later patch with vm_normal_page()
> > as getting a little duplicative :P
>
> Not that part, but the other part :)

Yes :)

>
> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than fact
> > 'special' is nebulous and overloaded in general) in that it contains stuff
> > that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
> > specialness wrt to underlying folio.
>
> It is.

Yeah I think we in mm periodically moan about this whole thing...

>
> >
> > Then we have VM_IO, which strictly must not have an associated page right?
>
> VM_IO just means read/write side-effects, I think you could have ones with
> an memmap easily ... e.g., memory section (128MiB) spanning both memory and
> MMIO regions.

Hmm, but why not have two separate VMAs? I guess I need to look more into
what this flag actually affects.

But in terms of VMA manipulation it really is no different from e.g.
VM_PFNMAP in what it affects. Though, well, I say that. Except of course
this whole vm_normal_page() thing...!

But I've not seen VM_IO set alone anywhere. On the other hand, I haven't
really dug deep on this...

>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
