On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > 	[   77.943408] BUG: Bad page map in process XXX entry:80000001233f5867
> > > 	[   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > 	[   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that we are dumping all levels
> > > even when levels are folded for simplicity.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently
> it can be useful in the real world.
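As an aside, for anyone reading along: the pgd/p4d/pud/pmd values in that dump come from a plain top-down walk, and the per-level index derivation is easy to sketch in userspace. This is not kernel code, just an illustration assuming an x86-64-style 4-level layout (4 KiB pages, 9 index bits per level); the macro and function names are stand-ins for the kernel's pgd_index() and friends.

```c
#include <assert.h>

/*
 * Sketch: how each page-table level's index is derived from a virtual
 * address, assuming 4 KiB pages and 9 bits of index per level (so each
 * table holds 512 entries). Names mimic, but are not, the kernel's.
 */
#define PAGE_SHIFT	12
#define PTRS_PER_LVL	512UL			/* 9 bits per level */
#define LVL_MASK	(PTRS_PER_LVL - 1)

#define PMD_SHIFT	(PAGE_SHIFT + 9)	/* 21 */
#define PUD_SHIFT	(PMD_SHIFT + 9)		/* 30 */
#define PGDIR_SHIFT	(PUD_SHIFT + 9)		/* 39 */

static unsigned long pgd_index(unsigned long addr)
{
	return (addr >> PGDIR_SHIFT) & LVL_MASK;
}

static unsigned long pud_index(unsigned long addr)
{
	return (addr >> PUD_SHIFT) & LVL_MASK;
}

static unsigned long pmd_index(unsigned long addr)
{
	return (addr >> PMD_SHIFT) & LVL_MASK;
}

static unsigned long pte_index(unsigned long addr)
{
	return (addr >> PAGE_SHIFT) & LVL_MASK;
}
```

For the addr in the report above (00007fd84bb1c000), pgd_index() comes out to 255; with folded levels (e.g. p4d on 4-level configs) the kernel reuses the upper-level entry, which is why the dump shows pgd and p4d with the same value.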
Yeah... I mean, out of scope for here, but that sounds dubious. On the other hand, we use typedefs for page table values etc. etc. so we do make this possible.

> > > Signed-off-by: David Hildenbrand <da...@redhat.com>
> > > ---
> > >  mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> > >  1 file changed, 94 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 173eb6267e0ac..08d16ed7b4cc7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > >  			add_mm_counter(mm, i, rss[i]);
> > >  }
> > >
> > > -/*
> > > - * This function is called to print an error when a bad pte
> > > - * is found. For example, we might have a PFN-mapped pte in
> > > - * a region that doesn't allow it.
> > > - *
> > > - * The calling function must still handle the error.
> > > - */
> > > -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > -			  pte_t pte, struct page *page)
> > > +static bool is_bad_page_map_ratelimited(void)
> > >  {
> > > -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > -	p4d_t *p4d = p4d_offset(pgd, addr);
> > > -	pud_t *pud = pud_offset(p4d, addr);
> > > -	pmd_t *pmd = pmd_offset(pud, addr);
> > > -	struct address_space *mapping;
> > > -	pgoff_t index;
> > >  	static unsigned long resume;
> > >  	static unsigned long nr_shown;
> > >  	static unsigned long nr_unshown;
> > > @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >  	if (nr_shown == 60) {
> > >  		if (time_before(jiffies, resume)) {
> > >  			nr_unshown++;
> > > -			return;
> > > +			return true;
> > >  		}
> > >  		if (nr_unshown) {
> > >  			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> > > @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >  	}
> > >  	if (nr_shown++ == 0)
> > >  		resume = jiffies + 60 * HZ;
> > > +	return false;
> > > +}
> > > +
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	unsigned long long pgdv, p4dv, pudv, pmdv;
> > > +	p4d_t p4d, *p4dp;
> > > +	pud_t pud, *pudp;
> > > +	pmd_t pmd, *pmdp;
> > > +	pgd_t *pgdp;
> > > +
> > > +	/*
> > > +	 * This looks like a fully lockless walk, however, the caller is
> > > +	 * expected to hold the leaf page table lock in addition to other
> > > +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > +	 * content while any concurrent modifications should be completely
> > > +	 * prevented.
> > > +	 */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)

Yeah, sorry, I was in 'what locks do we need' mode and hadn't shifted back here, but I guess the intent is that the caller _must_ hold this lock.

I know it's nitty and annoying (sorry!) but as asserting seems to not be a possibility here, could we spell these out as a series of points like:

	/*
	 * The caller MUST hold the following locks:
	 *
	 * - Leaf page table lock
	 * - Appropriate VMA lock to keep VMA stable
	 */

I don't _actually_ think you need the rmap lock then, as none of the page tables you access would be impacted by any rmap action afaict, with these locks held.

> >
> > We need to:
> >
> > - Keep VMA stable, which prevents unmap page table teardown and
> >   khugepaged collapse.
> > - (not relevant as we don't traverse the PTE table, but) RCU lock for
> >   PTE entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to
> > racing faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map(), that it is
> expected to be called with the PTL (unwritten rule, not changing that).

Right, see above, it just needs more clarity then.

> >
> > So perhaps this is why you require such strict conditions...
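For what it's worth, the ratelimiting scheme in the hunk above (show up to 60 reports, then suppress for 60 seconds and count what was dropped) is easy to model outside the kernel. A minimal userspace sketch, with `jiffies` as a fake clock and a plain `<` comparison standing in for the kernel's wraparound-safe time_before():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace sketch of is_bad_page_map_ratelimited(): permit up to 60
 * reports, then suppress further ones for 60 seconds, and report how
 * many were suppressed once the window expires. Not kernel code.
 */
#define HZ 100
static unsigned long jiffies;	/* fake clock, advanced by the caller */

static bool is_bad_page_map_ratelimited(void)
{
	static unsigned long resume;
	static unsigned long nr_shown;
	static unsigned long nr_unshown;

	if (nr_shown == 60) {
		/* kernel uses time_before(jiffies, resume) to handle wrap */
		if (jiffies < resume) {
			nr_unshown++;
			return true;	/* suppressed */
		}
		if (nr_unshown) {
			printf("BUG: Bad page map: %lu messages suppressed\n",
			       nr_unshown);
			nr_unshown = 0;
		}
		nr_shown = 0;
	}
	if (nr_shown++ == 0)
		resume = jiffies + 60 * HZ;
	return false;			/* caller may print the report */
}
```

So a flood of bad entries produces at most 60 reports per minute, which matches David's "flood of 60" remark further down.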
> >
> > But can you truly be sure of these existing? And we should then assert
> > them here, no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nastiness of a code is called in case our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not too interested in adding anything more here. If you run
> into this code you're in big trouble already.

Yes, am aware :) my concern is NULL ptr deref or UAF, but with the locks held as stated those won't occur.

But if it's not sensible to do it then we don't have to :) I am a reasonable man, or like to think I am ;)

But I think we need clarity as per the above.

> >
> > > +	pgdp = pgd_offset(mm, addr);
> > > +	pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> > "Shouldn't we strictly speaking use pgdp_get()? I see you use this
> > helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)

Pain.

> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table
> > > + * entry (e.g., corrupted page table entry) is found. For example, we
> > > + * might have a PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table
> > traversal... but no mention of them here?
>
> Yeah, I can add that.

Thanks!

> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > +		unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +	pgoff_t index;
> > > +
> > > +	if (is_bad_page_map_ratelimited())
> > > +		return;
> > >
> > >  	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > >  	index = linear_page_index(vma, addr);
> > >
> > > -	pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
> > > -		 current->comm,
> > > -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > +	pr_alert("BUG: Bad page map in process %s entry:%08llx", current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the
> > 'entry' is? But I guess the dump below will tell you.
>
> You probably missed in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from
> the page table dump or from the dumped stack. If ever required in the
> future, we could pass the entry level type similar to "enum rmap_level".
> For now, let's keep it simple."

Yeah, sorry, I glossed over the commit msg, and now I pay for it ;) OK, this is fine then.

> >
> > Though maybe actually useful to see flags etc. in case some horrid
> > corruption happened and maybe the dump isn't valid? But then the dump
> > assumes strict conditions to work so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here
> will not save you.
>
> You'll probably get a flood of 60 ... if your system doesn't collapse
> before that.

I was musing wistfully about the value of outputting the entries. But I guess _any_ information before a crash is useful.
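(For readers wondering what the `index` in the report means: linear_page_index() yields the page offset of the faulting address within the mapped object. A simplified userspace sketch, with a hypothetical stand-in struct rather than the kernel's struct vm_area_struct, and ignoring the hugetlb rounding the real helper applies:)

```c
#include <assert.h>

/*
 * Sketch of what linear_page_index() computes: the offset of 'addr'
 * within the VMA, in pages, plus the file offset (in pages) at which
 * the VMA starts. 'struct vma_sketch' is an illustrative stand-in.
 */
#define PAGE_SHIFT 12

struct vma_sketch {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

static unsigned long linear_page_index(const struct vma_sketch *vma,
				       unsigned long addr)
{
	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}
```

That is why the report prints `index` together with `mapping`: the pair identifies which page of which file (or anon mapping) the bad entry was supposed to map.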
But your dump output will be a lot more useful :)

> >
> > > +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > >  	if (page)
> > > -		dump_page(page, "bad pte");
> > > +		dump_page(page, "bad page map");
> > >  	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> > >  		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > >  	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >  		if (is_zero_pfn(pfn))
> > >  			return NULL;
> > >
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > >  		return NULL;
> > >  	}
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >
> > >  check_pfn:
> > >  	if (unlikely(pfn > highest_memmap_pfn)) {
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but
> > it's a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch
> I add a comment.

Ohhh right.
I was overthinking this haha

> >
> > >  		return NULL;
> > >  	}
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >  {
> > >  	unsigned long pfn = pmd_pfn(pmd);
> > >
> > > -	if (unlikely(pmd_special(pmd)))
> > > +	if (unlikely(pmd_special(pmd))) {
> > > +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +			return NULL;
> >
> > I guess we'll bring this all together in a later patch with
> > vm_normal_page(), as it's getting a little duplicative :P
>
> Not that part, but the other part :)

Yes :)

> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than the
> > fact 'special' is nebulous and overloaded in general) in that it
> > contains stuff that is VMA-special, but only VM_PFNMAP | VM_MIXEDMAP
> > really indicate specialness wrt the underlying folio.
>
> It is.

Yeah, I think we in mm periodically moan about this whole thing...

> >
> > Then we have VM_IO, which strictly must not have an associated page,
> > right?
>
> VM_IO just means read/write side-effects, I think you could have ones
> with a memmap easily ... e.g., a memory section (128MiB) spanning both
> memory and MMIO regions.

Hmm, but why not have two separate VMAs? I guess I need to look more into what this flag actually affects.

But in terms of VMA manipulation, in any case, it really is no different from e.g. VM_PFNMAP in terms of what it affects.

Though well. I say that. Except of course this whole vm_normal_page() thing...!

But I've not seen VM_IO set alone anywhere. On the other hand, I haven't really dug deep on this...

> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
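P.S. To make the "special" branch discussed above concrete: the new check says that when the special bit is set, a VM_PFNMAP/VM_MIXEDMAP VMA legitimately has no struct page, whereas a special entry in any other VMA is a candidate for print_bad_page_map(). A simplified decision sketch (the flag values match the kernel's, but the enum and function are illustrative stand-ins, and it omits cases like find_special_page() that the real vm_normal_page() handles first):

```c
#include <assert.h>
#include <stdbool.h>

#define VM_PFNMAP	0x00000400UL
#define VM_MIXEDMAP	0x10000000UL

enum entry_kind {
	ENTRY_NO_PAGE,	/* legitimately no struct page: just return NULL */
	ENTRY_BAD,	/* special bit set where the VMA doesn't allow it */
};

/*
 * Sketch of the decision made for a pte/pmd with the special bit set.
 * Illustrative only; not the kernel's vm_normal_page()/..._pmd().
 */
static enum entry_kind classify_special_entry(unsigned long vm_flags,
					      bool is_zero_pfn)
{
	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return ENTRY_NO_PAGE;	/* raw-PFN mapping, no page expected */
	if (is_zero_pfn)
		return ENTRY_NO_PAGE;	/* shared zero page is fine */
	return ENTRY_BAD;		/* candidate for print_bad_page_map() */
}
```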