On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
> wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd
Hello,
On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described
I updated the above partly quoted sentence since in the previous
version it
Hello Linus,
On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *pa
");
} else
printf("memory corruption detected\n");
}
skip_memset = !skip_memset;
if (!skip_memset)
memset(mem, 0xff, HARDBLKSIZE);
}
return 0;
}
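The program above is cut off; a self-contained sketch of the same alternating write-and-verify corruption check could look like the below (HARDBLKSIZE, BOUNCES and the plain in-memory buffer are assumptions; the real test races I/O or COW against the buffer):

#include <stdio.h>
#include <string.h>

#define HARDBLKSIZE 512         /* assumed value */
#define BOUNCES 100             /* assumed value */

int main(void)
{
        static unsigned char mem[HARDBLKSIZE];
        int skip_memset = 0;

        for (int bounce = 0; bounce < BOUNCES; bounce++) {
                if (!skip_memset)
                        memset(mem, 0xff, HARDBLKSIZE);

                /* the real test would race a COW/GUP user against mem here */

                if (!skip_memset) {
                        int corrupt = 0;

                        for (int i = 0; i < HARDBLKSIZE; i++)
                                if (mem[i] != 0xff)
                                        corrupt = 1;
                        if (!corrupt)
                                printf("bounce %d passed\n", bounce);
                        else
                                printf("memory corruption detected\n");
                }
                skip_memset = !skip_memset;
        }
        return 0;
}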
And
splice itself,
and only at a second stage in the COW code.
Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com
Cc: sta...@kernel.org
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Andrea Arcangeli
---
include/linux/ksm.h | 7 ++
mm/
Hello Jason,
On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> There is already a patch series floating about to do exactly that for
> FOLL_LONGTERM pins based on the existing code in GUP for CMA migration
Sounds great.
> The ship sailed on this a decade ago, it is completely in
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
>
> writable.
>
> I can't find any users at all of this mechanism, so just remove it.
Reviewed-by: Andrea Arcangeli
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote:
> Can we just remove vmsplice() support? We could make it do a normal
The single case I've seen vmsplice used for so far that was really cool
is localhost live migration of qemu. However, despite being really cool,
it wasn't merged in the end
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > The majority cannot be converted to notifiers because they are DMA
> > > based. Every one of those is an ABI for something, and does n
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
>
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.
Do
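For context, the page_count() test Linus is referring to is the wp-fault reuse rule introduced by 09854ba94c6a; a simplified sketch of the idea (not the exact do_wp_page() code) is:

/*
 * Simplified sketch of the 09854ba94c6a ("mm: do_wp_page()
 * simplification") reuse rule, not the exact kernel code: reuse the
 * anon page in place only if we hold the sole reference, otherwise
 * copy. Any GUP pin or speculative pagecache lookup elevates
 * page_count() and forces a copy.
 */
static bool cow_can_reuse(struct page *page)
{
        return page_count(page) == 1;
}

Andrea's objection upthread is precisely the last point: a transient speculative lookup can elevate the count at the wrong moment, so a long term pin cannot be reliably distinguished from it.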
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote:
> > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
>
Hello everyone,
On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote:
> On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> > Please. Why is the correct patch not the attached one (apart from the
> > obvious fact that I haven't tested it and maybe just screwed up
> > completel
On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote:
> Ho humm. I had obviously not looked very much at that code. I had done
> a quick git grep, but now that I look closer, it *does* get the
> mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
> then it does a mmap_w
On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli wrote:
> >
> > Random memory corruption will still silently materialize as a result of
> > the speculative lookups in the above scenario.
>
> Explain.
On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote:
> So I think we can agree that even that softdirty case we can just
> handle by "don't do that then".
Absolutely. The question is if somebody was happily running clear_refs
with an RDMA attached to the process, by the time they update
On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli wrote:
> >
> > The problem is it's not even possible to detect reliably if there's
> > really a long term GUP pin because of speculative pagecache look
On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> I think those would very much be worth fixing, so that if
> UFFDIO_WRITEPROTECT taking the mmap_sem for writing causes problems,
> we can _fix_ those problems.
>
> But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as
> s
On Thu, Jan 07, 2021 at 12:32:09PM -0800, Linus Torvalds wrote:
> I think Andrea is blinded by his own love for UFFDIO: when I do a
> debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel
> and strace (and the qemu copies of the kernel headers).
For the record, I feel obliged to re
On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
>
> > vmsplice syscall API is insecure allowing long term GUP PINs without
> > privilege.
>
> Lots of places are relying on pin_user_pages
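To make the vmsplice concern concrete: the pattern at issue is roughly the below, a hedged userspace sketch (not the actual reproducer) showing that an unprivileged process can keep pages of its memory referenced by a pipe indefinitely:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        static char buf[4096] __attribute__((aligned(4096)));
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int pipefd[2];

        memset(buf, 'A', sizeof(buf));
        if (pipe(pipefd))
                return 1;
        /*
         * vmsplice() makes the pipe reference buf's pages directly,
         * with no privilege required; the reference outlives any later
         * COW or wrprotect decision taken on this mapping, which is
         * the "long term GUP pin" being discussed.
         */
        if (vmsplice(pipefd[1], &iov, 1, 0) != sizeof(buf))
                return 1;
        /* the page contents can be read back from pipefd[0] much later */
        return 0;
}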
Hi Linus,
On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli wrote:
> >
> > However there are two cases that could be wrprotecting exclusive anon
> > pages with only the mmap_read_lock:
>
> I still think the
lly. So I tried to fix even clear_refs to
cope with it, but this is only the tip of the iceberg of what really
breaks.
So in short I contextually self-NAK 2/2 of this patchset and we need
to somehow reverse 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 instead.
Thanks,
Andrea
Andrea Arcangeli (1):
mm
+1104+1667+1101+1365+913+1108)
bounces: 71, mode: rnd racing ver read, page_nr 25241 memory corruption 6 7
After the commit the userland memory corruption is gone as expected.
Cc: sta...@kernel.org
Reported-by: Nadav Amit
Suggested-by: Yu Zhao
Signed-off-by: Andrea Arcangeli
---
fs/pr
g the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the soft-dirty path, much like mprotect() does already.
Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush")
Signed-off-by: Wil
On Thu, Jan 07, 2021 at 06:28:29PM +0100, Vlastimil Babka wrote:
> On 1/6/21 9:18 PM, Hugh Dickins wrote:
> > On Wed, 6 Jan 2021, Andrea Arcangeli wrote:
> >>
> >> I'd be surprised if the kernel can boot with BUG_ON() defined as "do
> >> {}while
Hello,
On Wed, Jan 06, 2021 at 11:46:20AM -0800, Andrew Morton wrote:
> On Tue, 5 Jan 2021 20:28:27 -0800 (PST) Hugh Dickins wrote:
>
> > Alex, please consider why the authors of these lines (whom you
> > did not Cc) chose to write them without BUG_ON(): it has always
> > been preferred practice
On Tue, Jan 05, 2021 at 10:16:29PM +0000, Will Deacon wrote:
> On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote:
> > > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli wrote:
> > >
> > > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
>
On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote:
> It is also about performance due to unwarranted TLB flushes.
If there turns out to be a problem, switching to the wait_flush_pending()
model suggested by Peter may not even require changes to the common code
in memory.c since I'm thinking it may
On Tue, Jan 05, 2021 at 08:06:22PM +0000, Nadav Amit wrote:
> I just thought that there might be some insinuation, as you mentioned VMware
> by name. My response was half-jokingly and should have had a smiley to
> prevent you from wasting your time on the explanation.
No problem, actually I apprec
On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
> > On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote:
> >
> > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes
>
On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote:
> > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote:
> > I just don't like to slow down a feature required in the future for
> > implementing postcopy live snapshotting or other snapshots to userland
> >
On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote:
> Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply
> to v1 patchset) here only because the uffd_wp check is pure optimization; while
Agreed it's a pure optimization.
Only if we used the group lock to fi
On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote:
> I would feel more comfortable if you provide patches for uffd-wp. If you
> want, I will do it, but I restate that I do not feel comfortable with this
> solution (worried as it seems a bit ad-hoc and might leave out a scenario
> we all mi
On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
Targeting a backport down to 2013 when nothing could go wrong in practice
with page_mapcount sounds backwards and unnecessarily risky.
In theory it was already b
On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index ab709023e9aa..c08c4055b051 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -75,7 +75,8 @@ static unsigned lo
On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)
The COW doesn't need to hold the PT lock, and the TLBI broadcast doesn't
need to be delivered unde
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > >
On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote:
> >
> > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote:
> >>>
On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote:
> >
> > Hello,
> >
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nad
Hello,
On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> >
> >   cpu0          cpu1          cpu2
> >
Hello Mike,
On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
> Thanks for the logs, it seems that implicitly adding reserved regions to
> memblock.memory wasn't that bright idea :)
Would it be possible to somehow clean up the hack then?
The only difference between the clean solutio
On Thu, Dec 24, 2020 at 01:49:45PM -0500, Andrea Arcangeli wrote:
> Without the above, can't the CPU decrement the tlb_flush_pending while
> the IPI to ack didn't arrive yet in csd_lock_wait?
Ehm: csd_lock_wait has smp_acquire__after_ctrl_dep() so the write side
looks ok after all, sorry.
On Wed, Dec 23, 2020 at 09:18:09PM -0800, Nadav Amit wrote:
> I am not trying to be argumentative, and I did not think through about an
> alternative solution. It sounds to me that your proposed solution is correct
> and would probably be eventually (slightly) more efficient than anything
> that I
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > I think there are other cases in which Andy’s concern is relevant
> > (MADV_PAGEOUT).
I didn't try to figure out how it would help MADV_PAGEOUT. MADV_PAGEOUT
sounds like a cool feature, but maybe it'd need a way to flush the
invalidates out an
On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement that would be easy to add is "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
The next worry then is if UFFDIO_WRITEPROTECT is very large then th
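The check being discussed would gate the extra COW-side work on the only two vma types that can be wrprotected under mmap_read_lock; as a sketch (placement and helper name are assumptions, not kernel code):

/*
 * Sketch only, helper name is an assumption: the costly "wait for a
 * pending TLB flush" handling in the COW path would only be needed
 * for vmas that soft-dirty or uffd-wp can write-protect while
 * holding mmap_lock for reading.
 */
static inline bool vma_needs_cow_flush_serialization(struct vm_area_struct *vma)
{
        return vma->vm_flags & (VM_SOFTDIRTY | VM_UFFD_WP);
}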
On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending
> set for quite a while — mprotect seems like it can wait in IO while splitting
> a huge page, for example. That gives us a window in which every write
Hello Linus,
On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote:
> On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli wrote:
> >
> > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > > Thanks for the details.
> >
> > I hope we can find a
On Wed, Dec 23, 2020 at 02:45:59PM -0800, Nadav Amit wrote:
> I think it may be reasonable.
Whatever solution is used, there will be two users of it: uffd-wp will use
whatever technique is used by clear_refs_write to avoid the
mmap_write_lock.
My favorite is Yu's patch and not the group lock anymore. The
On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote:
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, would
On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
> >
> > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote:
> >>> Using mmap_write_lock() was my initial fix and there w
On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> Thanks for the details.
I hope we can find a way to put the page_mapcount back where there's a
page_count right now.
If you're so worried about having to maintain a well defined, well
documented (or to be documented even better if you ACK it)
On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > In your patch, do we need to take wrprotect_rwsem in
> > handle_userfault() as well? Otherwise, it seems userspace would have
> > to synchronize between its wrprotect ioctl and f
On Wed, Dec 23, 2020 at 01:51:59PM -0500, Andrea Arcangeli wrote:
> NOTE: about the above comment, that mprotect takes
> mmap_read_lock. Your above code change in the commit above, still has
write
Correction to avoid any confusion.
On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it. So IMHO there's no problem here.
>
> But I do think in step 2 here we overlo
On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.
In my view UFFD is an already deployed working solution that
eliminates the mmap_lock_write conten
On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)
The p
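Yu's proposal is essentially the pre-09854ba94c6a rule; a simplified sketch, with the catch spelled out in the comment:

/*
 * Simplified sketch of a page_mapcount()-based reuse rule. The catch
 * discussed in this thread: a GUP pin (e.g. via vmsplice) elevates
 * page_count() but not page_mapcount(), so this test alone would
 * reuse, and hand writable back to the parent, a page that the pin
 * is still reading through.
 */
static bool cow_can_reuse_by_mapcount(struct page *page)
{
        return page_mapcount(page) == 1;
}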
On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> This works but I don't prefer this option because 1) this is a new
> way of making pte_wrprotect safe and 2) it's specific to ufd and
> can't be applied to clear_soft_dirty() which has no "catcher". No
I didn't look into clear_soft_dirty iss
On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.
What don't you feel saf
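As I understand Nadav's idea, it would look something like the sketch below (all field and helper names are assumptions): a generation counter next to the page-table page, bumped under the PT lock on every PTE change, captured at fault time and rechecked together with pte_same():

/* Sketch, names are assumptions, not an existing kernel API. */
struct pt_gen {
        atomic64_t gen;         /* would live in the page-table's struct page */
};

static inline u64 pt_gen_snapshot(struct pt_gen *pt)
{
        return atomic64_read(&pt->gen);
}

static inline void pt_gen_bump(struct pt_gen *pt)
{
        /* called under the PT lock on every PTE modification */
        atomic64_inc(&pt->gen);
}

/*
 * Extended pte_same(): also fails if any PTE in the page table
 * changed since "snap" was taken, catching RO->RW->RO flips that
 * plain pte_same() cannot see.
 */
static inline bool pte_same_gen(pte_t a, pte_t b, struct pt_gen *pt, u64 snap)
{
        return pte_same(a, b) && pt_gen_snapshot(pt) == snap;
}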
On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the
The current rule is that by the time the page fault finds a
pte/hugepmd in certain st
On Tue, Dec 22, 2020 at 08:15:53PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> > My previous suggestion to use a mutex to serialize
> > userfaultfd_writeprotect will still work, but we can run
> > as
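For reference, that earlier suggestion amounts to something like the following sketch (the global mutex and wrapper are illustrative only; a per-mm lock would be the more scalable variant):

/* Sketch: serialize UFFDIO_WRITEPROTECT invocations with a mutex so
 * two concurrent wrprotect/unprotect runs cannot interleave their
 * PTE updates and deferred TLB flushes. Illustrative only. */
static DEFINE_MUTEX(uffd_wrprotect_mutex);

static int uffd_wrprotect_serialized(struct mm_struct *mm,
                                     unsigned long start, unsigned long len,
                                     bool enable_wp, bool *mmap_changing)
{
        int ret;

        mutex_lock(&uffd_wrprotect_mutex);
        ret = mwriteprotect_range(mm, start, len, enable_wp, mmap_changing);
        mutex_unlock(&uffd_wrprotect_mutex);
        return ret;
}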
On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
I assume you mean "in" mprotect_fixup, after change_protection.
If you would downgrade the mmap_lock to read there, then it'd severely
slow down the non contention
be more intrusive
in the VM and it's overall unnecessary.
The below is mostly untested... but it'd be good to hear some feedback
before doing more work in this direction.
From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Tue, 22 Dec
Hello,
On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli wrote:
> > The ptes are changed always with the PT lock, in fact there's no
> > problem with the PTE updates. The only difference with mprotect
>
On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> I missed the beginning of this thread, but it looks to me like
> userfaultfd changes PTEs with no locking except mmap_read_lock(). It
There's no mmap_read_lock, I assume you mean mmap_lock for reading.
The ptes are changed alway
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote:
> >
> > [ cc’ing some more people who have experience with similar problems ]
> >
> >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
> >
Hello,
On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> Analyzing this problem indicates that there is a real bug since
> mmap_lock is only taken for read in mwriteprotect_range(). This might
Never having to take the mmap_sem for writing, and in turn never
blocking, in order to modif
> >>> memblock.memory
> >>> before calculating node and zone boundaries.
> >>>
> >>> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
> >>> rather that check each PFN")
> >>> Reported-by: Andrea Arcangel
Hello,
On Wed, Dec 09, 2020 at 11:43:04PM +0200, Mike Rapoport wrote:
> +void __init __weak memmap_init(unsigned long size, int nid,
> +unsigned long zone,
> +unsigned long range_start_pfn)
> +{
> + unsigned long start_pfn, end_pfn, hole_
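The function body this quote cuts off clamps each memblock.memory region to the zone's pfn range before initializing the memmap; the pre-patch upstream helper it replaces looks roughly like this (a sketch from memory, not the exact patch being quoted):

void __init __weak memmap_init(unsigned long size, int nid,
                               unsigned long zone,
                               unsigned long range_start_pfn)
{
        unsigned long start_pfn, end_pfn;
        unsigned long range_end_pfn = range_start_pfn + size;
        int i;

        /* walk memblock.memory and only init the parts inside the range */
        for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
                start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
                end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);

                if (end_pfn > start_pfn)
                        memmap_init_zone(end_pfn - start_pfn, nid, zone,
                                         start_pfn, MEMINIT_EARLY, NULL);
        }
}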
Hi Mel,
On Thu, Nov 26, 2020 at 10:47:20AM +, Mel Gorman wrote:
> Agreed. This thread has a lot of different directions in it at this
> point so what I'd hope for is first, a patch that initialises holes with
> zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> hole, it wo
at is not part of the
zone in order to boot with the very fix that was meant to prevent such
an invariant from being broken in the first place.
I don't think pfn 0 deserves a magic exception and a pass to break the
invariant (even ignoring it may happen that the first pfn in nid > 0
might then also g
ed to be extended to
include all memblock.reserved ranges with struct pages too or they'll
be left uninitialized with PagePoison as it happened to pfn 0.
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
that check each PFN")
Signed-off-by: Andrea Arcangeli
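Concretely, covering memblock.reserved too would mean also walking those ranges and giving any pfn not covered by memblock.memory an initialized (zone 0 / node 0, PageReserved) memmap instead of leaving it PagePoison; a sketch in the style of init_unavailable_mem(), with the helper usage illustrative rather than the actual patch:

/* Sketch, in the style of init_unavailable_mem(): make sure every
 * valid pfn in memblock.reserved gets an initialized struct page
 * instead of staying PagePoison. Illustrative only. */
static void __init init_reserved_only_memmap(void)
{
        phys_addr_t start, end;
        u64 i;

        for_each_reserved_mem_range(i, &start, &end) {
                unsigned long pfn = PFN_DOWN(start);
                unsigned long end_pfn = PFN_UP(end);

                for (; pfn < end_pfn; pfn++) {
                        if (!pfn_valid(pfn))
                                continue;
                        /* zone 0 / node 0, like init_unavailable_mem() */
                        __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
                        __SetPageReserved(pfn_to_page(pfn));
                }
        }
}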
On Fri, Dec 04, 2020 at 02:23:29PM -0500, Peter Xu wrote:
> If we see [1]:
>
> if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma &&
> !is_migration_entry)
>
> Then it's fundamentally the same as:
>
> swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)
Yes conceptu
On Thu, Dec 03, 2020 at 11:10:18PM -0500, Andrea Arcangeli wrote:
> from the pte, one that cannot ever be set in any swp entry today. I
> assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> you may want to verify it...
I thought more about the above, and I thi
Hi Peter,
On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fill in the pfn information into the swp offset field rather than a real
> offset (please refer to make_migration_entry())? I assume PFN can
On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> > On Wed, 2 Dec 2020, Peter Xu wrote:
> > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > > On Tue, 1 De
Hello,
On Thu, Dec 03, 2020 at 08:25:49AM +0200, Mike Rapoport wrote:
> On Wed, Dec 02, 2020 at 03:47:36PM -0800, Andrew Morton wrote:
> > On Tue, 1 Dec 2020 20:15:02 +0200 Mike Rapoport wrote:
> >
> > > From: Mike Rapoport
> > >
> > > There could be struct pages that are not backed by actual
On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
> >memblock.reserve that doesn't overlap any memblock.memory zone.
>
zone_end_pfn() 0  contiguous 0
Movable   zone_start_pfn 0  zone_end_pfn() 0  contiguous 0
500246 0x7a216000 0xfff1000       reserved True
500247 0x7a217000 0x1fff          reserved False
500248 0x7a218000 0x1fff00010200  reserved False ]
quote from previou
Hello Mike,
On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> Hello Andrea,
>
> On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> >
> > Let's try to merge init_unavailable_memory() into memmap_in
On Tue, Dec 01, 2020 at 05:30:33PM -0500, Peter Xu wrote:
> On Tue, Dec 01, 2020 at 12:59:27PM +0000, Matthew Wilcox wrote:
> > On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote:
> > > Faulting around for reads are in most cases helpful for the performance
> > > so that
> > > continuous mem
Hi Hugh,
On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote:
> Please don't ever rely on that i_private business for correctness: as
The out of order and lockless "if (inode->i_private)" branch didn't
inspire much confidence in terms of being able to rely on it for
locking correctness i
Hi Peter,
On Sat, Nov 28, 2020 at 10:29:03AM -0500, Peter Xu wrote:
> Yes. A trivial supplementary detail is that filemap_map_pages() will only set
> it read-only since alloc_set_pte() will only set write bit if it's a write.
> In
> our case it's read fault-around so without it. However it'll
stop faulting in pages without checking
inode->i_private
Per the shmem_fault comment, shmem needs to "refrain from faulting pages
into the hole while it's being punched" and to do so it must check
inode->i_private, which filemap_map_pages won't, so it's unsafe to use
in shmem becau
On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote:
> TBH, the whole interaction between e820 and memblock keeps me puzzled
> and I can only make educated guesses why some ranges here are
> memblock_reserve()'d and some memblock_add()ed.
The mixed usage in that interaction between membl
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's invented by your BIOS vendor :)
BTW, all systems I use on a daily basis have that type 20... Only two
of them are reproducing the VM_BUG_ON on a weekly basis on v5.9.
If you search 'E820 "type 20"' you'll get plenty of
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> memory.reserved cannot be calculated automatically. It represents all
> the memory allocations made before the page allocator is up. And as
> memblock_reserve() is the most basic way to allocate memory early at boot we
> cannot really delete
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> I agree that this is sub-optimal, as such pages are impossible to detect
> (PageReserved is just not clear as discussed with Andrea). The basic
> question is how we want to proceed:
>
> a) Make sure any online struct page has a v
On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
> pfn    physaddr    page->flags
> 500224 0x7a20 0x1fff1000 reserved True
> 500225 0x7a201000 0x1fff1000 reserved True
> *snip*
> 500245 0x7a215000 0x1fff1000 reserved True
&
On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> I think the very root cause is how e820__memblock_setup() registers
> memory with memblock:
>
> if (entry->type == E820_TYPE_SOFT_RESERVED)
> memblock_reserve(entry->addr, entry->size);
>
>
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >> Before that change, the memmap of memory holes were only zeroed
> >> out. So t
On Wed, Nov 25, 2020 at 04:13:25PM +0200, Mike Rapoport wrote:
> I suspect that memmap for the reserved pages is not properly initialized
> after recent changes in free_area_init(). They are cleared at
> init_unavailable_mem() to have zone=0 and node=0, but they seem to be
I'd really like if we wo
On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote:
> Yeah I guess it would be simpler if zoneid/nid was correct for
> pfn_valid() pfns within a zone's range, even if they are reserved due
> to not being really usable memory.
>
> I don't think we want to introduce CONFIG_HOLES_IN_Z
On Wed, Nov 25, 2020 at 12:41:55PM +0100, David Hildenbrand wrote:
> On 25.11.20 12:04, David Hildenbrand wrote:
> > On 25.11.20 11:39, Mel Gorman wrote:
> >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Something must have changed more recently than v5.1 that caused th
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Before that change, the memmap of memory holes were only zeroed
> out. So the zones/nid was 0, however, pages were not reserved and
> had a refcount of zero - resulting in other issues.
So maybe that "0,0" zoneid/nid was not actu
On Wed, Nov 25, 2020 at 10:30:53AM +0000, Mel Gorman wrote:
> On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> > Hello,
> >
> > On Tue, Nov 24, 2020 at 01:32:05PM +, Mel Gorman wrote:
> > > I would hope that is not the case because
Hello,
On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> > A corollary issue was fixed in
> > 39639000-39814fff : Unknown E820 type
> >
> > pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:
> >
From 262671e88723b3074251189004ceae39dcd1689d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Sat, 21 Nov 2020 12:55:58 -0500
Subject: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
pageblock_skip on reserved pages
A corollary issue was fixed in
e577c8b64d58fe307ea4d
On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
> + if (likely(!PageReserved(page)))
NOTE: this line will have to become "likely(page &&
!PageReserved(page))" to handle the case of non contiguous zones,
since pageblock_pfn
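In context, the guarded line in fast_isolate_around() would read roughly as below; the surrounding code is paraphrased from memory, not quoted from the patch:

/* Paraphrased context from fast_isolate_around(): never mark a
 * pageblock for skipping based on a reserved (non-zone) memmap page,
 * and tolerate a NULL page from non-contiguous zones. */
page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone);
if (likely(page && !PageReserved(page)))
        set_pageblock_skip(page);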
> PAGE_SHIFT;
I didn't try to inject the bug to validate the fix and it'd be great
if someone can try that to validate this or any other fix.
Andrea Arcangeli (1):
mm: compaction: avoid fast_isolate_around() to set pageblock_skip on
reserved pages
mm/compaction.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
e PageBuddy check, except in
the new fast_isolate_around() path).
Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a
migration target")
Signed-off-by: Andrea Arcangeli
---
mm/compaction.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/com