On 7/10/25 19:02, Suren Baghdasaryan wrote:
> On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <sur...@google.com> wrote:
>>
>> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <sur...@google.com> wrote:
>> >
>> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <liam.howl...@oracle.com>
>> > wrote:
>> > >
>> > > * Suren Baghdasaryan <sur...@google.com> [250709 11:06]:
>> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vba...@suse.cz> wrote:
>> > > > >
>> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
>> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vba...@suse.cz>
>> > > > > > wrote:
>> > > > > >>
>> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
>> > > > > >> >>> + rcu_read_unlock();
>> > > > > >> >>> + vma = lock_vma_under_mmap_lock(mm, iter, address);
>> > > > > >> >>> + rcu_read_lock();
>> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse,
>> > > > > >> >> except when we lock under mmap lock.
>> > > > > >> > Correct.
>> > > > > >>
>> > > > > >> I wonder if it's really necessary? Can't it be done just inside
>> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted
>> > > > > >> above.
>> > > > > >>
>> > > > > >> Even if we later manage to extend this approach to smaps and
>> > > > > >> employ rcu locking to traverse the page tables, I'd think it's
>> > > > > >> best to separate and fine-grain the rcu lock usage for the vma
>> > > > > >> iterator and the page tables, if only to avoid too long a time
>> > > > > >> under the lock.
>> > > > > >
>> > > > > > I thought we would need to be in the same rcu read section while
>> > > > > > traversing the maple tree using vma_next(), but now looking at it,
>> > > > > > maybe we can indeed enter only while finding and locking the next
>> > > > > > vma...
>> > > > > > Liam, would that work? I see struct ma_state containing a node
>> > > > > > field. Can it be freed from under us if we find a vma, exit the
>> > > > > > rcu read section, then re-enter rcu and use the same iterator to
>> > > > > > find the next vma?
>> > > > >
>> > > > > If the rcu protection needs to be contiguous, and patch 8 avoids the
>> > > > > issue by always doing vma_iter_init() after rcu_read_lock() (but does
>> > > > > it really avoid the issue, or is it why we see the syzbot reports?),
>> > > > > then I guess in the code quoted above we also need a vma_iter_init()
>> > > > > after the rcu_read_lock(), because although the iterator was used
>> > > > > briefly under mmap_lock protection, that was then unlocked and there
>> > > > > can be a race before the rcu_read_lock().
>> > > >
>> > > > Quite true. So, let's wait for Liam's confirmation and based on his
>> > > > answer I'll change the patch by either reducing the rcu read section
>> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
>> > >
>> > > You need to either be under rcu or mmap lock to ensure the node in the
>> > > maple state hasn't been freed (and potentially, reallocated).
>> > >
>> > > So in this case, at the higher level, we can hold the rcu read lock for
>> > > a series of walks and avoid re-walking the tree, and the performance
>> > > would be better.
>> >
>> > Got it. Thanks for confirming!
>> >
>> > >
>> > > When we return to userspace, then we should drop the rcu read lock and
>> > > will need to vma_iter_set()/vma_iter_invalidate() on return. I thought
>> > > this was being done (through vma_iter_init()), but syzbot seems to
>> > > indicate a path that was missed?
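For reference, the fix being discussed for the fallback path would look
roughly like this (an untested sketch; lock_vma_under_mmap_lock() and the
iterator are the ones quoted from the series above):

	rcu_read_unlock();
	vma = lock_vma_under_mmap_lock(mm, iter, address);
	rcu_read_lock();
	/* The maple state may point to a node that was freed (and possibly
	 * reallocated) while we were outside the rcu read section, so
	 * revalidate it before the next walk. */
	vma_iter_init(iter, mm, address);	/* or vma_iter_set(iter, address) */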
>> >
>> > We do that in m_start()/m_stop() by calling
>> > lock_vma_range()/unlock_vma_range(), but I think I have two problems
>> > here:
>> > 1. As Vlastimil mentioned, I do not reset the iterator when falling
>> > back to mmap_lock and exiting and then re-entering the rcu read
>> > section;
>> > 2. I do not reset the iterator after exiting the rcu read section in
>> > m_stop() and re-entering it in m_start(), so the later call to
>> > lock_next_vma() might be using an iterator with a node that was freed
>> > (and possibly reallocated).
>> >
>> > >
>> > > This is the same thing that needed to be done previously with the mmap
>> > > lock, but now under the rcu lock.
>> > >
>> > > I'm not sure how to mitigate the issue with the page table, maybe we
>> > > guess on the number of vmas that we were doing for 4k blocks of output
>> > > and just drop/reacquire then. Probably a problem for another day
>> > > anyways.
>> > >
>> > > Also, I think you can change the vma_iter_init() to vma_iter_set(),
>> > > which is slightly less code under the hood. Vlastimil asked about this
>> > > and it's probably a better choice.
>> >
>> > Ack.
>> > I'll update my series with these fixes and all comments I received so
>> > far, will run the reproducers to confirm no issues and repost them
>> > later today.
>>
>> I have the patchset ready but would like to test it some more. Will
>> post it tomorrow.
>
> Ok, I found a couple of issues using the syzbot reproducer [1] (which
> is awesome BTW!):
> 1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
> after the last check, otherwise the lock is considered taken on
> vma->vm_refcnt overflow;
I think it's fine, because if the last check fails there's a
vma_refcount_put() that includes rwsem_release(), no?

> 2. query_matching_vma() is missing an unlock_vma() call when it does
> "goto next_vma;" and re-issues query_vma_find_by_addr(). The previous
> vma is left locked;
>
> [1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
> [2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
>
> After these fixes it's much harder to fail, but I still get one more
> error copied below. I will continue the investigation and will hold
> off reposting until this is fixed. That will be next week since I'll
> be out of town the rest of this week.
>
> Andrew, could you please remove this patchset from mm-unstable for now
> until I fix the issue and re-post the new version?

Andrew, can you do that please? We keep getting new syzbot reports.

> The error I got after these fixes is:

I suspect the root cause is that the ioctls are not serialized against
each other (probably not even against read()), and yet we treat
m->private as safe to work on. Now we have various fields that are
dangerous to race on - for example, races on locked_vma and iter would
explain a lot of this.

I suspect that as long as we used a purely seq_file workflow, it did the
right thing for us wrt serialization, but the ioctl addition violates
that. We should recheck even the code before this series to see whether
dangerous ioctl vs read() races are possible. And the ioctl
implementation should be refactored to use its own per-ioctl-call
private context, not the seq_file's per-file-open context.
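To make that last point concrete, I mean something in this direction (a
hypothetical, untested sketch - the struct and function names here are
only illustrative, not the actual fs/proc/task_mmu.c code):

	/* Per-call state, instead of reusing the seq_file's m->private. */
	struct procmap_query_ctx {
		struct mm_struct *mm;
		struct vma_iterator iter;
		struct vm_area_struct *locked_vma;
		bool mmap_locked;
	};

	static int procmap_do_query(struct mm_struct *mm, void __user *uarg)
	{
		/* Lives on the ioctl caller's stack, so two concurrent
		 * PROCMAP_QUERY calls, or a query racing with read(), never
		 * share the iterator/locked_vma state. */
		struct procmap_query_ctx ctx = { .mm = mm };

		vma_iter_init(&ctx.iter, mm, 0);
		/* ... find and lock the matching vma using only &ctx ... */
		return 0;
	}

That way the read()/seq_file side keeps its per-open context, and the
ioctl side carries everything it needs on its own stack.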