On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <ja...@google.com> wrote: > > Hi! > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > forgot all about that...)
Does this fact affect your previous comments? Just want to make sure I'm not missing something... > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <sur...@google.com> wrote: > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <ja...@google.com> wrote: > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <sur...@google.com> > > > wrote: > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > can change from under us, therefore we make a copy of the vma and we > > > > pin pointer fields used when generating the output (currently only > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > space modifications, wait for them to end and retry. While we take > > > > the mmap_lock for reading during such contention, we do that momentarily > > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > > > (often a low priority task, such as monitoring/data collection services) > > > > from blocking address space updates. > > > [...] > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > [...] > > > > +/* > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > > > + * show_map_vma. > > > > + */ > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > > vm_area_struct *vma) > > > > +{ > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > + int ret = -EAGAIN; > > > > + > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > + goto out; > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > pointer which always points to a file with a non-zero refcount (except > > > when it is NULL). That's why it takes a file** instead of a file* - if > > > it observes a zero refcount, it assumes that the pointer must have > > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > > pointer that points to a file with zero refcount, which I think can > > > happen with this patch, will cause an endless loop. > > > (Just as background: For other usecases, get_file_rcu() is supposed to > > > still behave nicely and not spuriously return NULL when the file* is > > > concurrently updated to point to another file*; that's what that loop > > > is for.) > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > > checking the return value of get_file_rcu() and retrying speculation > > if it changed. > > I think you could probably still end up looping endlessly in get_file_rcu(). By "retrying speculation" I meant it in the sense of get_vma_snapshot() retry when it takes the mmap_read_lock and then does mmap_lock_speculate_try_begin to restart speculation. I'm also thinking about Liam's concern of guaranteeing forward progress for the reader. Thinking maybe I should not drop mmap_read_lock immediately on contention but generate some output (one vma or one page worth of vmas) before dropping mmap_read_lock and proceeding with speculation. > > > > (If my understanding is correct, maybe we should document that more > > > explicitly...) > > > > Good point. I'll add comments for get_file_rcu() as a separate patch. > > > > > > > > Also, I think you are introducing an implicit assumption that > > > remove_vma() does not NULL out the ->vm_file pointer (because that > > > could cause tearing and could theoretically lead to a torn pointer > > > being accessed here). > > > > > > One alternative might be to change the paths that drop references to > > > vma->vm_file (search for vma_close to find them) such that they first > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > > > maybe with a new helper like this: > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > { > > > struct file *file = vma->vm_file; > > > > > > if (file) { > > > WRITE_ONCE(vma->vm_file, NULL); > > > fput(file); > > > } > > > } > > > > > > Then on the lockless lookup path you could use get_file_rcu() on the > > > ->vm_file pointer _of the original VMA_, and store the returned file* > > > into copy->vm_file. > > > > Ack. Except for storing the return value of get_file_rcu(). I think > > once we detect that get_file_rcu() returns a different file we should > > bail out and retry. The change in file is an indication that the vma > > got changed from under us, so whatever we have is stale. > > What does "different file" mean here - what file* would you compare > the returned one against? Inside get_vma_snapshot() I would pass the original vma->vm_file to get_file_rcu() and check if it returns the same value. If the value got changed we jump to /* Address space got modified, vma might be stale. Re-lock and retry. */ section. That should work, right?