On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <ja...@google.com> wrote: > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <sur...@google.com> wrote: > > With maple_tree supporting vma tree traversal under RCU and vma and > > its important members being RCU-safe, /proc/pid/maps can be read under > > RCU and without the need to read-lock mmap_lock. However vma content > > can change from under us, therefore we make a copy of the vma and we > > pin pointer fields used when generating the output (currently only > > vm_file and anon_name). Afterwards we check for concurrent address > > space modifications, wait for them to end and retry. While we take > > the mmap_lock for reading during such contention, we do that momentarily > > only to record new mm_wr_seq counter. This change is designed to reduce > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > (often a low priority task, such as monitoring/data collection services) > > from blocking address space updates. > [...] > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > [...] > > +/* > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > + * show_map_vma. > > + */ > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > vm_area_struct *vma) > > +{ > > + struct vm_area_struct *copy = &priv->vma_copy; > > + int ret = -EAGAIN; > > + > > + memcpy(copy, vma, sizeof(*vma)); > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > + goto out; > > I think this uses get_file_rcu() in a different way than intended. > > As I understand it, get_file_rcu() is supposed to be called on a > pointer which always points to a file with a non-zero refcount (except > when it is NULL). That's why it takes a file** instead of a file* - if > it observes a zero refcount, it assumes that the pointer must have > been updated in the meantime, and retries. Calling get_file_rcu() on a > pointer that points to a file with zero refcount, which I think can > happen with this patch, will cause an endless loop. > (Just as background: For other usecases, get_file_rcu() is supposed to > still behave nicely and not spuriously return NULL when the file* is > concurrently updated to point to another file*; that's what that loop > is for.)
Ah, I see. I wasn't aware of this subtlety. I think this is fixable by checking the return value of get_file_rcu() and retrying speculation if it changed. > (If my understanding is correct, maybe we should document that more > explicitly...) Good point. I'll add comments for get_file_rcu() as a separate patch. > > Also, I think you are introducing an implicit assumption that > remove_vma() does not NULL out the ->vm_file pointer (because that > could cause tearing and could theoretically lead to a torn pointer > being accessed here). > > One alternative might be to change the paths that drop references to > vma->vm_file (search for vma_close to find them) such that they first > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > maybe with a new helper like this: > > static void vma_fput(struct vm_area_struct *vma) > { > struct file *file = vma->vm_file; > > if (file) { > WRITE_ONCE(vma->vm_file, NULL); > fput(file); > } > } > > Then on the lockless lookup path you could use get_file_rcu() on the > ->vm_file pointer _of the original VMA_, and store the returned file* > into copy->vm_file. Ack. Except for storing the return value of get_file_rcu(). I think once we detect that get_file_rcu() returns a different file we should bail out and retry. The change in file is an indication that the vma got changed from under us, so whatever we have is stale. > > > + if (!anon_vma_name_get_if_valid(copy)) > > + goto put_file; > > + > > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > > + return 0; > > We only check for concurrent updates at this point, so up to here, > anything we read from "copy" could contain torn pointers (both because > memcpy() is not guaranteed to copy pointers atomically and because the > updates to the original VMA are not done with WRITE_ONCE()). > That probably means that something like the preceding > anon_vma_name_get_if_valid() could crash on an access to a torn > pointer. > Please either do another mmap_lock_speculate_retry() check directly > after the memcpy(), or ensure nothing before this point reads from > "copy". Ack. I'll add mmap_lock_speculate_retry() check right after memcpy(). > > > + /* Address space got modified, vma might be stale. Re-lock and > > retry. */ > > + rcu_read_unlock(); > > + ret = mmap_read_lock_killable(priv->mm); > > + if (!ret) { > > + /* mmap_lock_speculate_try_begin() succeeds when holding > > mmap_read_lock */ > > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); > > + mmap_read_unlock(priv->mm); > > + ret = -EAGAIN; > > + } > > + > > + rcu_read_lock(); > > + > > + anon_vma_name_put_if_valid(copy); > > +put_file: > > + if (copy->vm_file) > > + fput(copy->vm_file); > > +out: > > + return ret; > > +} > [...] > > @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma, > > } else { > > *path = file_user_path(vma->vm_file); > > } > > - return; > > + goto out; > > } > > > > if (vma->vm_ops && vma->vm_ops->name) { > > *name = vma->vm_ops->name(vma); > > This seems to me like a big, subtle change of semantics. After this > change, vm_ops->name() will no longer receive a real VMA; and in > particular, I think the .name implementation special_mapping_name used > in special_mapping_vmops will have a UAF because it relies on > vma->vm_private_data pointing to a live object. Ah, I see. IOW, vma->vm_private_data might change from under us and I don't detect that. Moreover this is just an example and .name() might depend on other things. > > I think you'll need to fall back to using the mmap lock and the real > VMA if you see a non-NULL vma->vm_ops->name pointer. Yeah, either that or obtain the name and make a copy during get_vma_snapshot() using original vma. Will need to check which way is better. > > > if (*name) > > - return; > > + goto out; > > } > > > > *name = arch_vma_name(vma); > > if (*name) > > - return; > > + goto out; > > > > if (!vma->vm_mm) { > > *name = "[vdso]"; > > - return; > > + goto out; > > } > > > > if (vma_is_initial_heap(vma)) { > > *name = "[heap]"; > > - return; > > + goto out; > > } > > > > if (vma_is_initial_stack(vma)) { > > *name = "[stack]"; > > - return; > > + goto out; > > } > > > > if (anon_name) { > > *name_fmt = "[anon:%s]"; > > *name = anon_name->name; > > - return; > > } > > +out: > > + if (anon_name && !mmap_locked) > > + anon_vma_name_put(anon_name); > > Isn't this refcount drop too early, causing UAF read? We drop the > reference on the anon_name here, but (on some paths) we're about to > return anon_name->name to the caller through *name, and the caller > will read from it. > > Ah, but I guess it's actually fine because the refcount increment was > unnecessary in the first place, because the vma pointer actually > points to a copy of the original VMA, and the copy has its own > refcounted reference to the anon_name thanks to get_vma_snapshot()? > It might be helpful to have some comments documenting which VMA > pointers can point to copies. If I follow Andrii's suggestion and avoid vma copying I'll need to implement careful handling of pointers and allow get_vma_name() to fail indicating that vma changed from under us. Let me see if this is still doable in the light of your findings. Thanks for the insightful review and welcome back!