On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <sur...@google.com> wrote:
>
> On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> <andrii.nakry...@gmail.com> wrote:
> >
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <sur...@google.com> wrote:
> > >
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record the new mm_wr_seq counter. This change is designed to reduce
> >
> > This is probably a stupid question, but why do we need to take a lock
> > just to record this counter? uprobes get away without taking mmap_lock
> > even for reads, and still record this seq counter and then detect
> > whether there were any modifications in between. Why does this change
> > need the more heavy-weight mmap_read_lock to do speculative reads?
>
> Not a stupid question. mmap_read_lock() is used to wait for the writer
> to finish what it's doing, after which we continue by recording a new
> sequence counter value and calling mmap_read_unlock(). This is what
> get_vma_snapshot() does. But your question made me realize that we can
> optimize m_start() further by not taking mmap_read_lock at all.
> Instead of taking mmap_read_lock and then doing drop_mmap_lock() we can
> try mmap_lock_speculate_try_begin() and only if it fails do the same
> dance we do in get_vma_snapshot(). I think that should work.

Ok, yeah, it would be great to avoid taking a lock in the common case!
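Something like this, I'm guessing? (Completely untested sketch just to
spell out the idea; begin_vma_walk() is a made-up name for whatever
m_start() ends up calling.)

static int begin_vma_walk(struct proc_maps_private *priv)
{
        /* Common case: no writer in flight, no lock taken at all */
        if (mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq))
                return 0;

        /* Writer in progress: wait for it, then record a new sequence */
        if (mmap_read_lock_killable(priv->mm))
                return -EINTR;

        /* succeeds while mmap_read_lock is held, as in get_vma_snapshot() */
        mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
        mmap_read_unlock(priv->mm);

        return 0;
}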
> > >
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > > Note that this change has a userspace visible disadvantage: it allows
> > > for sub-page data tearing as opposed to the previous mechanism where
> > > data tearing could happen only between pages of generated output data.
> > > Since current userspace considers data tearing between pages to be
> > > acceptable, we assume it will be able to handle sub-page data tearing
> > > as well.
> > >
> > > Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> > > ---
> > >  fs/proc/internal.h        |   6 ++
> > >  fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
> > >  include/linux/mm_inline.h |  18 ++++
> > >  3 files changed, 177 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index 96122e91c645..6e1169c1f4df 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > >          struct task_struct *task;
> > >          struct mm_struct *mm;
> > >          struct vma_iterator iter;
> > > +        bool mmap_locked;
> > > +        loff_t last_pos;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +        unsigned int mm_wr_seq;
> > > +        struct vm_area_struct vma_copy;
> > > +#endif
> > >  #ifdef CONFIG_NUMA
> > >          struct mempolicy *task_mempolicy;
> > >  #endif
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > >  }
> > >  #endif
> > >
> > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > -                                           loff_t *ppos)
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static const struct seq_operations proc_pid_maps_op;
> > > +
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > +        struct vm_area_struct *copy = &priv->vma_copy;
> > > +        int ret = -EAGAIN;
> > > +
> > > +        memcpy(copy, vma, sizeof(*vma));
> > > +        if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > +                goto out;
> > > +
> > > +        if (!anon_vma_name_get_if_valid(copy))
> > > +                goto put_file;
> >
> > Given vm_file and anon_vma_name are both RCU-protected, if we take
> > rcu_read_lock() at m_start/m_next before fetching the VMA, we shouldn't
> > even need getting/putting them, no?
>
> Yeah, anon_vma_name indeed looks safe without pinning, but vm_file is
> using a SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer
> but pointing to a wrong object even if the rcu grace period did not
> pass. With my assumption that seq_file can't roll back once show_map()
> is done, I would need a completely stable vma at the time show_map()
> is executed so that it does not change from under us while we are
> generating the output.
> OTOH, if we indeed can roll back while generating seq_file output, then
> show_map() could output a potentially invalid vma, then check for vma
> changes and, when detected, roll back the seq_file and retry.
>
> >
> > I feel like I'm missing some important limitations, but I don't think
> > they are spelled out explicitly...
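Ah, the SLAB_TYPESAFE_BY_RCU recycling part is what I was missing, ok.
For the record, the hazard is the usual take-ref-then-recheck pattern;
a generic sketch with made-up names (obj, lookup, put_obj), not the
patch's actual code:

        rcu_read_lock();
        obj = lookup(key);
        if (obj && refcount_inc_not_zero(&obj->refcnt)) {
                /*
                 * The memory is still valid under RCU, but the slab may
                 * have recycled it for a new object: re-check identity.
                 */
                if (obj->key != key) {
                        put_obj(obj);    /* reused for someone else, retry */
                        obj = NULL;
                }
        }
        rcu_read_unlock();

get_file_rcu() does essentially this dance for struct file internally,
which is why the snapshot has to go through it instead of trusting the
copied vm_file pointer.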
> > >
> > > +
> > > +        if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > > +                return 0;
> > > +
> > > +        /* Address space got modified, vma might be stale. Re-lock and retry. */
> > > +        rcu_read_unlock();
> >
> > Another question I have is whether we really need to copy the vma into
> > priv->vma_copy to have a stable snapshot? Can't we just speculate like
> > with uprobes, under the assumption that the data doesn't change, and
> > once we are done producing text output, confirm that the speculation
> > was successful, and if not - retry?
> >
> > We'd need an API for seq_file to roll back to the previous known-good
> > location for that, but that seems straightforward enough to do, no?
> > Just remember the buffer position before speculation, write the data,
> > check that there were no mm modifications, and if something changed,
> > roll the seq file back to the last position.
>
> From looking at seq_read_iter() I think for a rollback we would have
> to reset seq_file.count and seq_file.index to their previous states.
> At this location:
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if
> show_map() returns a negative value m->count will indeed be rolled back,
> but not seq_file.index. Also, returning a negative value at
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230
> would be interpreted as a hard error... So, I'll need to spend some
> time in this code to get the answer about rollback.
> Thanks for the review!

Yeah, seq_file is a glorified wrapper around a memory buffer,
essentially. And at the lowest level, this transaction-like API would
basically just return seq->count before we start writing anything, and
rollback would just accept a new count to set seq->count to, if we need
to roll back.

Logistically this all needs to be factored into the whole seq_file
callbacks thing, of course, especially if the "transaction" will be
started in m_start/m_next while it can be "aborted" in m_show... So
that's what would need careful consideration. But you can end up with a
faster and cleaner implementation, as we discussed above, so it's worth
giving it a try, IMO.

> >
> > >
> > > +        ret = mmap_read_lock_killable(priv->mm);
> > > +        if (!ret) {
> > > +                /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > > +                mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > > +                mmap_read_unlock(priv->mm);
> > > +                ret = -EAGAIN;
> > > +        }
> > > +
> > > +        rcu_read_lock();
> > > +
> > > +        anon_vma_name_put_if_valid(copy);
> > > +put_file:
> > > +        if (copy->vm_file)
> > > +                fput(copy->vm_file);
> > > +out:
> > > +        return ret;
> > > +}
> > > +
> > > +static void put_vma_snapshot(struct proc_maps_private *priv)
> > > +{
> > > +        struct vm_area_struct *vma = &priv->vma_copy;
> > > +
> > > +        anon_vma_name_put_if_valid(vma);
> > > +        if (vma->vm_file)
> > > +                fput(vma->vm_file);
> > > +}
> > > +
> > > [...]
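P.S. To make the rollback idea above a bit more concrete, at the lowest
level I'm imagining something like this (completely untested sketch,
all names made up):

/* snapshot the current end of the seq_file buffer */
static inline size_t seq_buf_pos(struct seq_file *m)
{
        return m->count;
}

/* discard everything written after the snapshot */
static inline void seq_buf_rollback(struct seq_file *m, size_t pos)
{
        m->count = pos;
}

And then, conceptually, in m_show() (avoiding a negative return, since
as you point out that gets treated as a hard error):

        size_t pos = seq_buf_pos(m);

retry:
        show_map_vma(m, vma);
        if (mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
                /* concurrent mmap change, output may be torn: redo it */
                seq_buf_rollback(m, pos);
                vma = get_stable_vma(priv);    /* made up: re-fetch/re-validate */
                goto retry;
        }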