Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace

Steven Rostedt Mon, 08 Sep 2025 14:52:04 -0700

On Sat, 30 Aug 2025 12:03:53 -0700
Linus Torvalds <torva...@linux-foundation.org> wrote:

> On Sat, 30 Aug 2025 at 11:31, Steven Rostedt <rost...@goodmis.org> wrote:
> >
> > If we are going to rely on mmap, then we might as well get rid of the
> > vma_lookup() altogether. The mmap event will have the mapping of the
> > file to the actual virtual address.  
> 
> It actually won't - not unless you also track every mremap etc.
> 
> Which is certainly doable, but I'd argue that it's a lot of complexity.
> 
> All you really want is an ID for the file mapping, and yes, I agree
> that it's very very annoying that we don't have anything that can then
> be correlated to user space any other way than also having a stage
> that tracks mmap.
> 
> I've slept on it and tried to come up with something, and I can't. As
> mentioned, the inode->i_ino isn't actually exposed to user space as
> such at all for some common filesystems, so while it's very
> traditional, it really doesn't actually work. It's also almost
> impossible to turn into a path, which is what you often would want for
> many cases.
> 
> That said, having slept on it, I'm starting to come around to the
> inode number model, not because I think it's a good model - it really
> isn't - but because it's a very historical mistake.
> 
> And in particular, it's the same mistake we made in /proc/<xyz>/maps.
> 
> So I think it's very very wrong, but it does have the advantage that
> it's a number that we already do export.
> 
> But the inode we expose that way isn't actually the
> 'vma->vm_file->f_inode' as you'd think, it's actually
> 
>         inode = file_user_inode(vma->vm_file);
> 
> which is subtly different for the backing inode case (ie overlayfs).
> 
> Oh, how I dislike that thing, but using the same thing as
> /proc/<xyz>/maps does avoid some problems.
> 

Sorry for the late reply. I left for the Tracing Summit the following
Monday, and when I got back home on Thursday, I came down with a nasty cold
that prevented me from thinking about any of this.

I just re-read the entire thread, and I'm still not sure where to go with
this. Thus, let me start with what I'm trying to accomplish, and even add
one example of a real world use case we would like to have.

Several times we find issues with futexes causing applications to either
lock up or cause long latency. Since a futex is mostly managed in user
space, it's good to be able to at least have a backtrace of where a
contended futex occurs. Thus we start tracing the futex system call and
triggering a user space backtrace on each one. This information can
help us figure out where the futex contention lies. This is just one use
case; we do have others.

Ideally, the user space stack trace should look like:

   futex_requeue-1044    [002] .....   168.761423: <user stack unwind>
cookie=31500000003
 =>  <000000000009a9ee> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}
 =>  <0000000000001472> : path=/work/c/futex_requeue build_id={0xc02417ea,0x1f4e0143,0x338cf27d,0x506a7a5d,0x7884d090}
 =>  <0000000000092b7b> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}

The above shows the call stack (as offsets into the file), the path to the
file, and the build-id of that file, so that the tooling can verify that the
path still refers to the same library/executable as when the trace occurred.

Note, the build-id isn't really necessary for my own use case, because the
applications seldom change on a chromebook. I added it as it appears to be
useful for others I've talked to that would like to use this.

But printing a copy of the full path name and build-id at every stack trace
is expensive. The path lookup may not be so bad, but the space on the ring
buffer is.

To compensate for this, we could replace the path and build-id with a unique
identifier (an inode/device pair, a hash, or whatever) associated with that
file. It may even work if it is only unique per task. Then whenever one of
these identifiers shows up representing a new file, the mapping of the
identifier to the path and build-id would be printed.

We could also monitor events so that if a file is deleted, renamed, or
whatever, and a new file with the same name comes around, the identifier
with the path and build-id gets printed for the new file.

The output would instead be:

             sed-1037    [007] ...1.   167.362583: file_map: hash=0x51eff94b path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}
[..]
   futex_requeue-1042    [007] ...1.   168.754128: file_map: hash=0xad2c6f1b path=/work/c/futex_requeue build_id={0xc02417ea,0x1f4e0143,0x338cf27d,0x506a7a5d,0x7884d090}
[..]
   futex_requeue-1042    [007] .....   168.757912: <user stack unwind>
cookie=34900000008
 =>  <00000000001001ca> : 0x51eff94b
 =>  <000000000000173c> : 0xad2c6f1b
 =>  <0000000000029ca8> : 0x51eff94b
[.. repeats several more traces without having to save the path names again ..]
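
As a rough userspace sketch of the "print the mapping only the first time an
identifier shows up" idea above (this is not the patch itself): the table
size, the FNV-1a hash over device and inode, and the names file_id(),
need_file_map() and forget_file() are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit on how many identifiers this sketch remembers. */
#define SEEN_SLOTS 64

struct seen_table {
	uint32_t ids[SEEN_SLOTS];
	size_t   count;
};

/*
 * FNV-1a over the device and inode numbers. This only stands in for the
 * "inode/device or hash, or whatever" identifier; any stable ID would do.
 */
uint32_t file_id(uint32_t dev_major, uint32_t dev_minor, uint64_t ino)
{
	uint64_t words[3] = { dev_major, dev_minor, ino };
	const unsigned char *p = (const unsigned char *)words;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < sizeof(words); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Returns true when a file_map record (id, path, build-id) should be
 * emitted because this is the first sighting of the identifier, and false
 * when the short identifier alone is enough.
 */
bool need_file_map(struct seen_table *t, uint32_t id)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->count; i++)
		if (t->ids[i] == id)
			return false;

	/* A full table just means the record is emitted again: redundant but safe. */
	if (t->count < SEEN_SLOTS)
		t->ids[t->count++] = id;
	return true;
}

/* Forget an identifier, e.g. after the file was deleted or renamed. */
void forget_file(struct seen_table *t, uint32_t id)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->count; i++) {
		if (t->ids[i] == id) {
			t->ids[i] = t->ids[--t->count];
			return;
		}
	}
}
```

With this, every stack frame only carries the 32-bit identifier, and calling
forget_file() on a delete/rename makes the next sighting print the path and
build-id again.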

It comes down to this: when do we print these mappings?

I noticed that uprobes has hooks for all the mappings in the vma code as it
needs to keep track of them. We could change those hooks to tracepoints,
and have both uprobes and tracing monitor the changes, so that when a new
mapping happens, it gets traced. Changing them to tracepoints may be useful
anyway, as it would then turn them into static branches instead of a normal
"if" statement.

We could even add a file to tracefs that would trigger the dump of all
files that are mapped executable for all currently running tasks. Then when
tracing starts, it would trigger the "show all currently running task
mappings" and then only do the mappings on demand. This way, the tracer
would get the mappings of the identifier (or hash, or whatever) to the
files and build-ids at the start of tracing, as well as get any of the
mappings when they happen later on.
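
For illustration only, a userspace approximation of finding the files that
are mapped executable is to parse the lines of /proc/<pid>/maps, which
already exports the device major:minor and the inode. The struct and function
names below are hypothetical, and this is not the tracefs interface described
above, just a sketch of the information such a dump would carry.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one executable, file-backed mapping. */
struct exec_mapping {
	unsigned long long start, end, offset;
	unsigned int       dev_major, dev_minor;
	unsigned long long inode;
	char               path[256];
};

/*
 * Parse one line of /proc/<pid>/maps, which has the form:
 *   start-end perms offset major:minor inode   pathname
 * Returns true only for executable mappings backed by a real file.
 */
bool parse_exec_mapping(const char *line, struct exec_mapping *m)
{
	char perms[5];
	int n;

	m->path[0] = '\0';
	n = sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %255[^\n]",
		   &m->start, &m->end, perms, &m->offset,
		   &m->dev_major, &m->dev_minor, &m->inode, m->path);
	if (n < 7)
		return false;

	/* perms looks like "r-xp"; index 2 is the execute bit. */
	if (strlen(perms) != 4 || perms[2] != 'x')
		return false;

	/* Anonymous mappings have inode 0; pseudo entries look like "[vdso]". */
	if (m->inode == 0 || m->path[0] != '/')
		return false;

	return true;
}
```

Note that the path is whatever the kernel printed, so a removed file keeps its
" (deleted)" suffix, and the inode here is the same one that shows up in
/proc/<pid>/maps, with all the caveats mentioned earlier in the thread.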

This should have enough information for the post processing to put the
stack traces back to what is ideal in the first place. That is, the tooling
could output:

   futex_requeue-1044    [002] .....   168.761423: <user stack unwind>
cookie=31500000003
 =>  <000000000009a9ee> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}
 =>  <0000000000001472> : path=/work/c/futex_requeue build_id={0xc02417ea,0x1f4e0143,0x338cf27d,0x506a7a5d,0x7884d090}
 =>  <0000000000092b7b> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}

and hide the identifier that was used in the ring buffer.
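
A minimal sketch of that post-processing step, assuming the tool has collected
the file_map events into an array in the order they were recorded. The struct
layout and resolve_hash() are assumptions for illustration, not an existing
trace-cmd or libtraceevent API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one file_map event from the trace. */
struct file_map {
	uint32_t    hash;
	const char *path;
	uint32_t    build_id[5];
};

/*
 * Return the most recent file_map entry recorded for the given hash, or
 * NULL if the trace never carried a mapping for it. "maps" holds the
 * first "n" file_map events in the order they appeared in the ring buffer.
 */
const struct file_map *resolve_hash(const struct file_map *maps, size_t n,
				    uint32_t hash)
{
	assert(maps != NULL || n == 0);

	/* Scan backwards so a later file_map event overrides an earlier one. */
	for (size_t i = n; i-- > 0; )
		if (maps[i].hash == hash)
			return &maps[i];

	return NULL;
}
```

A real tool would only consider file_map events with a timestamp earlier than
the stack trace being resolved; passing a smaller "n" models that here. When
the lookup returns NULL, the tool can fall back to printing the raw identifier.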

-- Steve
