Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace

Linus Torvalds Mon, 08 Sep 2025 16:10:43 -0700

On Mon, 8 Sept 2025 at 14:42, Steven Rostedt <rost...@goodmis.org> wrote:
>
> I just re-read the entire thread, and I'm still not sure where to go with
> this.


So honestly, I don't know how to get where you want to get - or
whether it's even *possible* without horrible performance impact.

And no, we're not adding crap interfaces to mmap/munmap just for a
stupid sysfs tracing thing.

> Ideally, the user space stack trace should look like:
>
>    futex_requeue-1044    [002] .....   168.761423: <user stack unwind> cookie=31500000003
>  =>  <000000000009a9ee> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}
>  =>  <0000000000001472> : path=/work/c/futex_requeue build_id={0xc02417ea,0x1f4e0143,0x338cf27d,0x506a7a5d,0x7884d090}
>  =>  <0000000000092b7b> : path=/usr/lib/x86_64-linux-gnu/libselinux.so.1 build_id={0x3ba6e0c2,0xdd815e8,0xe1821a58,0xa5940cef,0x7c7bc5ab}

Yes. And I think that's what you should aim to generate. Not inode
numbers, because inode numbers are the wrong thing.

> Note, the build-id isn't really necessary for my own use case, because the
> applications seldom change on a chromebook. I added it as it appears to be
> useful for others I've talked to that would like to use this.

My personal suspicion is that in reality, the pathname is sufficient.
It's certainly a lot better than inode numbers are, in that the
pathname is meaningful even after-the-fact, and even on a different
machine etc. It's not some guaranteed match with some particular
library or executable version, no. But for some random one-time quick
scripting thing that uses sysfs, it's probably "good enough".

The build id is certainly very convenient too, but it's not *always*
convenient. And 99% of the time you could just look up the build id
from the path, even though obviously that wouldn't work across
machines and wouldn't work across system updates.
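
For illustration, here is a minimal user-space sketch of that "look up the build id from the path" step. It is not code from this thread or from the kernel: the function name `build_id_from_elf` is hypothetical, and it assumes a 64-bit little-endian ELF file whose build id lives in a PT_NOTE segment as an NT_GNU_BUILD_ID note. It returns the id as a plain hex string of the note bytes; mapping that onto the 32-bit words printed in the trace above is not attempted here.

```python
import struct
from typing import Optional

PT_NOTE = 4            # program header type for note segments
NT_GNU_BUILD_ID = 3    # note type used for the GNU build id


def build_id_from_elf(data: bytes) -> Optional[str]:
    """Return the GNU build id as hex, or None if it cannot be found.

    Sketch only: handles ELF64 little-endian images and nothing else.
    """
    if len(data) < 64 or data[:4] != b"\x7fELF":
        return None
    if data[4] != 2 or data[5] != 1:   # EI_CLASS == ELFCLASS64, EI_DATA == LSB
        return None

    hdr = struct.unpack_from("<16sHHIQQQIHHHHHH", data, 0)
    phoff, phentsize, phnum = hdr[5], hdr[9], hdr[10]

    for i in range(phnum):
        p_type, _, p_offset, _, _, p_filesz, _, _ = struct.unpack_from(
            "<IIQQQQQQ", data, phoff + i * phentsize)
        if p_type != PT_NOTE:
            continue
        pos, end = p_offset, p_offset + p_filesz
        while pos + 12 <= end:
            namesz, descsz, ntype = struct.unpack_from("<III", data, pos)
            name_off = pos + 12
            desc_off = name_off + ((namesz + 3) & ~3)   # name is 4-byte padded
            name = data[name_off:name_off + namesz]
            if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
                return data[desc_off:desc_off + descsz].hex()
            pos = desc_off + ((descsz + 3) & ~3)        # desc is 4-byte padded
    return None
```

A script would call it as `build_id_from_elf(open(path, "rb").read())`. As noted above, the result only describes whatever file is at that path right now, so it says nothing reliable after a system update or on another machine.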

> But printing a copy of the full path name and build-id at every stack trace
> is expensive. The path lookup may not be so bad, but the space on the ring
> buffer is.

So that's the thing. You can do it right, or you can do it wrong. I'd
personally tend to prefer the "expensive but right", and just make it
a trace-time option.

> To compensate this, we could replace the path and build-id with a unique
> identifier, (being an inode/device or hash, or whatever) to associate that
> file. It may even work if it is unique per task. Then whenever one of these
> identifiers were to show up representing a new file, it would be printed.

So I really hate the inode number, because it's just wrong.

You can't match it across machines, and to make things worse it's not
even *meaningful* over time or over machines - or to humans - so it's
strictly clearly objectively worse than the pathname.

But more importantly - even on the *local* machine - and at the moment
- it's actually wrong.

Exactly because the inode number you look up is *not* the user-visible
inode number from 'stat()'.

So it's *really* wrong to use the inode number. It's basically never
right. And never will be, even if you can make it appear useful in
some specific cases.

The *one* saving grace for the inode number is that *in*the*moment*
you can match it against /proc/<pid>/maps, because that /proc file has
that historical bug too (it wasn't buggy at the time that /proc file
was introduced, but our filesystems have become much more complex
since).

So if you do that

    inode = file_user_inode(vma->vm_file);

that I mentioned, at least the otherwise random inode numbers can be
matched to *something*.

That still doesn't fix the other issues with inode numbers, but it
means that at the time of the trace - and on the machine that the
tracing is done - you can now match that not-quite-real inode number
and device against another /proc file, and turn it into a pathname.
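
To make the extra user-space work concrete, here is a minimal sketch of that matching step. It is an illustration only: the function name `index_maps` and the inode numbers in the sample are made up, and it assumes the usual /proc/<pid>/maps line layout of "address perms offset dev inode pathname", with the device printed as hexadecimal major:minor.

```python
from typing import Dict, Tuple

Key = Tuple[int, int, int]   # (major, minor, inode)


def index_maps(maps_text: str) -> Dict[Key, str]:
    """Build a (major, minor, inode) -> pathname table from maps text."""
    table: Dict[Key, str] = {}
    for line in maps_text.splitlines():
        # Split into at most 6 fields so a pathname with spaces stays intact.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue                      # anonymous mapping, no pathname
        _addr, _perms, _offset, dev, inode, path = fields
        if inode == "0":
            continue                      # [heap], [stack], etc.
        major, minor = (int(part, 16) for part in dev.split(":"))
        # Keep the first pathname seen for a given device and inode.
        table.setdefault((major, minor, int(inode)), path)
    return table


# Sample text with invented inode numbers, for illustration only.
sample = (
    "55d0c0a00000-55d0c0a01000 r-xp 00001000 08:02 1311768 /work/c/futex_requeue\n"
    "7f1c2a000000-7f1c2a021000 rw-p 00000000 00:00 0 \n"
    "7f1c2a400000-7f1c2a41a000 r-xp 00006000 fd:01 42 /usr/lib/x86_64-linux-gnu/libselinux.so.1\n"
    "7ffd3b1e0000-7ffd3b201000 rw-p 00000000 00:00 0 [stack]\n"
)
paths = index_maps(sample)
print(paths.get((8, 2, 1311768)))
```

The table is only as good as the snapshot of the maps file it was built from: if the process has exited or unmapped the file by the time the script reads it, the lookup simply fails.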

But it's kind of sad to do that, when you could just do the pathname
in the trace directly, and not force the stupid interface in the first
place.

And honestly, at that point it's still not really *better* than the
pathname (and arguably much much worse, because you might not be able
to do the matching if you didn't catch the /proc/<pid>/maps file).

So the inode number - together with a lookup in /proc/<pid>/maps - is
generally about the same as just giving a path, but typically much
less convenient, and anybody using that interface would have to do
extra work in user space.

And *none* of these issues would be true of somebody who uses the
'perf()' interface that can do all of this much more efficiently, and
without the downsides, and without any artificially limited sysfs
interfaces.

So that really makes me go: just don't expose this at all in sysfs
files.  You *cannot* do a good job in sysfs, because the interface is
strictly worse than just doing the proper job using perf.

Alternatively, just do the expensive thing. Expose the actual
pathname, and expose the build ID. Yes, it's expensive, but dammit,
that's the whole *point* of tracing in sysfs. sysfs was never about
being efficient, it was about convenience.

So if you trace through sysfs, you either don't get the full
information that could be there, or pay the price for the expense of
generating the full info.

Make the "give me the expensive output" be a dynamic flag, so that you
don't do it by default, but if you have some model where you are
scripting things with shell-script rather than doing 'perf record', at
least you get good output.

Hmm?

           Linus
