On Mon, 8 Sep 2025 16:09:50 -0700
Linus Torvalds <torva...@linux-foundation.org> wrote:


> > To compensate this, we could replace the path and build-id with a unique
> > identifier, (being an inode/device or hash, or whatever) to associate that
> > file. It may even work if it is unique per task. Then whenever one of these
> > identifiers were to show up representing a new file, it would be printed.  
> 
> So I really hate the inode number, because it's just wrong.

I only mentioned an identifier; it doesn't need to be the inode.

> 
> So if you do that
> 
>     inode = file_user_inode(vma->vm_file);

And if I do end up using an inode, I'll make sure to use that.


> And *none* of these issues would be true of somebody who uses the
> 'perf()' interface that can do all of this much more efficiently, and
> without the downsides, and without any artificially limited sysfs
> interfaces.

Note, there is no user space component running during the trace when
tracing with tracefs, whereas perf requires a user space tool to be
running along with what is being traced. The idea is to avoid having a
user space tracer affect what is being traced. Tracing is started when
needed, and when the anomaly is detected, tracing is stopped, and then the
tooling extracts the trace and post-processes it.

> 
> So that really makes me go: just don't expose this at all in sysfs
> files.  You *cannot* do a good job in sysfs, because the interface is
> strictly worse than just doing the proper job using perf.
> 
> Alternatively, just do the expensive thing. Expose the actual
> pathname, and expose the build ID. Yes, it's expensive, but dammit,
> that's the whole *point* of tracing in sysfs. sysfs was never about
> being efficient, it was about convenience.

Technically, it's "tracefs" and not "sysfs". When tracefs is configured,
sysfs will create a directory called /sys/kernel/tracing to allow user
space to mount tracefs there, but it is still a separate file system which
can be mounted outside of sysfs.

The code in tracefs is designed to be very efficient and tries very hard to
keep the overhead down. The tracefs ring buffer is still 2 or 3 times
faster than the perf buffer. It is optimized for tracing, and that's also
why it doesn't rely on a user space component: it's another way to keep
always-on tracing from affecting the system as much while tracing.

> 
> So if you trace through sysfs, you either don't get the full
> information that could be there, or pay the price for the expense of
> generating the full info.

But I will say the time to get the path name isn't an issue here. It's the
size of the path name being recorded into the ring buffer. The ring buffer's
size is limited, and a lot of compaction techniques are used to try to use
it efficiently.

As the stack trace only happens when the task goes back to user space, it's
not as time sensitive an event as, say, function tracing. Thus spending a
few more microseconds on something isn't going to be noticeable.

An 8 byte identifier for a file is much better than the path name, which
can be 40 bytes or more. In my example:

 /usr/lib/x86_64-linux-gnu/libselinux.so.1

is 41 bytes, 42 if you count the nul terminating byte.
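
To make the arithmetic concrete, here is a minimal user-space C sketch (not
kernel code) that checks the length above and shows what a fixed 8 byte
identifier would save per recorded entry. The helper name
bytes_saved_per_entry() is hypothetical and exists only for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Path taken from the example above. */
#define EXAMPLE_PATH "/usr/lib/x86_64-linux-gnu/libselinux.so.1"

/* sizeof() on a string literal counts the nul terminator: 41 + 1. */
static_assert(sizeof(EXAMPLE_PATH) == 42, "41 characters plus the nul byte");

/*
 * Hypothetical helper: bytes saved per recorded entry if a fixed 8 byte
 * identifier is stored instead of the nul-terminated path name.
 */
static size_t bytes_saved_per_entry(const char *path)
{
	size_t path_bytes = strlen(path) + 1;	/* include the nul byte */

	if (path_bytes <= sizeof(uint64_t))
		return 0;
	return path_bytes - sizeof(uint64_t);
}
```

For this path, that is 42 - 8 = 34 bytes saved for every entry that would
otherwise carry the full name.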

Now if we just hash the path name, that would be doable. Then when we see a
new name pop up, we can trigger another event to show the path name (and
perhaps even the build id). What's nice about triggering another event to
show the full path name, is that you can put that other event into another
buffer, to keep the path names from dropping stack traces, and vice versa.

I liked an idea you had in a previous email:
https://lore.kernel.org/all/CAHk-=wjgdktbaau10w04vtktrcgemzu+92sf1pw-tv-cfzo...@mail.gmail.com/

    You do it for the first time you see it, and every N times afterwards
    (maybe by simply using a counter array that is indexed by the low bits
    of the hash, and incrementing it for every hash you see, and if it was
    zero modulo N you do that "mmap reminder" thing).

Instead of an N counter, have a time expiry of say 200 milliseconds or so.
At the stack trace, look at the path name, hash it, and look the hash up in
a lookup table. If it's not there, trigger the other event to show the path
name, and save the hash and a timestamp into the lookup table. If it's in
the lookup table and the timestamp is more than 200 milliseconds old,
trigger the event again and reset the timestamp.
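
A rough sketch of that expiry scheme, written as plain user-space C rather
than kernel code. Everything here is an assumption made for illustration:
the names (path_table, path_needs_event), the table size, the direct-mapped
layout indexed by the low bits of the hash, and the use of 64-bit FNV-1a as
a stand-in hash. It is not what the tracing code does today.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

#define PATH_TABLE_SIZE  256u	/* assumed size, indexed by low hash bits */
#define PATH_EXPIRY_MS   200u	/* expiry suggested above */

static_assert((PATH_TABLE_SIZE & (PATH_TABLE_SIZE - 1)) == 0,
	      "table size must be a power of two for the mask below");

struct path_slot {
	uint64_t hash;		/* 8 byte id recorded in the stack trace */
	uint64_t stamp_ms;	/* last time the path name event was emitted */
	bool used;
};

struct path_table {
	struct path_slot slots[PATH_TABLE_SIZE];
};

/* 64-bit FNV-1a, used here only as a stand-in for whatever hash is chosen. */
static uint64_t path_hash(const char *path)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (; *path; path++) {
		h ^= (unsigned char)*path;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Stores the identifier for the stack trace in *id and returns true when
 * the caller should also emit the (separate) path name event: the path is
 * new to its slot, or its last event is more than PATH_EXPIRY_MS old.
 * A different path landing in the same slot simply evicts the old one,
 * which costs an extra path name event but never a wrong one.
 */
static bool path_needs_event(struct path_table *t, const char *path,
			     uint64_t now_ms, uint64_t *id)
{
	uint64_t h = path_hash(path);
	struct path_slot *s = &t->slots[h & (PATH_TABLE_SIZE - 1)];

	*id = h;
	if (s->used && s->hash == h && now_ms - s->stamp_ms <= PATH_EXPIRY_MS)
		return false;	/* seen recently, the id alone is enough */

	s->hash = h;
	s->stamp_ms = now_ms;
	s->used = true;
	return true;
}
```

With a zero-initialized table, the first call for a path returns true,
calls within the next 200 milliseconds return false, and the first call
after that returns true again and restarts the window.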

The idea is only to remove duplicates, and move the longer names into a
separate buffer.

Recording full path names in every stack trace would fill the ring buffer
much faster and lose a lot more stack traces, making the information much
harder to use.

> 
> Make the "give me the expensive output" be a dynamic flag, so that you
> don't do it by default, but if you have some model where you are
> scripting things with shell-script rather than doing 'perf record', at
> least you get good output.

Note, it's not a shell script. We do have tooling, it's just that nothing
runs while the trace is happening. The tooling starts the trace and exits.
When the issue is discovered, tracing is stopped, and the tooling will then
extract the trace and process it. If a crash occurs, the persistent ring
buffer can be extracted to get the data.

We will use this in the field. That is, on Chromebooks where people have
opted in to allow analysis of their machines. If an anomaly is detected
across thousands of users, we can start tracing and then extract the
traces to debug what is happening on their machines. We want to make sure
we get enough stack traces to go back far enough to where the issue
first occurred. That is why we want to keep the traces small and compact.

-- Steve

