On Mon, 8 Sep 2025 16:09:50 -0700 Linus Torvalds <torva...@linux-foundation.org> wrote:
> > To compensate this, we could replace the path and build-id with a unique
> > identifier, (being an inode/device or hash, or whatever) to associate that
> > file. It may even work if it is unique per task. Then whenever one of these
> > identifiers were to show up representing a new file, it would be printed.
>
> So I really hate the inode number, because it's just wrong.

I just mentioned an identifier, it didn't need to be the inode.

>
> So if you do that
>
>     inode = file_user_inode(vma->vm_file);

And if I do end up using an inode, I'll make sure to use that.

> And *none* of these issues would be true of somebody who uses the
> 'perf()' interface that can do all of this much more efficiently, and
> without the downsides, and without any artificially limited sysfs
> interfaces.

Note, there is no user space component running during the trace when
tracing with tracefs, whereas perf requires a user space tool to be running
along with what is being traced. The idea is not to affect what is being
traced by a user space tracer. Tracing is started when needed, and when the
anomaly is detected, tracing is stopped, and then the tooling extracts the
trace and post processes it.

>
> So that really makes me go: just don't expose this at all in sysfs
> files. You *cannot* do a good job in sysfs, because the interface is
> strictly worse than just doing the proper job using perf.
>
> Alternatively, just do the expensive thing. Expose the actual
> pathname, and expose the build ID. Yes, it's expensive, but dammit,
> that's the whole *point* of tracing in sysfs. sysfs was never about
> being efficient, it was about convenience.

Technically, it's "tracefs" and not "sysfs". When tracefs is configured,
sysfs will create a directory called /sys/kernel/tracing to allow user
space to mount tracefs there, but it is still a separate file system which
can be mounted outside of sysfs.

The code in tracefs is designed to be very efficient and tries very hard to
keep the overhead down.
The tracefs ring buffer is still 2 or 3 times faster than the perf buffer.
It is optimized for tracing, and that's even why it doesn't rely on a user
space component, as it's another way to allow always-on-tracing to not
affect the system as much while tracing.

>
> So if you trace through sysfs, you either don't get the full
> information that could be there, or pay the price for the expense of
> generating the full info.

But I will say the time to get the path name isn't an issue here. It's the
size of the path name being recorded into the ring buffer. The ring
buffer's size is limited, and a lot of compaction techniques are used to
try to use it efficiently.

As the stack trace only happens when the task goes back to user space, it's
not as time sensitive an event as, say, function tracing. Thus spending a
few more microseconds on something isn't going to cause much of a notice.

An 8 byte identifier for a file is much better than the path name where it
can be 40 bytes or more. In my example:

  /usr/lib/x86_64-linux-gnu/libselinux.so.1

is 41 bytes, 42 if you count the nul terminating byte.

Now if we just hash the path name, that would be doable. Then when we see a
new name pop up, we can trigger another event to show the path name (and
perhaps even the build id).

What's nice about triggering another event to show the full path name is
that you can put that other event into another buffer, to keep the path
names from dropping stack traces, and vice versa.

I liked an idea you had in a previous email:

  https://lore.kernel.org/all/CAHk-=wjgdktbaau10w04vtktrcgemzu+92sf1pw-tv-cfzo...@mail.gmail.com/

    You do it for the first time you see it, and every N times afterwards
    (maybe by simply using a counter array that is indexed by the low bits
    of the hash, and incrementing it for every hash you see, and if it was
    zero modulo N you do that "mmap reminder" thing).

Instead of an N counter, have a time expiry of say 200 milliseconds or so.
At the stack trace, look at the path name, hash it, look it up in a look up
table, and if it's not there, trigger the other event to show the path name
and save it and a timestamp into the look up table. If it's in the look up
table, and the timestamp is more than 200 milliseconds old, trigger it
again and reset the timestamp.

The idea is only to remove duplicates, and move the longer names into a
separate buffer. Recording full path names in every stack trace will make
it much harder to use the information as it will lose a lot more stack
traces.

>
> Make the "give me the expensive output" be a dynamic flag, so that you
> don't do it by default, but if you have some model where you are
> scripting things with shell-script rather than doing 'perf record', at
> least you get good output.

Note, it's not a shell script. We do have tooling, it's just that nothing
runs while the trace is happening. The tooling starts the trace and exits.
When the issue is discovered, tracing is stopped, and the tooling will then
extract the trace and process it. If a crash occurs, the persistent ring
buffer can be extracted to get the data.

We will use this in the field. That is, on Chromebooks where people have
opted in to allow analysis of their machines. If there's an anomaly
detected in thousands of users, we can start tracing and then extract the
traces to debug what is happening on their machines. We want to make sure
we get enough stack traces that will go back far enough to where the issue
first occurred. Hence why we want to keep the traces small and compact.

-- Steve
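
To make the expiry logic above concrete, here is a rough user-space sketch
of the look up table. It is not kernel code and not an actual
implementation: the FNV-1a hash, the direct-mapped 256-slot table, and the
names path_table, path_hash() and path_needs_event() are all assumptions
made up for illustration. Only the 8-byte identifier and the 200
millisecond expiry come from the discussion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PATH_TABLE_BITS  8
#define PATH_TABLE_SIZE  (1u << PATH_TABLE_BITS)
#define PATH_EXPIRY_NS   (200ull * 1000 * 1000)  /* 200 milliseconds */

/* One slot per low-bits-of-hash bucket (hypothetical layout). */
struct path_slot {
	uint64_t hash;      /* 8-byte identifier stored in the stack trace */
	uint64_t stamp_ns;  /* when the path name event was last emitted */
	bool     used;
};

struct path_table {
	struct path_slot slots[PATH_TABLE_SIZE];
};

/* FNV-1a, standing in for whatever hash would really be used. */
static uint64_t path_hash(const char *path)
{
	uint64_t h = 0xcbf29ce484222325ull;

	for (; *path; path++) {
		h ^= (unsigned char)*path;
		h *= 0x100000001b3ull;
	}
	return h;
}

/*
 * Store the identifier for @path in @id and return true if the caller
 * should also emit the separate "path name" event: the first time the
 * hash is seen, or when the last emit is more than PATH_EXPIRY_NS old.
 */
static bool path_needs_event(struct path_table *t, const char *path,
			     uint64_t now_ns, uint64_t *id)
{
	struct path_slot *s;

	*id = path_hash(path);
	s = &t->slots[*id & (PATH_TABLE_SIZE - 1)];

	if (s->used && s->hash == *id &&
	    now_ns - s->stamp_ns <= PATH_EXPIRY_NS)
		return false;  /* seen recently: the 8-byte id is enough */

	/* New, expired, or a colliding slot: emit again and remember it. */
	s->hash = *id;
	s->stamp_ns = now_ns;
	s->used = true;
	return true;
}
```

In this sketch a collision in the table simply overwrites the slot, so the
worst case is an extra path name event, never a missing one. The stack
trace itself would only carry the 8-byte id either way.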