Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On 12/10/24 16:42, Serhei Makarov wrote: > This email sketches an 'unwinder cache' interface for libdwfl, derived from > recent eu-stacktrace code [1] and based on Christian Hergert's adaptation of > eu-stacktrace as sysprof-live-unwinder [2]. The intent is to remove the need > for a lot of boilerplate code that would be identical among profiling tools > that use libdwfl to unwind perf_events stack samples. But since this becomes > a library design, it's in need of feedback and bikeshedding. > > [1]: > https://sourceware.org/cgit/elfutils/tree/README.eu-stacktrace?h=users/serhei/eu-stacktrace > [2]: https://gitlab.gnome.org/GNOME/sysprof/-/merge_requests/110 > > Here's my initial take on what to implement, keeping things as simple as > possible. I'll follow up with more emails if or when my understanding of the > best design evolves during further prototyping. > > (1) Suggested interface for a new struct Dwfl_Process_Table: > > Dwfl_Process_Table *dwfl_process_table_begin (const Dwfl_Callbacks > *callbacks); > Dwfl *dwfl_process_table_find_pid (Dwfl_Process_Table *dwfltab, pid_t pid, > bool > create_if_missing); > > Dwfl, although documented in the header as a "session with the library", > actually supports reporting/attaching one process at a time. Its real meaning > is therefore "elfutils data for working with a particular Dwfl_Process". > > So, for global profiling, eu-stacktrace or the profiler end up creating a > table pid->Dwfl, with one Dwfl structure per process. I'd like elfutils to > provide this table, and in the implementation we could work on removing > redundant work between processes. Looking at Dwfl_Module, the relocations are > obviously specific to each process, but a lot of the Elf and Dwarf data is > potentially shareable between Dwfl instances. > > (2) Suggested interface for existing struct Dwfl: > > dwfl_perf_sample_getframes (dwfl, tid, >stack, stack_size, >regs, n_regs, abi, >perf_regs_mask, >int (*callback) > (Dwfl_Frame *state, void *arg), >void *arg); > /* could also implement dwfl_thread_perf_sample_getframes(thread, ... args as > above ...) */ > /* could package the args from stack .. perf_regs_mask into a struct > Dwfl_Stack_Sample? */ > > This implements the unwinding work that eu-stacktrace currently does on > behalf of the profiler. > > (3) Another thing the profiler needs to do is react to perf mmap events by > extending the Dwfl's representation of the process. Currently this is > implementable by calling dwfl_report_begin_add and imitating what > dwfl_linux_proc_maps_report does. The code to do this might be worth rolling > into libdwfl as a call along the lines of > > dwfl_report_proc_map (dwfl, filename, start, end, inode); Hi Serhei, The dwfl_report_proc_map() will handle when things are added to the memory map. Is there anything to handle the inverse event when memory is removed from the process memory map, such as a dlcose()? -Will > > though the exact details have to be worked out. I'm sure there's some > pitfalls here I haven't discovered yet. > > === > > * Why not the existing eu-stacktrace fifo interface? > > In practice, the fifo solution proves to be unsatisfying in terms of > perceived performance -- extra hoops through which the data must jump between > processes -- and maintainability -- the Sysprof binary format needs an > extension for representing stack data, with nontrivial details to bikeshed > there. The bikeshedding effort may as well be spent on designing a library > interface to the eu-stacktrace functionality. > > * Why do we need a Dwfl_Process_Table struct independent of Dwfl? > > As mentioned above, putting it in the library gives us a place to implement > caching of data/work that's redundant between different Dwfls. There's a > different way of designing this that would involve having Dwfl be the global > session struct and replacing the single Dwfl::process field with a table of > processes, but that requires redoing the existing library interface to be > parametrized by Dwfl_Process instead of just Dwfl, grossly disruptive > relative to the level of functionality I'm actually adding. We could also > have Dwfls share data through a hidden global variable, but it seems gross to > enable such sneaky caching without the user of the library explicitly > requesting it via Dwfl_Process_Table. > > All the best, > Serhei >
Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On Wed, Dec 11, 2024, at 9:46 AM, William Cohen wrote: > Hi Serhei, > > The dwfl_report_proc_map() will handle when things are added to the > memory map. Is there anything to handle the inverse event when memory > is removed from the process memory map, such as a dlcose()? > Interestingly, linux/perf/tools/design.txt mentions an 'munmap' field in struct perf_event_attr, but it doesn't seem to have made it into the actual kernel implementation of perf_events. I'll have to do some more digging to find out what happened with that idea between design and implementation. If we have reason to believe that the process mappings are changing too drastically to get sensible results with incremental updates, sysprof-live-unwinder's current quick solution (i.e. re-scan the proc mappings from scratch) is a feasible fallback. We just need a threshold for doing that which is 'sensible, but not too frequent'. Even a time threshold might be worth experimenting with for this. The main high-overhead case we're trying to avoid is when a process loads then does mmap, mmap, mmap, and the profiler is following this process from the start and reloading the Dwfl each time. -- Serhei
Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On vacation until 2025 so my replies will be sparse here :) On 12/11/24 3:24 AM, Milian Wolff wrote: For symmetry reasons, we also need an `..._end` method. And what about explicitly removing processes once they have finished (PERF_RECORD_EXIT)? One area you can run into issues with this is that there is no synchronization between Perf streams from different CPUs. So you could end up processing an EXIT from PID N on CPU 1 before a SAMPLE from PID N on CPU 2. -- Christian
Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On Mittwoch, 11. Dezember 2024 17:09:19 Mitteleuropäische Normalzeit Christian Hergert wrote: > On vacation until 2025 so my replies will be sparse here :) > > On 12/11/24 3:24 AM, Milian Wolff wrote: > > For symmetry reasons, we also need an `..._end` method. And what about > > explicitly removing processes once they have finished (PERF_RECORD_EXIT)? > > One area you can run into issues with this is that there is no > synchronization between Perf streams from different CPUs. So you could > end up processing an EXIT from PID N on CPU 1 before a SAMPLE from PID N > on CPU 2. I thought that the perf events are never ordered when you obtain them, you always first have to sort them based on their timestamps. For optimization purposes thankfully we have PERF_RECORD_FINISHED_ROUND so we don't need to buffer all events and sort them. That said, we do sometimes see violations of this, with some events arriving after a PERF_RECORD_FINISHED_ROUND with a timestamp earlier than what was previously handled in the last round. Is that what you have in mind, or are you thinking of something else that I am not aware of yet? Cheers -- Milian Wolff m...@milianw.de http://milianw.de signature.asc Description: This is a digitally signed message part.
Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On Mittwoch, 11. Dezember 2024 15:58:32 Mitteleuropäische Normalzeit Serhei Makarov wrote: > On Wed, Dec 11, 2024, at 9:46 AM, William Cohen wrote: > > Hi Serhei, > > > > The dwfl_report_proc_map() will handle when things are added to the > > memory map. Is there anything to handle the inverse event when memory > > is removed from the process memory map, such as a dlcose()? > > Interestingly, linux/perf/tools/design.txt mentions an 'munmap' field in > struct perf_event_attr, but it doesn't seem to have made it into the actual > kernel implementation of perf_events. I'll have to do some more digging to > find out what happened with that idea between design and implementation. > > If we have reason to believe that the process mappings are changing too > drastically to get sensible results with incremental updates, > sysprof-live-unwinder's current quick solution (i.e. re-scan the proc > mappings from scratch) is a feasible fallback. We just need a threshold for > doing that which is 'sensible, but not too frequent'. Even a time threshold > might be worth experimenting with for this. The main high-overhead case > we're trying to avoid is when a process loads then does mmap, mmap, mmap, > and the profiler is following this process from the start and reloading the > Dwfl each time. The way we handle this in perfparser is by not actually reporting anything until the time when we actually need it. I.e. we store the mmap events from perf in a way that we can use for fast lookups. Then during unwinding we ensure that the IP for whatever we are about to access is reported e.g. from the `memory_read` callback or for every IP we found before symbolication / inline frame resolution. When we receive more perf mmap events we check whether they invalidate old mappings (i.e. overlapping an existing region), we invalidate our caches,and unreport the modules from dwfl, and start the process anew. I don't think it's a good idea to report all modules all the time to elfutils - esp. in large applications with compressed debug information it can often be that you only need a fraction of the loaded shared symbols e.g. Also consider debuginfod - you don't want to download everything (such as GDB would be doing), you only want to download what's needed. Cheers -- Milian Wolff m...@milianw.de http://milianw.de signature.asc Description: This is a digitally signed message part.
Re: [RFC] sketch of an unwinder cache interface for elfutils libdwfl
On Dienstag, 10. Dezember 2024 22:42:25 Mitteleuropäische Normalzeit Serhei Makarov wrote: > This email sketches an 'unwinder cache' interface for libdwfl, derived from > recent eu-stacktrace code [1] and based on Christian Hergert's adaptation > of eu-stacktrace as sysprof-live-unwinder [2]. The intent is to remove the > need for a lot of boilerplate code that would be identical among profiling > tools that use libdwfl to unwind perf_events stack samples. But since this > becomes a library design, it's in need of feedback and bikeshedding. Thank you all for this initiative, it's much appreciated. Once you have an API available, I would be willing to test it in the scope of perfparser/hotspot and report back how it performs compared to what we have build there over the years. Some more notes inline below. > [1]: > https://sourceware.org/cgit/elfutils/tree/README.eu-stacktrace?h=users/serh > ei/eu-stacktrace [2]: > https://gitlab.gnome.org/GNOME/sysprof/-/merge_requests/110 > > Here's my initial take on what to implement, keeping things as simple as > possible. I'll follow up with more emails if or when my understanding of > the best design evolves during further prototyping. > > (1) Suggested interface for a new struct Dwfl_Process_Table: > > Dwfl_Process_Table *dwfl_process_table_begin (const Dwfl_Callbacks > *callbacks); > Dwfl *dwfl_process_table_find_pid (Dwfl_Process_Table > *dwfltab, pid_t pid, bool create_if_missing); For symmetry reasons, we also need an `..._end` method. And what about explicitly removing processes once they have finished (PERF_RECORD_EXIT)? > Dwfl, although documented in the header as a "session with the library", > actually supports reporting/attaching one process at a time. Its real > meaning is therefore "elfutils data for working with a particular > Dwfl_Process". > > So, for global profiling, eu-stacktrace or the profiler end up creating a > table pid->Dwfl, with one Dwfl structure per process. I'd like elfutils to > provide this table, and in the implementation we could work on removing > redundant work between processes. Looking at Dwfl_Module, the relocations > are obviously specific to each process, but a lot of the Elf and Dwarf data > is potentially shareable between Dwfl instances. I agree, this is crucial to have. Esp. when you profile multiple short-lived applications, if you encounter compressed debug information the current API design would force you to uncompress the data once per process which is very bad. Great that you are taking this into account right from the start! > (2) Suggested interface for existing struct Dwfl: > > dwfl_perf_sample_getframes (dwfl, tid, >stack, stack_size, >regs, n_regs, abi, >perf_regs_mask, >int (*callback) > (Dwfl_Frame *state, void *arg), void *arg); > /* could also implement dwfl_thread_perf_sample_getframes(thread, ... args > as above ...) */ /* could package the args from stack .. perf_regs_mask > into a struct Dwfl_Stack_Sample? */ > > This implements the unwinding work that eu-stacktrace currently does on > behalf of the profiler. What is the scope of this API - "just" the normal stack unwinding? While useful, this will leave a lot of complicated problems to solve - most notably for symbolication and inline frame resolution. Should we add easier-to-use API for those too, while at it? > (3) Another thing the profiler needs to do is react to perf mmap events by > extending the Dwfl's representation of the process. Currently this is > implementable by calling dwfl_report_begin_add and imitating what > dwfl_linux_proc_maps_report does. The code to do this might be worth > rolling into libdwfl as a call along the lines of > > dwfl_report_proc_map (dwfl, filename, start, end, inode); > > though the exact details have to be worked out. I'm sure there's some > pitfalls here I haven't discovered yet. Correct handling of mmap events is very hairy in practice - at least in my experience. In perfparser we have quite some code (cf. PerfElfMap) that tries to translate the raw mmap events into a format that is consumable by elfutils. Sometimes the regions are seemingly overlapping e.g. I would also request that this API does not deal with raw `start`, `end` values - because that's not what you get. You get `addr`, `len`, `pgoff` and then you must figure out how to correctly assemble these three values. Cheers > === > > * Why not the existing eu-stacktrace fifo interface? > > In practice, the fifo solution proves to be unsatisfying in terms of > perceived performance -- extra hoops through which the data must jump > between processes -- and maintainability -- the Sysprof binary format needs > an extension for representing stack data, with nontrivial details to > bikeshed there. The bikeshedding effo