On 17/11/2021 10:40, Florian Weimer wrote: > * Adhemerval Zanella via Libc-alpha: > >> However the code is somewhat complex and I would like to have some feedback >> if gcc will be willing to accept this change (I assume it would require >> this code merge on glibc beforehand). > > There's a long review queue on the GCC side due to the stage1 close. > It may still be considered for GCC 12. Jakub has also requested that > we hold off committing the glibc side until the GCC side is reviewed. > > I'll flesh out the commit message and NEWS entry once we have agreed > upon the interface. > >>> new file mode 100644 >>> index 0000000000..c7313c122d >>> --- /dev/null >>> +++ b/elf/dl-find_eh_frame.c > >>> +/* Data for the main executable. There is usually a large gap between >>> + the main executable and initially loaded shared objects. Record >>> + the main executable separately, to increase the chance that the >>> + range for the non-closeable mappings below covers only the shared >>> + objects (and not also the gap between main executable and shared >>> + objects). */ >>> +static uintptr_t _dl_eh_main_map_start attribute_relro; >>> +static struct dl_eh_frame_info _dl_eh_main_info attribute_relro; >>> + >>> +/* Data for initally loaded shared objects that cannot be unlaoded. >> >> s/initally/initially and s/unlaoded/unloaded. > > Fixed. > >> >>> + The mapping base addresses are stored in address order in the >>> + _dl_eh_nodelete_mappings_bases array (containing >>> + _dl_eh_nodelete_mappings_size elements). The EH data for a base >>> + address is stored in the parallel _dl_eh_nodelete_mappings_infos. >>> + These arrays are not modified after initialization. */ >>> +static uintptr_t _dl_eh_nodelete_mappings_end attribute_relro; >>> +static size_t _dl_eh_nodelete_mappings_size attribute_relro; >>> +static uintptr_t *_dl_eh_nodelete_mappings_bases attribute_relro; >>> +static struct dl_eh_frame_info *_dl_eh_nodelete_mappings_infos >>> + attribute_relro; >>> + >>> +/* Mappings created by dlopen can go away with dlclose, so a data >>> + dynamic data structure with some synchronization is needed. >> >> This sounds strange ("a data dynamic data"). > > I dropped the first data. > >> >>> + Individual segments are similar to the _dl_eh_nodelete_mappings >> >> Maybe use _dl_eh_nodelete_mappings_*, because '_dl_eh_nodelete_mappings' >> itself if not defined anywhere. > > Right. > >>> + Adding new elements to this data structure is another source of >>> + quadratic behavior for dlopen. If the other causes of quadratic >>> + behavior are eliminated, a more complicated data structure will be >>> + needed. */ >> >> This worries me, specially we have reports that python and other dynamic >> environments do use a lot of plugin and generates a lot of dlopen() calls. >> What kind of performance implication do you foresee here? > > The additional overhead is not disproportionate to the other sources of > quadratic behavior. With 1,000 dlopen'ed objects, overall run-time > seems to be comparable to the strcmp time required soname matching, for > example, and is quite difficult to measure. So we could fix the > performance regression if we used a hash table for that … > > It's just an undesirable complexity class. The implementation is not > actually slow because it's a mostly-linear copy (although a backwards > one). Other parts of dlopen involve pointer chasing and are much > slower. Right, I agree this should probably won't incur in performance issues, I was curious if you have any numbers about it. > >>> +/* Allocate an empty segment that is at least SIZE large. PREVIOUS */ >> >> What this PREVIOUS refer to? > > Oops, it's now: > > /* Allocate an empty segment that is at least SIZE large. PREVIOUS > points to the chain of previously allocated segments and can be > NULL. */ > >>> +/* Update the version to reflect that an update is happening. This >>> + does not change the bit that controls the active segment chain. >>> + Returns the index of the currently active segment chain. */ >>> +static inline unsigned int >>> +_dl_eh_mappings_begin_update (void) >>> +{ >>> + unsigned int v >>> + = __atomic_wide_counter_fetch_add_relaxed >>> (&_dl_eh_loaded_mappings_version, >>> + 2); >> >> Why use an 'unsigned int' for the wide counter here? > > Because … > >>> + /* Subsequent stores to the TM data must not be reordered before the >>> + store above with the version update. */ >>> + atomic_thread_fence_release (); >>> + return v & 1; >>> +} > > … we only need the lower bit. Ack, I guess it won't matter to compiler. > >>> + /* Other initially loaded objects. */ >>> + if (pc >= *_dl_eh_nodelete_mappings_bases >>> + && pc < _dl_eh_nodelete_mappings_end) >>> + { >>> + size_t idx = _dl_eh_find_lower_bound (pc, >>> + _dl_eh_nodelete_mappings_bases, >>> + _dl_eh_nodelete_mappings_size); >>> + const struct dl_eh_frame_info *info >>> + = _dl_eh_nodelete_mappings_infos + idx; >> >> Ins't a UB if idx is not a valid one? > > idx is always valid here. > >>> + bool match; >>> + if (idx < _dl_eh_nodelete_mappings_size >>> + && pc == _dl_eh_nodelete_mappings_bases[idx]) >>> + match = true; >>> + else >>> + { >>> + /* PC might be in the previous mapping. */ >>> + --idx; >>> + --info; >>> + match = pc - _dl_eh_nodelete_mappings_bases[idx] < info->size; >>> + } >> >> It is not clear to me why _dl_eh_find_lower_bound() can not handle this >> specific exception, since this is also used in the audit, dlopen case >> below. > > The struct layouts are different. Maybe I should use a segment > structure for _dl_eh_nodelete_mappings_bases as well? I am not sure, it is just that check does seems similar and it is used on a couple of places. Maybe you can consolidate it. > >>> + /* Handle audit modules, dlopen, dlopen objects. This uses software >>> + transactional memory, with a retry loop in case the version >>> + changes during execution. */ >>> + while (true) >>> + { >>> + retry: >>> + ; >>> + uint64_t start_version = _dl_eh_read_start_version (); > >>> + for (struct dl_eh_mappings_segment *seg >>> + = _dl_eh_mappings_active_segment (start_version); >>> + seg != NULL && seg->size > 0; seg = seg->previous) >>> + if (pc >= seg->bases[0]) >>> + { >>> + /* PC may lie within this segment. If it is less than the > >>> + if (match) >>> + { >>> + /* Found the right mapping. Copy out the data prior to >>> + checking if the read transaction was successful. */ > >>> + if (_dl_eh_read_success (start_version)) > >>> + else >>> + /* Read transaction failure. */ >>> + goto retry; >> >> Why not 'continue' here? > > We need to re-run the outer while loop, not the inner segment > iteration. Indeed. > >>> +/* _dl_eh_process_initial is called twice. First to compute the array >>> + sizes from the initial loaded mappings. Second to fill in the >>> + bases and infos arrays with the (still unsorted) data. Returns the >>> + number of loaded (non-nodelete) mappings. */ >> >> I usually see this called twice functionn confunsing because it not clear >> which is the default patch used for each call (although the comments do >> help it). >> >> Maybe you can add two function, one that calculates the initial size and >> another that actually populates it. It allows, for instance, to call >> the initial_map _dl_get_eh_frame() once (you will need to keep the >> struct dl_eh_frame_info between calls). > > I started out with this and found it more confusing because of the > complicated conditions (apart from the phase distinction). Right. > >>> +static void >>> +_dl_eh_frame_link_map_sort (struct link_map **loaded, size_t size) >>> +{ >>> + /* Selection sort based on map_start. */ >>> + if (size < 2) >>> + return; >>> + for (size_t i = 0; i < size - 1; ++i) >>> + { >>> + /* Find minimum. */ >>> + size_t min_idx = i; >>> + ElfW(Addr) min_val = loaded[i]->l_map_start; >>> + for (size_t j = i + 1; j < size; ++j) >>> + if (loaded[j]->l_map_start < min_val) >>> + { >>> + min_idx = j; >>> + min_val = loaded[j]->l_map_start; >>> + } >>> + >>> + /* Swap into place. */ >>> + struct link_map *tmp = loaded[min_idx]; >>> + loaded[min_idx] = loaded[i]; >>> + loaded[i] = tmp; >>> + } >>> +} >> >> Could this be a problem with programs with a lot of dependencies? > > I don't see it in profiles. It's quadratic for sure, but mostly linear > accesses, and the array fits into the L1 cache. We could optimize it > and check if the array is already sorted in reverse order, reflecting > the typical kernel mapping behavior. If this is not really a hotspot I would prefer the simplest code. > >>> +/* Invoked from _dl_eh_frame_frame after sorting. */ >>> +static bool >>> +_dl_find_eh_frame_update_1 (struct link_map **loaded, size_t count) >>> +{ >>> + int active_idx = _dl_eh_mappings_begin_update (); >>> + >>> + struct dl_eh_mappings_segment *current_seg >>> + = _dl_eh_loaded_mappings[active_idx]; >>> + size_t current_used = _dl_eh_mappings_segment_count_used (current_seg); >>> + >>> + struct dl_eh_mappings_segment *target_seg >>> + = _dl_eh_loaded_mappings[!active_idx]; >>> + size_t remaining_to_add = current_used + count; >>> + >>> + /* Ensure that the new segment chain has enough space. */ >>> + { >>> + size_t new_allocated >>> + = _dl_eh_mappings_segment_count_allocated (target_seg); >>> + if (new_allocated < remaining_to_add) >>> + { >>> + size_t more = remaining_to_add - new_allocated; >>> + target_seg = _dl_eh_mappings_segment_allocate (more, target_seg); >>> + if (target_seg == NULL) >>> + /* Out of memory. */ >>> + return false; >> >> Shouldn't it have a _dl_eh_mappings_end_update() here? > > No, we should keep the original version. I added: > > /* Out of memory. Do not end the update and keep the > current version unchanged. */ > > We could split the counter update and current version access, and start > the update only after the malloc, I think. This would make it less > likely that a concurrent thread sees multiple version updates, causing > multiple retries, I think. This sounds a nice improvement, although I think malloc failure would be unlikely on most systems with overcommit. > >>> +bool >>> +_dl_find_eh_frame_update (struct link_map *new_map) >>> +{ >>> + /* Copy the newly-loaded link maps into an array for sorting. */ >>> + size_t count = 0; >>> + for (struct link_map *l = new_map; l != NULL; l = l->l_next) >>> + count += !l->l_eh_frame_processed; >>> + struct link_map **map_array = malloc (count * sizeof (*map_array)); >> >> Maybe use a scratch_buffer here? It should cover 128 entries for 64-bit, > > This is only used for dlopen, and dlopen already has many malloc calls. > I don't think it matters. Ok. >>> diff --git a/elf/dl-find_eh_frame.h b/elf/dl-find_eh_frame.h >>> new file mode 100644 >>> index 0000000000..4bde9b14db >>> --- /dev/null >>> +++ b/elf/dl-find_eh_frame.h > >>> +/* Extract the exception handling data from a link map and writes it >>> + to *INFO. If no such data is available INFO->eh_frame will be >>> + NULL. */ >>> +static void __attribute__ ((unused)) >> >> Maybe use inline here and let compiler optimize? > > I don't think inlining is beneficial. > >> Or move to the be a proper function? > > It's also used from tests, so it's convenient to have it in a header. I usually don't like to resort extra attributes to avoid compiler warnings when we do have a standard solution. But I don't have a strong preference here. > >>> +/* This function is similar to _dl_find_eh_frame, but travers the link >>> + maps directly. It is used from audit modules before >>> + _dl_find_eh_frame_init has been called, and for testing. */ >>> +static void * >>> +_dl_find_eh_frame_slow (void *pc >>> +#if DL_FIND_EH_FRAME_DBASE >>> + , void **dbase >>> +#endif >>> + ) >>> +{ >>> + ElfW(Addr) addr = (ElfW(Addr)) pc; >>> + for (Lmid_t ns = 0; ns < GL(dl_nns); ++ns) >>> + for (struct link_map *l = GL(dl_ns)[ns]._ns_loaded; l != NULL; >>> + l = l->l_next) >>> + if (addr >= l->l_map_start && addr < l->l_map_end >>> + && (l->l_contiguous || _dl_addr_inside_object (l, addr))) >>> + { >>> + assert (ns == l->l_ns); >>> + struct dl_eh_frame_info info; >>> + _dl_get_eh_frame (l, &info); >>> +#if DL_FIND_EH_FRAME_DBASE >>> + *dbase = info.dbase; >>> +#endif >>> + return info.eh_frame; >>> + } >>> + >>> + /* Object not found. */ >>> + return NULL; >>> +} >> >> This is only used on dl-find-eh_frame.c, so there is no much gain in >> adding header to define it. > > I expected it to use it in tests. But currently it is not used, right? The tests seems to use solely _dl_find_eh_frame(). Do you plan to add other tests to stress it? Also, it always called on initialization, so you can indirectly call it through _dl_find_eh_frame(). > >>> +@deftypefun {void *} _dl_find_eh_frame (void *@var{pc}) >>> +@standards{GNU, dlfcn.h} >>> +@safety{@mtsafe{}@assafe{}@acsafe{}} >>> +This function returns a pointer to the unwinding information for the >>> +object that contains the program coder @var{pc}. If the platform uses >>> +DWARF unwinding information, this is the in-memory address of the >>> +@code{PT_GNU_EH_FRAME} segment. >>> + >>> +In case @var{pc} resides in an object that lacks unwinding information, >>> +the function returns @code{NULL}. If no object matches @var{pc}, >>> +@code{NULL} is returned as well. >>> + >>> +@code{_dl_find_eh_frame} itself is thread-safe. However, if the >>> +application invokes @code{dlclose} for the object that contains @var{pc} >>> +concurrently with @code{_dl_find_eh_frame} or after the call returns, >>> +accessing the unwinding data for that object is not safe. Therefore, >>> +the application needs to ensure by other means (e.g., by convention) >>> +that @var{pc} remains a valid code address while the unwinding >>> +information is processed. >>> + >>> +This function is a GNU extension. >> >> In the code you state this interface should be async-signa-safe, but there it >> no explicit documentation about it. > > It says so in the header: > > +@safety{@mtsafe{}@assafe{}@acsafe{}} Oops, thanks. > >>> diff --git a/sysdeps/i386/bits/dlfcn_eh_frame.h >>> b/sysdeps/i386/bits/dlfcn_eh_frame.h >>> new file mode 100644 >>> index 0000000000..98f6b37029 >>> --- /dev/null >>> +++ b/sysdeps/i386/bits/dlfcn_eh_frame.h > >>> +#ifndef _DLFCN_H >>> +# error "Never use <bits/dlfcn_eh_frame.h> directly; include <dlfcn.h> >>> instead." >>> +#endif >>> + >>> +/* This implementation uses a DBASE pointer argument in >>> + _dl_find_eh_frame. */ >>> +#define DL_FIND_EH_FRAME_DBASE 1 >> >> Maybe move DL_FIND_EH_FRAME_DBASE to its own header and sets the expected >> _dl_find_eh_frame() prototype based on it (so you don't need to replicate >> it on i386 and nios2). > > I think I wrote elsewhere that I could glibc/GCC settle on a different > interface eventually, e.g. > > struct dl_object_info > { > uint64_t flags; /* Would indicate NODELETE/otherwise persistent mappings. > */ > struct link_map *map; > void *begin; > void *end; > void *eh_frame; > #if DL_OBJECT_INFO_DBASE > void *dbase; > #endif > #if DL_OBJECT_INFO_PCOUNT > unsigned long int pcount; /* Used on Arm, see __gnu_Unwind_Find_exidx. */ > #endif > }; > > int _dl_find_object (void *address, struct dl_object_info *result); > > This interface would allow some additional optimizations in libgcc_s > (e.g., cache its own eh_frame value, and the libstdc++ eh_frame value if > it's a persistent mapping). The object boundary information could be > useful in other cases. And it's trivial to implement > _dl_find_dso_for_object on top of it, eliminating a potential bottleneck > from caller-sensitive functions. > > Thanks, > Florian >
Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function
Adhemerval Zanella via Gcc-patches Tue, 23 Nov 2021 08:53:51 -0800
- [PATCH 0/3] Add _dl_find_eh_frame funct... Florian Weimer via Gcc-patches
- [PATCH 1/3] nptl: Extract <bits... Florian Weimer via Gcc-patches
- Re: [PATCH 1/3] nptl: Extract ... Adhemerval Zanella via Gcc-patches
- Re: [PATCH 1/3] nptl: Extract ... Jakub Jelinek via Gcc-patches
- Re: [PATCH 1/3] nptl: Extr... Florian Weimer via Gcc-patches
- [PATCH 2/3] elf: Introduce GLRO (d... Florian Weimer via Gcc-patches
- Re: [PATCH 2/3] elf: Introduce... Adhemerval Zanella via Gcc-patches
- [PATCH 3/3] elf: Add _dl_find_eh_f... Florian Weimer via Gcc-patches
- Re: [PATCH 3/3] elf: Add _dl_f... Adhemerval Zanella via Gcc-patches
- Re: [PATCH 3/3] elf: Add _... Florian Weimer via Gcc-patches
- Re: [PATCH 3/3] elf: A... Adhemerval Zanella via Gcc-patches
- Re: [PATCH 3/3] elf: Add _dl_f... Jakub Jelinek via Gcc-patches
- Re: [PATCH 3/3] elf: Add _... Florian Weimer via Gcc-patches
- Re: [PATCH 3/3] elf: Add _dl_f... Jakub Jelinek via Gcc-patches
- Re: [PATCH 3/3] elf: Add _... Florian Weimer via Gcc-patches