On 17/11/2021 10:40, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> However the code is somewhat complex and I would like to have some feedback
>> if gcc will be willing to accept this change (I assume it would require
>> this code merge on glibc beforehand).
> 
> There's a long review queue on the GCC side due to the stage1 close.
> It may still be considered for GCC 12.  Jakub has also requested that
> we hold off committing the glibc side until the GCC side is reviewed.
> 
> I'll flesh out the commit message and NEWS entry once we have agreed
> upon the interface.
> 
>>> new file mode 100644
>>> index 0000000000..c7313c122d
>>> --- /dev/null
>>> +++ b/elf/dl-find_eh_frame.c
> 
>>> +/* Data for the main executable.  There is usually a large gap between
>>> +   the main executable and initially loaded shared objects.  Record
>>> +   the main executable separately, to increase the chance that the
>>> +   range for the non-closeable mappings below covers only the shared
>>> +   objects (and not also the gap between main executable and shared
>>> +   objects).  */
>>> +static uintptr_t _dl_eh_main_map_start attribute_relro;
>>> +static struct dl_eh_frame_info _dl_eh_main_info attribute_relro;
>>> +
>>> +/* Data for initally loaded shared objects that cannot be unlaoded.
>>
>> s/initally/initially and s/unlaoded/unloaded.
> 
> Fixed.
> 
>>
>>> +   The mapping base addresses are stored in address order in the
>>> +   _dl_eh_nodelete_mappings_bases array (containing
>>> +   _dl_eh_nodelete_mappings_size elements).  The EH data for a base
>>> +   address is stored in the parallel _dl_eh_nodelete_mappings_infos.
>>> +   These arrays are not modified after initialization.  */
>>> +static uintptr_t _dl_eh_nodelete_mappings_end attribute_relro;
>>> +static size_t _dl_eh_nodelete_mappings_size attribute_relro;
>>> +static uintptr_t *_dl_eh_nodelete_mappings_bases attribute_relro;
>>> +static struct dl_eh_frame_info *_dl_eh_nodelete_mappings_infos
>>> +  attribute_relro;
>>> +
>>> +/* Mappings created by dlopen can go away with dlclose, so a data
>>> +   dynamic data structure with some synchronization is needed.
>>
>> This sounds strange ("a data dynamic data").
> 
> I dropped the first data.
> 
>>
>>> +   Individual segments are similar to the _dl_eh_nodelete_mappings
>>
>> Maybe use _dl_eh_nodelete_mappings_*, because '_dl_eh_nodelete_mappings'
>> itself if not defined anywhere.
> 
> Right.
> 
>>> +   Adding new elements to this data structure is another source of
>>> +   quadratic behavior for dlopen.  If the other causes of quadratic
>>> +   behavior are eliminated, a more complicated data structure will be
>>> +   needed.  */
>>
>> This worries me, specially we have reports that python and other dynamic
>> environments do use a lot of plugin and generates a lot of dlopen() calls.
>> What kind of performance implication do you foresee here?
> 
> The additional overhead is not disproportionate to the other sources of
> quadratic behavior.  With 1,000 dlopen'ed objects, overall run-time
> seems to be comparable to the strcmp time required soname matching, for
> example, and is quite difficult to measure.  So we could fix the
> performance regression if we used a hash table for that …
> 
> It's just an undesirable complexity class.  The implementation is not
> actually slow because it's a mostly-linear copy (although a backwards
> one).  Other parts of dlopen involve pointer chasing and are much
> slower.

Right, I agree this should probably won't incur in performance issues,
I was curious if you have any numbers about it.

> 
>>> +/* Allocate an empty segment that is at least SIZE large.  PREVIOUS */
>>
>> What this PREVIOUS refer to?
> 
> Oops, it's now:
> 
> /* Allocate an empty segment that is at least SIZE large.  PREVIOUS
>    points to the chain of previously allocated segments and can be
>    NULL.  */
> 
>>> +/* Update the version to reflect that an update is happening.  This
>>> +   does not change the bit that controls the active segment chain.
>>> +   Returns the index of the currently active segment chain.  */
>>> +static inline unsigned int
>>> +_dl_eh_mappings_begin_update (void)
>>> +{
>>> +  unsigned int v
>>> +    = __atomic_wide_counter_fetch_add_relaxed 
>>> (&_dl_eh_loaded_mappings_version,
>>> +                                               2);
>>
>> Why use an 'unsigned int' for the wide counter here?
> 
> Because …
> 
>>> +  /* Subsequent stores to the TM data must not be reordered before the
>>> +     store above with the version update.  */
>>> +  atomic_thread_fence_release ();
>>> +  return v & 1;
>>> +}
> 
> … we only need the lower bit.

Ack, I guess it won't matter to compiler.

> 
>>> +  /* Other initially loaded objects.  */
>>> +  if (pc >= *_dl_eh_nodelete_mappings_bases
>>> +      && pc < _dl_eh_nodelete_mappings_end)
>>> +    {
>>> +      size_t idx = _dl_eh_find_lower_bound (pc,
>>> +                                            _dl_eh_nodelete_mappings_bases,
>>> +                                            _dl_eh_nodelete_mappings_size);
>>> +      const struct dl_eh_frame_info *info
>>> +        = _dl_eh_nodelete_mappings_infos + idx;
>>
>> Ins't a UB if idx is not a valid one?
> 
> idx is always valid here.
> 
>>> +      bool match;
>>> +      if (idx < _dl_eh_nodelete_mappings_size
>>> +          && pc == _dl_eh_nodelete_mappings_bases[idx])
>>> +        match = true;
>>> +      else
>>> +        {
>>> +          /* PC might be in the previous mapping.  */
>>> +          --idx;
>>> +          --info;
>>> +          match = pc - _dl_eh_nodelete_mappings_bases[idx] < info->size;
>>> +        }
>>
>> It is not clear to me why _dl_eh_find_lower_bound() can not handle this
>> specific exception, since this is also used in the audit, dlopen case
>> below.
> 
> The struct layouts are different.  Maybe I should use a segment
> structure for _dl_eh_nodelete_mappings_bases as well?

I am not sure, it is just that check does seems similar and it is used on
a couple of places. Maybe you can consolidate it.

> 
>>> +  /* Handle audit modules, dlopen, dlopen objects.  This uses software
>>> +     transactional memory, with a retry loop in case the version
>>> +     changes during execution.  */
>>> +  while (true)
>>> +    {
>>> +    retry:
>>> +      ;
>>> +      uint64_t start_version = _dl_eh_read_start_version ();
> 
>>> +      for (struct dl_eh_mappings_segment *seg
>>> +             = _dl_eh_mappings_active_segment (start_version);
>>> +           seg != NULL && seg->size > 0; seg = seg->previous)
>>> +        if (pc >= seg->bases[0])
>>> +          {
>>> +            /* PC may lie within this segment.  If it is less than the
> 
>>> +            if (match)
>>> +              {
>>> +                /* Found the right mapping.  Copy out the data prior to
>>> +                   checking if the read transaction was successful.  */
> 
>>> +                if (_dl_eh_read_success (start_version))
> 
>>> +                else
>>> +                  /* Read transaction failure.  */
>>> +                  goto retry;
>>
>> Why not 'continue' here?
> 
> We need to re-run the outer while loop, not the inner segment
> iteration.

Indeed.

> 
>>> +/* _dl_eh_process_initial is called twice.  First to compute the array
>>> +   sizes from the initial loaded mappings.  Second to fill in the
>>> +   bases and infos arrays with the (still unsorted) data.  Returns the
>>> +   number of loaded (non-nodelete) mappings.  */
>>
>> I usually see this called twice functionn confunsing because it not clear
>> which is the default patch used for each call (although the comments do
>> help it).
>>
>> Maybe you can add two function, one that calculates the initial size and
>> another that actually populates it.  It allows, for instance, to call 
>> the initial_map _dl_get_eh_frame() once (you will need to keep the
>> struct dl_eh_frame_info between calls).
> 
> I started out with this and found it more confusing because of the
> complicated conditions (apart from the phase distinction).

Right.

> 
>>> +static void
>>> +_dl_eh_frame_link_map_sort (struct link_map **loaded, size_t size)
>>> +{
>>> +  /* Selection sort based on map_start.  */
>>> +  if (size < 2)
>>> +    return;
>>> +  for (size_t i = 0; i < size - 1; ++i)
>>> +    {
>>> +      /* Find minimum.  */
>>> +      size_t min_idx = i;
>>> +      ElfW(Addr) min_val = loaded[i]->l_map_start;
>>> +      for (size_t j = i + 1; j < size; ++j)
>>> +        if (loaded[j]->l_map_start < min_val)
>>> +          {
>>> +            min_idx = j;
>>> +            min_val = loaded[j]->l_map_start;
>>> +          }
>>> +
>>> +      /* Swap into place.  */
>>> +      struct link_map *tmp = loaded[min_idx];
>>> +      loaded[min_idx] = loaded[i];
>>> +      loaded[i] = tmp;
>>> +    }
>>> +}
>>
>> Could this be a problem with programs with a lot of dependencies?
> 
> I don't see it in profiles.  It's quadratic for sure, but mostly linear
> accesses, and the array fits into the L1 cache.  We could optimize it
> and check if the array is already sorted in reverse order, reflecting
> the typical kernel mapping behavior.

If this is not really a hotspot I would prefer the simplest code.

> 
>>> +/* Invoked from _dl_eh_frame_frame after sorting.  */
>>> +static bool
>>> +_dl_find_eh_frame_update_1 (struct link_map **loaded, size_t count)
>>> +{
>>> +  int active_idx = _dl_eh_mappings_begin_update ();
>>> +
>>> +  struct dl_eh_mappings_segment *current_seg
>>> +    = _dl_eh_loaded_mappings[active_idx];
>>> +  size_t current_used = _dl_eh_mappings_segment_count_used (current_seg);
>>> +
>>> +  struct dl_eh_mappings_segment *target_seg
>>> +    = _dl_eh_loaded_mappings[!active_idx];
>>> +  size_t remaining_to_add = current_used + count;
>>> +
>>> +  /* Ensure that the new segment chain has enough space.  */
>>> +  {
>>> +    size_t new_allocated
>>> +      = _dl_eh_mappings_segment_count_allocated (target_seg);
>>> +    if (new_allocated < remaining_to_add)
>>> +      {
>>> +        size_t more = remaining_to_add - new_allocated;
>>> +        target_seg = _dl_eh_mappings_segment_allocate (more, target_seg);
>>> +        if (target_seg == NULL)
>>> +          /* Out of memory.  */
>>> +          return false;
>>
>> Shouldn't it have a _dl_eh_mappings_end_update() here?
> 
> No, we should keep the original version.  I added:
> 
>           /* Out of memory.  Do not end the update and keep the
>              current version unchanged.  */
> 
> We could split the counter update and current version access, and start
> the update only after the malloc, I think.  This would make it less
> likely that a concurrent thread sees multiple version updates, causing
> multiple retries, I think.

This sounds a nice improvement, although I think malloc failure would
be unlikely on most systems with overcommit.

> 
>>> +bool
>>> +_dl_find_eh_frame_update (struct link_map *new_map)
>>> +{
>>> +  /* Copy the newly-loaded link maps into an array for sorting.  */
>>> +  size_t count = 0;
>>> +  for (struct link_map *l = new_map; l != NULL; l = l->l_next)
>>> +    count += !l->l_eh_frame_processed;
>>> +  struct link_map **map_array = malloc (count * sizeof (*map_array));
>>
>> Maybe use a scratch_buffer here? It should cover 128 entries for 64-bit,
> 
> This is only used for dlopen, and dlopen already has many malloc calls.
> I don't think it matters.

Ok.

>>> diff --git a/elf/dl-find_eh_frame.h b/elf/dl-find_eh_frame.h
>>> new file mode 100644
>>> index 0000000000..4bde9b14db
>>> --- /dev/null
>>> +++ b/elf/dl-find_eh_frame.h
> 
>>> +/* Extract the exception handling data from a link map and writes it
>>> +   to *INFO.  If no such data is available INFO->eh_frame will be
>>> +   NULL.  */
>>> +static void __attribute__ ((unused))
>>
>> Maybe use inline here and let compiler optimize?
> 
> I don't think inlining is beneficial.
> 
>> Or move to the be a proper function?
> 
> It's also used from tests, so it's convenient to have it in a header.

I usually don't like to resort extra attributes to avoid compiler warnings
when we do have a standard solution. But I don't have a strong preference
here.

> 
>>> +/* This function is similar to _dl_find_eh_frame, but travers the link
>>> +   maps directly.  It is used from audit modules before
>>> +   _dl_find_eh_frame_init has been called, and for testing.  */
>>> +static void *
>>> +_dl_find_eh_frame_slow (void *pc
>>> +#if DL_FIND_EH_FRAME_DBASE
>>> +                   , void **dbase
>>> +#endif
>>> +                        )
>>> +{
>>> +  ElfW(Addr) addr = (ElfW(Addr)) pc;
>>> +  for (Lmid_t ns = 0; ns < GL(dl_nns); ++ns)
>>> +    for (struct link_map *l = GL(dl_ns)[ns]._ns_loaded; l != NULL;
>>> +         l = l->l_next)
>>> +      if (addr >= l->l_map_start && addr < l->l_map_end
>>> +          && (l->l_contiguous || _dl_addr_inside_object (l, addr)))
>>> +        {
>>> +          assert (ns == l->l_ns);
>>> +          struct dl_eh_frame_info info;
>>> +          _dl_get_eh_frame (l, &info);
>>> +#if DL_FIND_EH_FRAME_DBASE
>>> +          *dbase = info.dbase;
>>> +#endif
>>> +          return info.eh_frame;
>>> +        }
>>> +
>>> +  /* Object not found.  */
>>> +  return NULL;
>>> +}
>>
>> This is only used on dl-find-eh_frame.c, so there is no much gain in
>> adding header to define it.
> 
> I expected it to use it in tests.

But currently it is not used, right?  The tests seems to use solely
_dl_find_eh_frame().

Do you plan to add other tests to stress it?  Also, it always called
on initialization, so you can indirectly call it through 
_dl_find_eh_frame().

> 
>>> +@deftypefun {void *} _dl_find_eh_frame (void *@var{pc})
>>> +@standards{GNU, dlfcn.h}
>>> +@safety{@mtsafe{}@assafe{}@acsafe{}}
>>> +This function returns a pointer to the unwinding information for the
>>> +object that contains the program coder @var{pc}.  If the platform uses
>>> +DWARF unwinding information, this is the in-memory address of the
>>> +@code{PT_GNU_EH_FRAME} segment.
>>> +
>>> +In case @var{pc} resides in an object that lacks unwinding information,
>>> +the function returns @code{NULL}.  If no object matches @var{pc},
>>> +@code{NULL} is returned as well.
>>> +
>>> +@code{_dl_find_eh_frame} itself is thread-safe.  However, if the
>>> +application invokes @code{dlclose} for the object that contains @var{pc}
>>> +concurrently with @code{_dl_find_eh_frame} or after the call returns,
>>> +accessing the unwinding data for that object is not safe.  Therefore,
>>> +the application needs to ensure by other means (e.g., by convention)
>>> +that @var{pc} remains a valid code address while the unwinding
>>> +information is processed.
>>> +
>>> +This function is a GNU extension.
>>
>> In the code you state this interface should be async-signa-safe, but there it
>> no explicit documentation about it.
> 
> It says so in the header:
> 
> +@safety{@mtsafe{}@assafe{}@acsafe{}}

Oops, thanks.

> 
>>> diff --git a/sysdeps/i386/bits/dlfcn_eh_frame.h 
>>> b/sysdeps/i386/bits/dlfcn_eh_frame.h
>>> new file mode 100644
>>> index 0000000000..98f6b37029
>>> --- /dev/null
>>> +++ b/sysdeps/i386/bits/dlfcn_eh_frame.h
> 
>>> +#ifndef _DLFCN_H
>>> +# error "Never use <bits/dlfcn_eh_frame.h> directly; include <dlfcn.h> 
>>> instead."
>>> +#endif
>>> +
>>> +/* This implementation uses a DBASE pointer argument in
>>> +   _dl_find_eh_frame.  */
>>> +#define DL_FIND_EH_FRAME_DBASE 1
>>
>> Maybe move DL_FIND_EH_FRAME_DBASE to its own header and sets the expected
>> _dl_find_eh_frame() prototype based on it (so you don't need to replicate
>> it on i386 and nios2).
> 
> I think I wrote elsewhere that I could glibc/GCC settle on a different
> interface eventually, e.g.
> 
> struct dl_object_info
> {
>   uint64_t flags; /* Would indicate NODELETE/otherwise persistent mappings.  
> */
>   struct link_map *map;
>   void *begin;
>   void *end;
>   void *eh_frame;
> #if DL_OBJECT_INFO_DBASE
>   void *dbase;
> #endif
> #if DL_OBJECT_INFO_PCOUNT
>   unsigned long int pcount;  /* Used on Arm, see __gnu_Unwind_Find_exidx.  */
> #endif
> };
> 
> int _dl_find_object (void *address, struct dl_object_info *result);
> 
> This interface would allow some additional optimizations in libgcc_s
> (e.g., cache its own eh_frame value, and the libstdc++ eh_frame value if
> it's a persistent mapping).  The object boundary information could be
> useful in other cases.  And it's trivial to implement
> _dl_find_dso_for_object on top of it, eliminating a potential bottleneck
> from caller-sensitive functions.
> 
> Thanks,
> Florian
> 

Reply via email to