Hi Paul,
I'm just wondering what your latest thoughts are about this patch
(http://patchwork.ozlabs.org/linuxppc/patch?id=14707).

Cheers,
Emil.

> -----Original Message-----
> From: Medve Emilian
> Sent: Tuesday, November 13, 2007 10:24 AM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; linuxppc-dev@ozlabs.org;
> [EMAIL PROTECTED]
> Cc: Medve Emilian
> Subject: [PATCH v2] [POWERPC] Optimize counting distinct entries in the
> relocation sections
>
> When a module has relocation sections with tens of thousands of
> entries, counting only the distinct/unique entries (i.e. no duplicates)
> at load time can take tens of seconds and up to minutes. The sore point
> is the count_relocs() function, which is called as part of the
> architecture-specific module loading path:
>
>   -> load_module()                    generic
>      -> module_frob_arch_sections()   arch specific
>         -> get_plt_size()             32-bit
>         -> get_stubs_size()           64-bit
>            -> count_relocs()
>
> (Not sure why the relocation tables contain so many duplicates and why
> they are not trimmed at compile time by the linker. In some test cases,
> out of 35K relocation entries only 1.5K were distinct/unique.)
>
> The previous counting algorithm had O(n^2) complexity. Basically, two
> solutions were proposed on the e-mail list: a hash based approach and a
> sort based approach.
>
> The hash based approach is the fastest (O(n)), but it needs additional
> memory, and for certain corner cases it could take lots of memory due
> to the degeneration of the hash. One such proposal was submitted here:
>
> http://ozlabs.org/pipermail/linuxppc-dev/2007-June/037641.html
>
> In this proposal, the symbol + addend are hashed to generate a key, and
> a pointer to the relocation entry is stored in the matching slot. The
> hash is implemented as a linked list of memory pages, each holding
> PAGE_SIZE / sizeof(Elfxx_Rela *) entries. In case of collisions in all
> the existing pages, a new page is added to the list to accommodate the
> new distinct relocation entry.
>
> For 32-bit PowerPCs with 4K pages, a page can accommodate 1K worth of
> pointers to relocation entries. In the 35K entries scenario, as many as
> six (6) pages could be allocated, using 24K of extra memory during the
> module load.
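> For illustration only, a rough userspace sketch of that paged-hash idea
> might look like the code below. This is not the code from the proposal
> above: the hash function, the struct layout and the page bookkeeping
> are simplifying assumptions of the sketch, and error handling is
> elided.
>
> #include <elf.h>
> #include <stdlib.h>
>
> #define PAGE_SIZE 4096
> #define SLOTS_PER_PAGE (PAGE_SIZE / sizeof(Elf32_Rela *))
>
> struct hash_page {
> 	struct hash_page *next;
> 	const Elf32_Rela *slot[SLOTS_PER_PAGE];
> };
>
> static unsigned int count_relocs_hashed(const Elf32_Rela *rela,
> 					unsigned int num)
> {
> 	struct hash_page *head = calloc(1, sizeof(*head)), *p;
> 	unsigned int i, count = 0;
>
> 	if (!head)
> 		return 0;	/* error handling elided in this sketch */
>
> 	for (i = 0; i < num; i++) {
> 		/* Hash symbol + addend into a slot index */
> 		unsigned long key = ELF32_R_SYM(rela[i].r_info) ^
> 				    (unsigned long)rela[i].r_addend;
> 		unsigned int slot = key % SLOTS_PER_PAGE;
>
> 		for (p = head; p; p = p->next) {
> 			if (!p->slot[slot]) {
> 				/* Free slot: a new distinct entry */
> 				p->slot[slot] = &rela[i];
> 				count++;
> 				break;
> 			}
> 			if (ELF32_R_SYM(p->slot[slot]->r_info) ==
> 			    ELF32_R_SYM(rela[i].r_info) &&
> 			    p->slot[slot]->r_addend == rela[i].r_addend)
> 				break;	/* duplicate, already counted */
> 			if (!p->next)
> 				/* Collided in every page: grow the list */
> 				p->next = calloc(1, sizeof(*p));
> 		}
> 	}
> 	/* Freeing the page list is omitted for brevity */
> 	return count;
> }
>
> Each extra page costs PAGE_SIZE bytes, which is where the 24K
> worst-case figure above comes from.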
> The sort based approach is slower (O(n * log n) for the sort plus O(n)
> for the counting), but if the sorting is done in place it doesn't need
> additional memory. A proposal was submitted here:
>
> http://ozlabs.org/pipermail/linuxppc-dev/2007-November/045854.html
> (http://patchwork.ozlabs.org/linuxppc/patch?filter=default&id=14573)
>
> In this proposal, an array of pointers to the relocation entries is
> built and then sorted using the kernel sort() utility function. This is
> basically a heap sort algorithm with a stable O(n * log n) complexity.
> With the array sorted, counting the distinct/unique entries is just
> linear (O(n)) complexity. The problem is the extra memory needed by
> this proposal, which in the 35K relocation entries test case can be as
> much as 140K (for 32-bit PowerPCs; double that for 64-bit). This is
> much more than the memory needed by the hash based approach described
> above, but it doesn't hide potential degenerative corner cases.
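> A rough userspace sketch of this pointer-array variant follows; qsort()
> stands in for the kernel's sort(), and the function names are
> illustrative, not the ones from the proposal:
>
> #include <elf.h>
> #include <stdlib.h>
>
> static int relaptrcmp(const void *_x, const void *_y)
> {
> 	const Elf32_Rela *x = *(const Elf32_Rela * const *)_x;
> 	const Elf32_Rela *y = *(const Elf32_Rela * const *)_y;
>
> 	if (x->r_info != y->r_info)
> 		return x->r_info < y->r_info ? -1 : 1;
> 	if (x->r_addend != y->r_addend)
> 		return x->r_addend < y->r_addend ? -1 : 1;
> 	return 0;
> }
>
> static unsigned int count_relocs_sorted_ptrs(const Elf32_Rela *rela,
> 					     unsigned int num)
> {
> 	/* The extra memory: one pointer per entry, e.g. 35K entries
> 	 * * 4 bytes/pointer = 140K on 32-bit */
> 	const Elf32_Rela **p = malloc(num * sizeof(*p));
> 	unsigned int i, count = 0;
>
> 	if (!p)
> 		return 0;	/* error handling elided in this sketch */
>
> 	for (i = 0; i < num; i++)
> 		p[i] = &rela[i];
> 	qsort(p, num, sizeof(*p), relaptrcmp);
>
> 	/* After sorting, duplicates are adjacent, so one linear pass
> 	 * counts the distinct entries */
> 	for (i = 0; i < num; i++)
> 		if (i == 0 || relaptrcmp(&p[i - 1], &p[i]) != 0)
> 			count++;
>
> 	free(p);
> 	return count;
> }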
> The current patch is a happy compromise between the two proposals
> above: O(n * log n + n) performance with no additional memory
> requirements, thanks to sorting the relocation table itself in place.
>
> Signed-off-by: Emil Medve <[EMAIL PROTECTED]>
> ---
>
> This patch only tries to address counting the distinct R_PPC_REL24
> entries for the purpose of sizing the PLT. With large relocation
> tables, this operation was/can be flagged as a soft lockup by the
> kernel's detection logic.
>
> A related optimization (one that could require rewriting this patch) is
> to optimize the PLT search in do_plt_call(), but that doesn't seem to
> be a problem right now.
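> Condensed into a single self-contained function (userspace sketch;
> qsort() stands in for the kernel's sort() with the relacmp/relaswap
> callbacks from the patch below, and names are illustrative), the
> sort + count compromise amounts to something like:
>
> #include <elf.h>
> #include <stdlib.h>
>
> static int relacmp(const void *_x, const void *_y)
> {
> 	const Elf32_Rela *x = _x, *y = _y;
>
> 	/* As in the patch below: compare the whole r_info rather than
> 	 * ELF32_R_SYM(r_info) only; it's cheaper and same-symbol
> 	 * entries still end up adjacent */
> 	if (x->r_info != y->r_info)
> 		return x->r_info < y->r_info ? -1 : 1;
> 	if (x->r_addend != y->r_addend)
> 		return x->r_addend < y->r_addend ? -1 : 1;
> 	return 0;
> }
>
> static unsigned int count_distinct_rel24(Elf32_Rela *rela,
> 					 unsigned int num)
> {
> 	unsigned int i, count = 0;
> 	Elf32_Word r_sym = 0;
> 	Elf32_Sword r_addend = 0;
>
> 	/* O(n * log n), in place: no extra memory */
> 	qsort(rela, num, sizeof(*rela), relacmp);
>
> 	/* O(n): only count 24-bit relocs, skipping adjacent duplicates */
> 	for (i = 0; i < num; i++)
> 		if (ELF32_R_TYPE(rela[i].r_info) == R_PPC_REL24 &&
> 		    (r_sym != ELF32_R_SYM(rela[i].r_info) ||
> 		     r_addend != rela[i].r_addend)) {
> 			count++;
> 			r_sym = ELF32_R_SYM(rela[i].r_info);
> 			r_addend = rela[i].r_addend;
> 		}
>
> 	return count;
> }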
> The errors below are false positives, caused by checkpatch assuming
> that Elfxx_Rela is a variable/operand instead of a type:
>
> linux-2.6> scripts/checkpatch.pl 0001-POWERPC-Optimize-counting-distinct-entries-in-the.patch
> ERROR: need consistent spacing around '*' (ctx:WxV)
> #116: FILE: arch/powerpc/kernel/module_32.c:78:
> +	const Elf32_Rela *x, *y;
>  	                 ^
>
> ERROR: need consistent spacing around '*' (ctx:WxV)
> #228: FILE: arch/powerpc/kernel/module_64.c:122:
> +	const Elf64_Rela *x, *y;
>  	                 ^
>
> total: 2 errors, 0 warnings, 218 lines checked
> Your patch has style problems, please review. If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
>
>  arch/powerpc/kernel/module_32.c |   77 ++++++++++++++++++++++++++------
>  arch/powerpc/kernel/module_64.c |   81 +++++++++++++++++++++++++-------
>  2 files changed, 127 insertions(+), 31 deletions(-)
>
> diff --git a/arch/powerpc/kernel/module_32.c
> b/arch/powerpc/kernel/module_32.c
> index 07a89a3..eab3138 100644
> --- a/arch/powerpc/kernel/module_32.c
> +++ b/arch/powerpc/kernel/module_32.c
> @@ -24,6 +24,7 @@
>  #include <linux/kernel.h>
>  #include <linux/cache.h>
>  #include <linux/bug.h>
> +#include <linux/sort.h>
>
>  #include "setup.h"
>
> @@ -54,22 +55,60 @@ void module_free(struct module *mod, void *module_region)
>     addend) */
>  static unsigned int count_relocs(const Elf32_Rela *rela, unsigned int num)
>  {
> -	unsigned int i, j, ret = 0;
> -
> -	/* Sure, this is order(n^2), but it's usually short, and not
> -	   time critical */
> -	for (i = 0; i < num; i++) {
> -		for (j = 0; j < i; j++) {
> -			/* If this addend appeared before, it's
> -			   already been counted */
> -			if (ELF32_R_SYM(rela[i].r_info)
> -			    == ELF32_R_SYM(rela[j].r_info)
> -			    && rela[i].r_addend == rela[j].r_addend)
> -				break;
> +	unsigned int i, r_info, r_addend, _count_relocs;
> +
> +	_count_relocs = 0;
> +	r_info = 0;
> +	r_addend = 0;
> +	for (i = 0; i < num; i++)
> +		/* Only count 24-bit relocs, others don't need stubs */
> +		if (ELF32_R_TYPE(rela[i].r_info) == R_PPC_REL24 &&
> +		    (r_info != ELF32_R_SYM(rela[i].r_info) ||
> +		     r_addend != rela[i].r_addend)) {
> +			_count_relocs++;
> +			r_info = ELF32_R_SYM(rela[i].r_info);
> +			r_addend = rela[i].r_addend;
>  		}
> -		if (j == i) ret++;
> +
> +	return _count_relocs;
> +}
> +
> +static int relacmp(const void *_x, const void *_y)
> +{
> +	const Elf32_Rela *x, *y;
> +
> +	y = (Elf32_Rela *)_x;
> +	x = (Elf32_Rela *)_y;
> +
> +	/* Compare the entire r_info (as opposed to ELF32_R_SYM(r_info)
> +	 * only) to make the comparison cheaper/faster. It won't affect
> +	 * the sorting or the counting algorithms' performance
> +	 */
> +	if (x->r_info < y->r_info)
> +		return -1;
> +	else if (x->r_info > y->r_info)
> +		return 1;
> +	else if (x->r_addend < y->r_addend)
> +		return -1;
> +	else if (x->r_addend > y->r_addend)
> +		return 1;
> +	else
> +		return 0;
> +}
> +
> +static void relaswap(void *_x, void *_y, int size)
> +{
> +	uint32_t *x, *y, tmp;
> +	int i;
> +
> +	y = (uint32_t *)_x;
> +	x = (uint32_t *)_y;
> +
> +	for (i = 0; i < sizeof(Elf32_Rela) / sizeof(uint32_t); i++) {
> +		tmp = x[i];
> +		x[i] = y[i];
> +		y[i] = tmp;
>  	}
> -	return ret;
>  }
>
>  /* Get the potential trampolines size required of the init and
> @@ -100,6 +139,16 @@ static unsigned long get_plt_size(const Elf32_Ehdr *hdr,
>  			DEBUGP("Ptr: %p. Number: %u\n",
>  			       (void *)hdr + sechdrs[i].sh_offset,
>  			       sechdrs[i].sh_size / sizeof(Elf32_Rela));
> +
> +			/* Sort the relocation information based on a
> +			 * symbol and addend key. This is a stable
> +			 * O(n*log n) complexity algorithm but it will
> +			 * reduce the complexity of count_relocs() to
> +			 * linear complexity O(n)
> +			 */
> +			sort((void *)hdr + sechdrs[i].sh_offset,
> +			     sechdrs[i].sh_size / sizeof(Elf32_Rela),
> +			     sizeof(Elf32_Rela), relacmp, relaswap);
> +
>  			ret += count_relocs((void *)hdr
>  					     + sechdrs[i].sh_offset,
>  					     sechdrs[i].sh_size
> diff --git a/arch/powerpc/kernel/module_64.c
> b/arch/powerpc/kernel/module_64.c
> index 75c7c4f..3a82b02 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -24,6 +24,7 @@
>  #include <asm/module.h>
>  #include <asm/uaccess.h>
>  #include <asm/firmware.h>
> +#include <linux/sort.h>
>
>  #include "setup.h"
>
> @@ -81,25 +82,23 @@ static struct ppc64_stub_entry ppc64_stub =
>     different addend) */
>  static unsigned int count_relocs(const Elf64_Rela *rela, unsigned int num)
>  {
> -	unsigned int i, j, ret = 0;
> +	unsigned int i, r_info, r_addend, _count_relocs;
>
>  	/* FIXME: Only count external ones --RR */
> -	/* Sure, this is order(n^2), but it's usually short, and not
> -	   time critical */
> -	for (i = 0; i < num; i++) {
> +	_count_relocs = 0;
> +	r_info = 0;
> +	r_addend = 0;
> +	for (i = 0; i < num; i++)
>  		/* Only count 24-bit relocs, others don't need stubs */
> -		if (ELF64_R_TYPE(rela[i].r_info) != R_PPC_REL24)
> -			continue;
> -		for (j = 0; j < i; j++) {
> -			/* If this addend appeared before, it's
> -			   already been counted */
> -			if (rela[i].r_info == rela[j].r_info
> -			    && rela[i].r_addend == rela[j].r_addend)
> -				break;
> +		if (ELF64_R_TYPE(rela[i].r_info) == R_PPC_REL24 &&
> +		    (r_info != ELF64_R_SYM(rela[i].r_info) ||
> +		     r_addend != rela[i].r_addend)) {
> +			_count_relocs++;
> +			r_info = ELF64_R_SYM(rela[i].r_info);
> +			r_addend = rela[i].r_addend;
>  		}
> -		if (j == i) ret++;
> -	}
> -	return ret;
> +
> +	return _count_relocs;
>  }
>
>  void *module_alloc(unsigned long size)
> @@ -118,6 +117,44 @@ void module_free(struct module *mod, void *module_region)
>     table entries. */
>  }
>
> +static int relacmp(const void *_x, const void *_y)
> +{
> +	const Elf64_Rela *x, *y;
> +
> +	y = (Elf64_Rela *)_x;
> +	x = (Elf64_Rela *)_y;
> +
> +	/* Compare the entire r_info (as opposed to ELF64_R_SYM(r_info)
> +	 * only) to make the comparison cheaper/faster. It won't affect
> +	 * the sorting or the counting algorithms' performance
> +	 */
> +	if (x->r_info < y->r_info)
> +		return -1;
> +	else if (x->r_info > y->r_info)
> +		return 1;
> +	else if (x->r_addend < y->r_addend)
> +		return -1;
> +	else if (x->r_addend > y->r_addend)
> +		return 1;
> +	else
> +		return 0;
> +}
> +
> +static void relaswap(void *_x, void *_y, int size)
> +{
> +	uint64_t *x, *y, tmp;
> +	int i;
> +
> +	y = (uint64_t *)_x;
> +	x = (uint64_t *)_y;
> +
> +	for (i = 0; i < sizeof(Elf64_Rela) / sizeof(uint64_t); i++) {
> +		tmp = x[i];
> +		x[i] = y[i];
> +		y[i] = tmp;
> +	}
> +}
> +
>  /* Get size of potential trampolines required. */
>  static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
>  				    const Elf64_Shdr *sechdrs)
> @@ -133,6 +170,16 @@ static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
>  			DEBUGP("Ptr: %p. Number: %lu\n",
>  			       (void *)sechdrs[i].sh_addr,
>  			       sechdrs[i].sh_size / sizeof(Elf64_Rela));
> +
> +			/* Sort the relocation information based on a
> +			 * symbol and addend key. This is a stable
> +			 * O(n*log n) complexity algorithm but it will
> +			 * reduce the complexity of count_relocs() to
> +			 * linear complexity O(n)
> +			 */
> +			sort((void *)sechdrs[i].sh_addr,
> +			     sechdrs[i].sh_size / sizeof(Elf64_Rela),
> +			     sizeof(Elf64_Rela), relacmp, relaswap);
> +
>  			relocs += count_relocs((void *)sechdrs[i].sh_addr,
>  					       sechdrs[i].sh_size
>  					       / sizeof(Elf64_Rela));
> @@ -343,7 +390,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
>  			/* Simply set it */
>  			*(u32 *)location = value;
>  			break;
> -
> +
>  		case R_PPC64_ADDR64:
>  			/* Simply set it */
>  			*(unsigned long *)location = value;
> @@ -399,7 +446,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
>  		}
>
>  		/* Only replace bits 2 through 26 */
> -		*(uint32_t *)location
> +		*(uint32_t *)location
>  			= (*(uint32_t *)location & ~0x03fffffc)
>  			| (value & 0x03fffffc);
>  		break;
> --
> 1.5.3.GIT

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev