On Sun, May 10, 2015 at 10:01 AM, Sriraman Tallam <tmsri...@google.com> wrote: > > On Sun, May 10, 2015, 8:19 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Sat, May 9, 2015 at 9:34 AM, H.J. Lu <hjl.to...@gmail.com> wrote: >> On Mon, May 4, 2015 at 7:45 AM, Michael Matz <m...@suse.de> wrote: >>> Hi, >>> >>> On Thu, 30 Apr 2015, Sriraman Tallam wrote: >>> >>>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated >>>> PLT stubs for some of the hot external library functions like memcmp, >>>> pow. The win was from better icache and itlb performance. The main >>>> reason was that the PLT stubs had no spatial locality with the >>>> call-sites. I have started looking at ways to tell the compiler to >>>> eliminate PLT stubs (in-effect inline them) for specified external >>>> functions, for x86_64. I have a proposal and a patch and I would like to >>>> hear what you think. >>>> >>>> This comes with caveats. This cannot be generally done for all >>>> functions marked extern as it is impossible for the compiler to say if a >>>> function is "truly extern" (defined in a shared library). If a function >>>> is not truly extern(ends up defined in the final executable), then >>>> calling it indirectly is a performance penalty as it could have been a >>>> direct call. >>> >>> This can be fixed by Alans idea. >>> >>>> Further, the newly created GOT entries are fixed up at >>>> start-up and do not get lazily bound. >>> >>> And this can be fixed by some enhancements in the linker and dynamic >>> linker. The idea is to still generate a PLT stub and make its GOT entry >>> point to it initially (like a normal got.plt slot). Then the first >>> indirect call will use the address of PLT entry (starting lazy >>> resolution) >>> and update the GOT slot with the real address, so further indirect calls >>> will directly go to the function. >>> >>> This requires a new asm marker (and hence new reloc) as normally if >>> there's a GOT slot it's filled by the real symbols address, unlike if >>> there's only a got.plt slot. E.g. a >>> >>> call *foo@GOTPLT(%rip) >>> >>> would generate a GOT slot (and fill its address into above call insn), >>> but >>> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one. >>> >> >> I added the "relax" prefix support to x86 assembler on users/hjl/relax >> branch >> >> at >> >> https://sourceware.org/git/?p=binutils-gdb.git;a=summary >> >> [hjl@gnu-tools-1 relax-3]$ cat r.S >> .text >> relax jmp foo >> relax call foo >> relax jmp foo@plt >> relax call foo@plt >> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S >> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o >> >> r.o: file format elf64-x86-64 >> >> >> Disassembly of section .text: >> >> 0000000000000000 <.text>: >> 0: 66 e9 00 00 00 00 data16 jmpq 0x6 2: R_X86_64_RELAX_PC32 foo-0x4 >> 6: 66 e8 00 00 00 00 data16 callq 0xc 8: R_X86_64_RELAX_PC32 >> foo-0x4 >> c: 66 e9 00 00 00 00 data16 jmpq 0x12 e: >> R_X86_64_RELAX_PLT32foo-0x4 >> 12: 66 e8 00 00 00 00 data16 callq 0x18 14: >> R_X86_64_RELAX_PLT32foo-0x4 >> [hjl@gnu-tools-1 relax-3]$ >> >> Right now, the relax relocations are treated as PC32/PLT32 relocations. >> I am working on linker support. >> > > I implemented the linker support for x86-64: > > 00000000 <main>: > 0: 48 83 ec 08 sub $0x8,%rsp > 4: e8 00 00 00 00 callq 9 <main+0x9> 5: R_X86_64_PC32 plt-0x4 > 9: e8 00 00 00 00 callq e <main+0xe> a: R_X86_64_PLT32 plt-0x4 > e: e8 00 00 00 00 callq 13 <main+0x13> f: R_X86_64_PC32 bar-0x4 > 13: 66 e8 00 00 00 00 data16 callq 19 <main+0x19> 15: > R_X86_64_RELAX_PC32 bar-0x4 > 19: 66 e8 00 00 00 00 data16 callq 1f <main+0x1f> 1b: > R_X86_64_RELAX_PLT32 bar-0x4 > 1f: 66 e8 00 00 00 00 data16 callq 25 <main+0x25> 21: > R_X86_64_RELAX_PC32 foo-0x4 > 25: 66 e8 00 00 00 00 data16 callq 2b <main+0x2b> 27: > R_X86_64_RELAX_PLT32 foo-0x4 > 2b: 31 c0 xor %eax,%eax > 2d: 48 83 c4 08 add $0x8,%rsp > 31: c3 retq > > 00400460 <main>: > 400460: 48 83 ec 08 sub $0x8,%rsp > 400464: e8 d7 ff ff ff callq 400440 <plt@plt> > 400469: e8 d2 ff ff ff callq 400440 <plt@plt> > 40046e: e8 ad ff ff ff callq 400420 <bar@plt> > 400473: ff 15 ff 03 20 00 callq *0x2003ff(%rip) # 600878 > <_DYNAMIC+0xf8> > 400479: ff 15 f9 03 20 00 callq *0x2003f9(%rip) # 600878 > <_DYNAMIC+0xf8> > 40047f: 66 e8 f3 00 00 00 data16 callq 400578 <foo> > 400485: 66 e8 ed 00 00 00 data16 callq 400578 <foo> > 40048b: 31 c0 xor %eax,%eax > 40048d: 48 83 c4 08 add $0x8,%rsp > 400491: c3 retq > > Sriraman, can you give it a try?
I like HJ's proposal here and it is important that the linker fixes unnecessary indirect calls to direct ones. However, independently I think my original proposal is still useful and I want to pitch it again for the following reasons. AFAIU, Alexander Monakov's -fno-plt does not solve the following: * Does not do anything for non-PIC code. The compiler does not generate a @PLT call but the linker will route all external calls via PLT. We noticed a problem with non-PIC executables where the PLT stubs were causing too many icache misses and are interested in a solution for this. * Aggressively uses indirect calls even if the final symbol is not truly external. This needs HJ's linker patch to fix unnecessary indirect calls to direct calls. My original proposal, for x86_64 only, was to add -fno-plt=<function-name>. This lets the user decide for which functions PLT must be avoided. Let the compiler always generate an indirect call using call *func@GOTPCREL(%rip). We could do this for non-PIC code too. No need for linker fixups since this relies on the user to know that func is from a shared object. I am reattaching the patch. Thanks Sri > > Thanks! Will do! > > Sri > > -- > H.J. > >
* common.opt (-fno-plt=): New option. * config/i386/i386.c (avoid_plt_to_call): New function. (ix86_output_call_insn): Check if PLT needs to be avoided and call or jump indirectly if true. * opts-global.c (htab_str_eq): New function. (avoid_plt_fnsymbol_names_tab): New htab. (handle_common_deferred_options): Handle -fno-plt= Index: common.opt =================================================================== --- common.opt (revision 222892) +++ common.opt (working copy) @@ -1087,6 +1087,11 @@ fdbg-cnt= Common RejectNegative Joined Var(common_deferred_options) Defer -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...] Set the debug counter limit. +fno-plt= +Common RejectNegative Joined Var(common_deferred_options) Defer +-fno-plt=<symbol1> Avoid going through the PLT when calling the specified function. +Allow multiple instances of this option with different function names. + fdebug-prefix-map= Common Joined RejectNegative Var(common_deferred_options) Defer Map one directory name to another in debug information Index: config/i386/i386.c =================================================================== --- config/i386/i386.c (revision 222892) +++ config/i386/i386.c (working copy) @@ -25282,6 +25282,25 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call return call; } +extern htab_t avoid_plt_fnsymbol_names_tab; +/* If the function referenced by call_op is to a external function + and calls via PLT must be avoided as specified by -fno-plt=, then + return true. */ + +static int +avoid_plt_to_call(rtx call_op) +{ + const char *name; + if (GET_CODE (call_op) != SYMBOL_REF + || SYMBOL_REF_LOCAL_P (call_op) + || avoid_plt_fnsymbol_names_tab == NULL) + return 0; + name = XSTR (call_op, 0); + if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL) + return 1; + return 0; +} + /* Output the assembly for a call instruction. */ const char * @@ -25294,7 +25313,12 @@ ix86_output_call_insn (rtx insn, rtx call_op) if (SIBLING_CALL_P (insn)) { if (direct_p) - xasm = "jmp\t%P0"; + { + if (avoid_plt_to_call (call_op)) + xasm = "jmp\t*%p0@GOTPCREL(%%rip)"; + else + xasm = "jmp\t%P0"; + } /* SEH epilogue detection requires the indirect branch case to include REX.W. */ else if (TARGET_SEH) @@ -25346,9 +25370,15 @@ ix86_output_call_insn (rtx insn, rtx call_op) } if (direct_p) - xasm = "call\t%P0"; + { + if (avoid_plt_to_call (call_op)) + xasm = "call\t*%p0@GOTPCREL(%%rip)"; + else + xasm = "call\t%P0"; + } else xasm = "call\t%A0"; + output_asm_insn (xasm, &call_op); Index: opts-global.c =================================================================== --- opts-global.c (revision 222892) +++ opts-global.c (working copy) @@ -47,6 +47,7 @@ along with GCC; see the file COPYING3. If not see #include "xregex.h" #include "attribs.h" #include "stringpool.h" +#include "hash-table.h" typedef const char *const_char_p; /* For DEF_VEC_P. */ @@ -420,6 +421,17 @@ decode_options (struct gcc_options *opts, struct g finish_options (opts, opts_set, loc); } +/* Helper function for the hash table that compares the + existing entry (S1) with the given string (S2). */ + +static int +htab_str_eq (const void *s1, const void *s2) +{ + return !strcmp ((const char *)s1, (const char *) s2); +} + +htab_t avoid_plt_fnsymbol_names_tab = NULL; + /* Process common options that have been deferred until after the handlers have been called for all options. */ @@ -539,6 +551,15 @@ handle_common_deferred_options (void) stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg)); break; + case OPT_fno_plt_: + void **slot; + if (avoid_plt_fnsymbol_names_tab == NULL) + avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string, + htab_str_eq, NULL); + slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT); + *slot = (void *)opt->arg; + break; + default: gcc_unreachable (); }