On Mon, May 4, 2015 at 7:45 AM, Michael Matz <m...@suse.de> wrote:
> Hi,
>
> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>
>> We noticed that one of our benchmarks sped up by ~1% when we eliminated
>> PLT stubs for some of the hot external library functions like memcmp,
>> pow.  The win was from better icache and itlb performance.  The main
>> reason was that the PLT stubs had no spatial locality with the
>> call-sites.  I have started looking at ways to tell the compiler to
>> eliminate PLT stubs (in effect inline them) for specified external
>> functions, for x86_64.  I have a proposal and a patch and I would like to
>> hear what you think.
>>
>> This comes with caveats.  This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if a
>> function is "truly extern" (defined in a shared library).  If a function
>> is not truly extern (ends up defined in the final executable), then
>> calling it indirectly is a performance penalty as it could have been a
>> direct call.
>
> This can be fixed by Alan's idea.
>
>> Further, the newly created GOT entries are fixed up at
>> start-up and do not get lazily bound.
>
> And this can be fixed by some enhancements in the linker and dynamic
> linker.  The idea is to still generate a PLT stub and make its GOT entry
> point to it initially (like a normal got.plt slot).  Then the first
> indirect call will use the address of the PLT entry (starting lazy
> resolution) and update the GOT slot with the real address, so further
> indirect calls will directly go to the function.
>
> This requires a new asm marker (and hence new reloc) as normally if
> there's a GOT slot it's filled by the real symbol's address, unlike if
> there's only a got.plt slot.  E.g. a
>
>   call *foo@GOTPLT(%rip)
>
> would generate a GOT slot (and fill its address into the above call insn),
> but generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>
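
For readers following the lazy-binding scheme quoted above, here is a
rough C model of it.  This is only a sketch under stated assumptions: the
names got_slot, plt_stub_foo, real_foo and resolve_count are made up for
illustration, and a plain function pointer stands in for the GOT entry
that "call *foo@GOTPLT(%rip)" would read.  It is not what the linker or
ld.so actually emits.

```c
/* Hypothetical model of a lazily bound GOT slot; all names are invented. */

static int resolve_count = 0;   /* how often the "resolver" has run */

/* Stands in for the real definition of foo in some shared library. */
static int real_foo(int x)
{
    return x + 1;
}

static int plt_stub_foo(int x);

/* The slot initially points at the PLT stub, like a normal .got.plt slot. */
static int (*got_slot)(int) = plt_stub_foo;

/* Stands in for the PLT entry plus the dynamic linker's lazy resolver. */
static int plt_stub_foo(int x)
{
    resolve_count++;
    got_slot = real_foo;    /* patch the slot with the real address */
    return real_foo(x);     /* then continue into the real function */
}
```

Calling got_slot(1) the first time goes through plt_stub_foo and
rewrites the slot; every later got_slot(...) call reaches real_foo
directly, which mirrors the "first indirect call resolves, further calls
go straight to the function" behaviour described in the quote.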
I added "relax" prefix support to the x86 assembler on the
users/hjl/relax branch at

https://sourceware.org/git/?p=binutils-gdb.git;a=summary

[hjl@gnu-tools-1 relax-3]$ cat r.S
        .text
        relax jmp foo
        relax call foo
        relax jmp foo@plt
        relax call foo@plt
[hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
[hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o

r.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:   66 e9 00 00 00 00       data16 jmpq 0x6     2: R_X86_64_RELAX_PC32 foo-0x4
   6:   66 e8 00 00 00 00       data16 callq 0xc    8: R_X86_64_RELAX_PC32 foo-0x4
   c:   66 e9 00 00 00 00       data16 jmpq 0x12    e: R_X86_64_RELAX_PLT32 foo-0x4
  12:   66 e8 00 00 00 00       data16 callq 0x18   14: R_X86_64_RELAX_PLT32 foo-0x4
[hjl@gnu-tools-1 relax-3]$

Right now, the relax relocations are treated as PC32/PLT32 relocations.
I am working on linker support.

--
H.J.