The ELF-v2 ISA 3.1 support for Power10 has relocations to optimize cases where the code references an external variable in only one location. This patch is similar to the optimizations that the linker already does to optimize TOC accesses.

I will be submitting 3 patches as follow-ups to this message:

  * The first patch adds support for PCREL_OPT loads;
  * The second patch adds support for PCREL_OPT stores; and
  * The third patch adds the tests.

If the program is compiled to be the main program, and the variable is defined in the main program, these relocations let the linker replace the two-instruction sequence (load the address of the external variable, then do a load or store through that address) with a single prefixed load or store, converting the second instruction into a NOP.

For example, consider the following program:

        extern int ext_variable;

        int ret_var (void)
        {
          return ext_variable;
        }

        void store_var (int i)
        {
          ext_variable = i;
        }

Currently on power10, the compiler compiles this as:

        ret_var:
                pld 9,ext_variable@got@pcrel
                lwa 3,0(9)
                blr

        store_var:
                pld 9,ext_variable@got@pcrel
                stw 3,0(9)
                blr

That is, it loads up the address of 'ext_variable' from the GOT into register r9, and then uses r9 as a base register to reference the actual variable.

The linker already optimizes the case where you are compiling the main program and the variable is also defined in the main program, producing:

        ret_var:
                pla 9,ext_variable,1
                lwa 3,0(9)
                blr

        store_var:
                pla 9,ext_variable,1
                stw 3,0(9)
                blr

These patches generate:

        ret_var:
                pld 9,ext_variable@got@pcrel
        .Lpcrel1:
                .reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
                lwa 3,0(9)
                blr

        store_var:
                pld 9,ext_variable@got@pcrel
        .Lpcrel2:
                .reloc .Lpcrel2-8,R_PPC64_PCREL_OPT,.-(.Lpcrel2-8)
                stw 3,0(9)
                blr

Note, the label for locating the PLD occurs after the PLD and not before it. This is so that if the assembler adds a NOP in front of the PLD to align it, the relocations will still work.

If the linker can, it will convert the code into:

        ret_var:
                plwa 3,ext_variable,1
                nop
                blr

        store_var:
                pstw 3,ext_variable,1
                nop
                blr

These patches allow the load of the address to not be physically adjacent to the actual load or store, which should allow for better code.

For loads, there must be no references to the register that is being loaded between the PLD and the actual load.

For stores, it becomes a little trickier, in that the register being stored must be live at the time the PLD instruction is done, and it must continue to be live and unmodified between the PLD and the store.

For both loads and stores, there must be only one reference to the address being loaded into a base register, and that base register must die at the point of the load/store.

In order to do this, the pass that converts the load address and load/store must occur late in the compilation cycle. In particular, the second scheduler pass will duplicate and optimize some of the references, which would produce an invalid program if the conversion had already been done.

In the past, Segher has said that we should be able to move it earlier. I have my doubts whether that is feasible. What I would like to do is put these patches into GCC 11, which will enable many of the cases that we want to optimize. Then somebody else can take a swing at moving the optimization earlier. That way, even if we can't get the super optimized code to work, we will at least get the majority of cases to work.

For reference, here is what the current compiler generates for a medium code model system targeting power9 with the TOC support:

                .section ".toc","aw"
        .LC0:
                .quad ext_variable

                .section ".text"
        ret_var:
        .LCF0:
        0:      addis 2,12,.TOC.-.LCF0@ha
                addi 2,2,.TOC.-.LCF0@l
                .localentry ret_var,.-ret_var
                addis 9,2,.LC0@toc@ha
                ld 9,.LC0@toc@l(9)
                lwa 3,0(9)
                blr

                .section ".toc","aw"
                .set .LC1,.LC0

                .section ".text"
        store_var:
        .LCF1:
        0:      addis 2,12,.TOC.-.LCF1@ha
                addi 2,2,.TOC.-.LCF1@l
                .localentry store_var,.-store_var
                addis 9,2,.LC1@toc@ha
                ld 9,.LC1@toc@l(9)
                stw 3,0(9)
                blr

And the linker optimizes this to:

        ret_var:
                lis 2,.TOC@ha
                addi 2,2,.TOC@l
                .localentry ret_var,.-ret_var
                nop                     ; addis eliminated due to small TOC
                addi 9,2,<offset>       ; ld converted into addi
                lwa 3,0(9)              ; actual load

        store_var:
                lis 2,.TOC@ha
                addi 2,2,.TOC@l
                .localentry store_var,.-store_var
                nop                     ; addis eliminated due to small TOC
                addi 9,2,<offset>       ; ld converted into addi
                stw 3,0(9)              ; actual store

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797