* H. J. Lu:
> For TLS calls:
>
> 1. UNSPEC_TLS_GD:
>
> (parallel [
> (set (reg:DI 0 ax)
> (call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")))
> (const_int 0 [0])))
> (unspec:DI [(symbol_ref:DI ("e") [flags 0x50])
> (reg/f:DI 7 sp)] UNSPEC_TLS_GD)
> (clobber (reg:DI 5 di))])
>
> 2. UNSPEC_TLS_LD_BASE:
>
> (parallel [
> (set (reg:DI 0 ax)
> (call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")))
> (const_int 0 [0])))
> (unspec:DI [(reg/f:DI 7 sp)] UNSPEC_TLS_LD_BASE)])
>
> 3. UNSPEC_TLSDESC:
>
> (parallel [
> (set (reg/f:DI 104)
> (plus:DI (unspec:DI [
> (symbol_ref:DI ("_TLS_MODULE_BASE_") [flags 0x10])
> (reg:DI 114)
> (reg/f:DI 7 sp)] UNSPEC_TLSDESC)
> (const:DI (unspec:DI [
> (symbol_ref:DI ("e") [flags 0x1a])
> ] UNSPEC_DTPOFF))))
> (clobber (reg:CC 17 flags))])
>
> (parallel [
> (set (reg:DI 101)
> (unspec:DI [(symbol_ref:DI ("e") [flags 0x50])
> (reg:DI 112)
> (reg/f:DI 7 sp)] UNSPEC_TLSDESC))
> (clobber (reg:CC 17 flags))])
>
> they return the same value for the same input value. But multiple calls
> with the same input value may be generated for simple programs like:
>
> void a(long *);
> int b(void);
> void c(void);
> static __thread long e;
> long
> d(void)
> {
> a(&e);
> if (b())
> c();
> return e;
> }
>
> When compiled with -O2 -fPIC -mtls-dialect=gnu2, the following code is
> generated:
>
> .type d, @function
> d:
> .LFB0:
> .cfi_startproc
> pushq %rbx
> .cfi_def_cfa_offset 16
> .cfi_offset 3, -16
> leaq e@TLSDESC(%rip), %rbx
> movq %rbx, %rax
> call *e@TLSCALL(%rax)
> addq %fs:0, %rax
> movq %rax, %rdi
> call a@PLT
> call b@PLT
> testl %eax, %eax
> jne .L8
> movq %rbx, %rax
> call *e@TLSCALL(%rax)
> popq %rbx
> .cfi_remember_state
> .cfi_def_cfa_offset 8
> movq %fs:(%rax), %rax
> ret
> .p2align 4,,10
> .p2align 3
> .L8:
> .cfi_restore_state
> call c@PLT
> movq %rbx, %rax
> call *e@TLSCALL(%rax)
> popq %rbx
> .cfi_def_cfa_offset 8
> movq %fs:(%rax), %rax
> ret
> .cfi_endproc
>
> There are 3 "call *e@TLSCALL(%rax)". They all return the same value.
> Rename the remove_redundant_vector pass to the x86_cse pass and, for
> 64-bit, extend it to also remove redundant TLS calls to generate:
>
> d:
> .LFB0:
> .cfi_startproc
> pushq %rbx
> .cfi_def_cfa_offset 16
> .cfi_offset 3, -16
> leaq e@TLSDESC(%rip), %rax
> movq %fs:0, %rdi
> call *e@TLSCALL(%rax)
> addq %rax, %rdi
> movq %rax, %rbx
> call a@PLT
> call b@PLT
> testl %eax, %eax
> jne .L8
> movq %fs:(%rbx), %rax
> popq %rbx
> .cfi_remember_state
> .cfi_def_cfa_offset 8
> ret
> .p2align 4,,10
> .p2align 3
> .L8:
> .cfi_restore_state
> call c@PLT
> movq %fs:(%rbx), %rax
> popq %rbx
> .cfi_def_cfa_offset 8
> ret
> .cfi_endproc
>
> with only one "call *e@TLSCALL(%rax)". This reduces the number of
> __tls_get_addr calls in libgcc.a by 72%:
>
> __tls_get_addr calls     before   after
> libgcc.a                 868      243
While this is certainly nice, it does make it harder to resume
coroutines/fibers on a different thread from the one they were
suspended on. I do not know to what extent that was previously
supported for global-dynamic TLS. I recall there being other caching
going on (and certainly for errno, because __errno_location is
declared const).
If this impacts software like QEMU, is there a way to get back the old
behavior?
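
To make the concern concrete, here is a minimal sketch (not taken from the patch or from QEMU) showing that the address of a __thread variable differs from thread to thread. That is why an address computed once and reused after a fiber has migrated would refer to the wrong thread's copy. The helper names record_address_of_e and tls_address_differs_across_threads are made up for illustration, and plain pthreads stand in for a real coroutine library.

```c
#include <pthread.h>
#include <stdint.h>

static __thread long e;

/* Thread entry point: store the address of this thread's copy of e
   as an integer in the caller-provided slot.  (Hypothetical helper.) */
static void *record_address_of_e(void *arg)
{
    uintptr_t *out = arg;
    *out = (uintptr_t)&e;
    return NULL;
}

/* Returns 1 if a second thread sees e at a different address than the
   calling thread, 0 if the addresses match, -1 if the thread could not
   be created or joined.  (Hypothetical helper.) */
int tls_address_differs_across_threads(void)
{
    pthread_t t;
    uintptr_t other = 0;
    /* The value a cached TLS address computation would keep reusing. */
    uintptr_t mine = (uintptr_t)&e;

    if (pthread_create(&t, NULL, record_address_of_e, &other) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return mine != other;
}
```

Calling tls_address_differs_across_threads() from main and checking for 1 is enough to see the effect. A fiber that ran the first half of d() on one thread and the second half on another would be in the position of using "mine" where it needs "other".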