https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80881
--- Comment #83 from Julian Waters <tanksherman27 at gmail dot com> ---

Liu Hao: The registers it's using seem to be all over the place. Previously it was using rdx for the gs:[88] load and rax for everything else; now it's either using any register it can find, or using rdx to hold the value loaded from (rdx+rax*8). I have no idea why the resulting assembly is so different, but this could mean the resulting program runs less efficiently.

EDIT: Never mind, it was because rax is the return value register and the thread local is an array:

    extern _Thread_local int local;

    int get(void) {
        return local;
    }

    movl _tls_index(%rip), %eax
    movq %gs:88, %rdx
    movq (%rdx,%rax,8), %rax
    movl local@secrel32(%rax), %eax

    extern _Thread_local int local[8];

    int get(void) {
        return local[2] + local[4];
    }

    movl _tls_index(%rip), %eax
    movq %gs:88, %rdx
    movq (%rdx,%rax,8), %rdx
    movl 16+local@secrel32(%rdx), %eax
    addl 8+local@secrel32(%rdx), %eax

Uros: I see, I'll try to do so. I was mainly avoiding that in order to break less code (I have a habit of breaking anything I touch). That said, the resulting assembly (barring the register selection) already seems to be as compact as possible for Windows, so I'm not sure how using get_thread_pointer could make it any more optimal. This is a genuinely curious question; I'm not doubting that get_thread_pointer can help optimize the resulting assembly.
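
For readers following along, here is a rough, portable C model of what the four-instruction sequence above appears to compute. It is only a sketch under stated assumptions: `struct sim_thread`, `sim_tls_index`, `sim_local_secrel` and the `sim_*` functions are hypothetical stand-ins invented for illustration (they are not GCC or Windows names), and the per-thread array that `%gs:88` points to is simulated with an ordinary struct member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SIM_MODULES = 4, SIM_INTS_PER_BLOCK = 16 };

/* Hypothetical model of one thread's TLS state (illustration only). */
struct sim_thread {
    void *tls_array[SIM_MODULES];   /* stands in for the array %gs:88 points to */
    int block[SIM_INTS_PER_BLOCK];  /* stands in for this module's TLS block */
};

/* Assumed values, chosen arbitrarily for the simulation. */
static const uint32_t sim_tls_index = 1;   /* stands in for _tls_index */
static const size_t sim_local_secrel = 8;  /* stands in for local@secrel32 */

/* Register this module's block in the simulated per-thread array. */
void sim_thread_init(struct sim_thread *t) {
    memset(t, 0, sizeof *t);
    t->tls_array[sim_tls_index] = t->block;
}

/* Models: movl _tls_index(%rip), %eax ; movq %gs:88, %rdx ;
 *         movq (%rdx,%rax,8), %rdx
 * i.e. index the per-thread array to get the module's block base. */
static unsigned char *sim_block_base(const struct sim_thread *t) {
    assert(sim_tls_index < SIM_MODULES);
    return (unsigned char *)t->tls_array[sim_tls_index];
}

/* Address of local[0]: block base plus the section-relative offset. */
int *sim_local(struct sim_thread *t) {
    return (int *)(sim_block_base(t) + sim_local_secrel);
}

/* Models get() for the array case: both loads share one block base,
 * which is why the base has to stay live in a register other than the
 * one accumulating the result. */
int sim_get(struct sim_thread *t) {
    int *local = sim_local(t);
    return local[2] + local[4];
}
```

In this model the block base is computed once and reused for both element loads, which matches the observation that the array version keeps the base in rdx while eax accumulates the sum.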