https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122118

            Bug ID: 122118
           Summary: Combining writes causes unnecessary runtime
                    relocations
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andreas.rheinhardt at outlook dot com
  Target Milestone: ---

Created attachment 62481
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62481&action=edit
small file with the same testcase as the bug report

static int a, b;
void init(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}
__attribute__((cold)) void init_cold(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}

When the above code is compiled for x64 with GCC from today (commit
c6865e7e15bc9a1337df00d2ca03604e1712a2dd) using -fpic -c -O3, the result is:

Disassembly of section .text:

0000000000000000 <init>:
   0:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 8 <init+0x8>
   7:   00 
                        4: R_X86_64_PC32        .data.rel.ro.local-0x4
   8:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # f <init+0xf>
                        b: R_X86_64_PC32        .bss-0x4
   f:   66 48 0f 6e c8          movq   %rax,%xmm1
  14:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  18:   0f 11 07                movups %xmm0,(%rdi)
  1b:   c3                      ret

Disassembly of section .text.unlikely:

0000000000000000 <init_cold>:
   0:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 8 <init_cold+0x8>
   7:   00 
                        4: R_X86_64_PC32        .data.rel.ro.local-0x4
   8:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # f <init_cold+0xf>
                        b: R_X86_64_PC32        .bss-0x4
   f:   66 48 0f 6e c8          movq   %rax,%xmm1
  14:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  18:   0f 11 07                movups %xmm0,(%rdi)
  1b:   c3                      ret

As can be seen, GCC combines the two writes into a single vector store, and this
causes several issues:
1. The most important one is that the first load cannot be relaxed to a lea at
all. GCC puts a pointer into .data.rel.ro and adds a runtime relocation for it,
both of which are of course completely unnecessary. (Given that a and b are
static here, even generating code that merely needs relaxing by the linker would
be suboptimal in itself.) A rough C sketch of what this amounts to is included
after point 3 below.
2. When -fno-tree-vectorize is additionally used, the stores are not combined
and the code for both functions looks like this:

   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7 <init_cold+0x7>
                        3: R_X86_64_PC32        .bss
   7:   48 89 07                mov    %rax,(%rdi)
   a:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 11 <init_cold+0x11>
                        d: R_X86_64_PC32        .bss-0x4
  11:   48 89 47 08             mov    %rax,0x8(%rdi)
  15:   c3                      ret

This is six bytes shorter (0x16 vs. 0x1c bytes) than the function compiled with
the default vectorization settings. I'd expect this alone to make the
unvectorized version the preferred one for the function marked
__attribute__((cold)), but it isn't.
3. I would be surprised if the vectorized version of this code is actually
faster than the unvectorized one: it performs an actual memory load, which can
be slow if the accessed cache line is not already in a CPU cache; it has to
combine the two pointers; and the 16-byte xmm store may cross a cache-line
boundary, which is slow on old CPUs.
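
To make point 1 concrete, here is a rough C sketch (purely illustrative;
hidden_first_ptr and init_sketch are hypothetical names, not anything GCC
actually emits) of the extra indirection the vectorized code introduces: one of
the two addresses is effectively read from a hidden pointer constant that lives
in .data.rel.ro and needs a runtime relocation under -fpic, instead of being
materialized with a plain lea.

static int a, b;
/* Hidden pointer constant: under -fpic this lands in .data.rel.ro(.local)
   and requires a runtime relocation, even though &a could be computed
   with a PC-relative lea. */
static int *const hidden_first_ptr = &a;
void init_sketch(int *ptrs[]) {
    ptrs[0] = hidden_first_ptr;   /* loaded from memory, like the movq above */
    ptrs[1] = &b;                 /* computed with lea */
}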

Using distro versions of GCC shows that this behavior changed between GCC 13 and
GCC 14. With GCC 13 the output was:
   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7 <init_cold+0x7>
                        3: R_X86_64_PC32        .bss-0x4
   7:   48 8d 15 00 00 00 00    lea    0x0(%rip),%rdx        # e <init_cold+0xe>
                        a: R_X86_64_PC32        .bss
   e:   66 48 0f 6e c2          movq   %rdx,%xmm0
  13:   66 48 0f 6e c8          movq   %rax,%xmm1
  18:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  1c:   0f 11 07                movups %xmm0,(%rdi)
  1f:   c3                      ret
Still using a vector store, but without runtime relocations.
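
For comparison, a hand-written SSE2 intrinsics sketch along the lines of the
GCC 13 output could look like the following (init_intrin is a hypothetical name
and this is only a sketch of the idea, not code from the report): both addresses
are built in registers and stored with one unaligned 16-byte store, so no
pointer has to be placed in .data.rel.ro and no runtime relocation is needed.

#include <emmintrin.h>
#include <stdint.h>

static int a, b;

void init_intrin(int *ptrs[]) {
    /* Low 64 bits -> ptrs[0], high 64 bits -> ptrs[1] on little-endian x86-64. */
    __m128i v = _mm_set_epi64x((int64_t)(uintptr_t)&b,
                               (int64_t)(uintptr_t)&a);
    _mm_storeu_si128((__m128i *)ptrs, v);   /* one movups, as in the GCC 13 code */
}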

PS: If GCC thinks that using an xmm store is beneficial here, why does it not
initialize xmm1 the same way as xmm0 instead of going through rax?
