https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122118

            Bug ID: 122118
           Summary: Combining writes causes unnecessary runtime
                    relocations
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andreas.rheinhardt at outlook dot com
  Target Milestone: ---

Created attachment 62481
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62481&action=edit
small file with the same testcase as the bug report

static int a, b;
void init(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}
__attribute__((cold)) void init_cold(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}

When the above code is compiled for x64 with GCC from today (commit
c6865e7e15bc9a1337df00d2ca03604e1712a2dd) using -fpic -c -O3, the result is:

Disassembly of section .text:

0000000000000000 <init>:
   0:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 8 <init+0x8>
   7:   00 
                        4: R_X86_64_PC32        .data.rel.ro.local-0x4
   8:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # f <init+0xf>
                        b: R_X86_64_PC32        .bss-0x4
   f:   66 48 0f 6e c8          movq   %rax,%xmm1
  14:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  18:   0f 11 07                movups %xmm0,(%rdi)
  1b:   c3                      ret

Disassembly of section .text.unlikely:

0000000000000000 <init_cold>:
   0:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 8 <init_cold+0x8>
   7:   00 
                        4: R_X86_64_PC32        .data.rel.ro.local-0x4
   8:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # f <init_cold+0xf>
                        b: R_X86_64_PC32        .bss-0x4
   f:   66 48 0f 6e c8          movq   %rax,%xmm1
  14:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  18:   0f 11 07                movups %xmm0,(%rdi)
  1b:   c3                      ret

As can be seen, GCC combines the two writes into a single vector store, and this
causes several issues:
1. The most important one is that the first load cannot be relaxed to a lea at
all. GCC puts a pointer into .data.rel.ro and adds a runtime relocation for it,
both of which are of course completely unnecessary. (Given that a and b are
static here, even generating code that merely needs relaxing by the linker would
be suboptimal in itself.) A rough C sketch of what this amounts to is included
after point 3 below.
2. When -fno-tree-vectorize is additionally used, the stores are not combined
and the code for both functions looks like this:

   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7 <init_cold+0x7>
                        3: R_X86_64_PC32        .bss
   7:   48 89 07                mov    %rax,(%rdi)
   a:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 11 <init_cold+0x11>
                        d: R_X86_64_PC32        .bss-0x4
  11:   48 89 47 08             mov    %rax,0x8(%rdi)
  15:   c3                      ret

This is six bytes shorter (0x16 vs. 0x1c bytes) than the function compiled with
the default vectorization settings. I'd expect this alone to make the
unvectorized version the preferred one for the function marked
__attribute__((cold)), but it isn't.
3. I would be surprised if the vectorized version of this code is actually
faster than the unvectorized one: it performs an actual memory load, which can
be slow if the accessed cache line is not already in a CPU cache; it has to
combine the two pointers; and the 16-byte xmm store may cross a cache-line
boundary, which is slow on old CPUs.
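
To make point 1 concrete, here is a rough C sketch (purely illustrative;
hidden_first_ptr and init_sketch are hypothetical names, not anything GCC
actually emits) of the extra indirection the vectorized code introduces: one of
the two addresses is effectively read from a hidden pointer constant that lives
in .data.rel.ro and needs a runtime relocation under -fpic, instead of being
materialized with a plain lea.

static int a, b;
/* Hidden pointer constant: under -fpic this lands in .data.rel.ro(.local)
   and requires a runtime relocation, even though &a could be computed
   with a PC-relative lea. */
static int *const hidden_first_ptr = &a;
void init_sketch(int *ptrs[]) {
    ptrs[0] = hidden_first_ptr;   /* loaded from memory, like the movq above */
    ptrs[1] = &b;                 /* computed with lea */
}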

Using distro versions of GCC shows that this behavior changed between GCC 13 and
GCC 14. With GCC 13 the output was:
   0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7 <init_cold+0x7>
                        3: R_X86_64_PC32        .bss-0x4
   7:   48 8d 15 00 00 00 00    lea    0x0(%rip),%rdx        # e <init_cold+0xe>
                        a: R_X86_64_PC32        .bss
   e:   66 48 0f 6e c2          movq   %rdx,%xmm0
  13:   66 48 0f 6e c8          movq   %rax,%xmm1
  18:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  1c:   0f 11 07                movups %xmm0,(%rdi)
  1f:   c3                      ret
Still using a vector store, but without runtime relocations.
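
For comparison, a hand-written SSE2 intrinsics sketch along the lines of the
GCC 13 output could look like the following (init_intrin is a hypothetical name
and this is only a sketch of the idea, not code from the report): both addresses
are built in registers and stored with one unaligned 16-byte store, so no
pointer has to be placed in .data.rel.ro and no runtime relocation is needed.

#include <emmintrin.h>
#include <stdint.h>

static int a, b;

void init_intrin(int *ptrs[]) {
    /* Low 64 bits -> ptrs[0], high 64 bits -> ptrs[1] on little-endian x86-64. */
    __m128i v = _mm_set_epi64x((int64_t)(uintptr_t)&b,
                               (int64_t)(uintptr_t)&a);
    _mm_storeu_si128((__m128i *)ptrs, v);   /* one movups, as in the GCC 13 code */
}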

PS: If GCC thinks that using an xmm store is beneficial here, why does it not
initialize xmm1 the same way as xmm0 instead of going through rax?
