https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89557
Bug ID: 89557
Summary: [7/8 regression] 4*movq to 2*movaps IPC performance regression on znver1
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: 0xe2.0x9a.0x9b at gmail dot com
Target Milestone: ---
Approximate C++ source code:
    struct __attribute__((aligned(16))) A {
        union {
            struct {
                uint64_t a;
                double b;
            };
            uint64_t data[2];
        };
    };

    A a;
    a.a = 2;
    a.b = x*y;
    return a;
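For anyone who wants to inspect the generated assembly, the fragment above could be wrapped into a compilable unit along the following lines. This is only a sketch: the function names `make_a` and `fill`, their parameters, and the loop are assumptions made for illustration, since the report does not show the enclosing function or its caller. The anonymous struct inside the union relies on a GNU extension, as in the original fragment. Depending on inlining and optimization level, this may not reproduce the exact instruction sequences listed below.

```cpp
#include <cstddef>
#include <cstdint>

// Same layout as in the report: 16-byte aligned, two 8-byte members
// aliased by a two-element array.
struct __attribute__((aligned(16))) A {
    union {
        struct {
            std::uint64_t a;
            double b;
        };
        std::uint64_t data[2];
    };
};

// Hypothetical wrapper around the fragment from the report.
A make_a(double x, double y) {
    A a;
    a.a = 2;
    a.b = x * y;
    return a;
}

// Hypothetical caller: each result is built and then copied into out[i],
// which is the kind of 16-byte copy the assembly listings show.
void fill(A* out, std::size_t n, double x, double y) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = make_a(x, y + static_cast<double>(i));
    }
}
```

Compiling this with `-S`, with and without `-mtune=native`, is one way to compare the copy sequences each compiler version chooses.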
CPU: AMD Ryzen 5 1600 Six-Core Processor
GCC 7.4.0 generates (no -march/mtune):
movq $2, 0x80(%rsp)
movsd %xmm0, 0x88(%rsp)
mov 0x80(%rsp), %rax
mov 0x88(%rsp), %rdx
mov %rax, 0x30(%rsp)
mov %rdx, 0x38(%rsp)
GCC 7.4.0 generates (no -march, -mtune=native):
movq $2, 0x80(%rsp)
movsd %xmm0, 0x88(%rsp)
movaps 0x80(%rsp), %xmm6
movaps %xmm6, 0x30(%rsp)
GCC 8.2.0 generates (no -march/mtune):
movq $2, 0x80(%rsp)
movsd %xmm0, 0x88(%rsp)
movdqa 0x80(%rsp), %xmm6
movaps %xmm6, 0x30(%rsp)
GCC 8.2.0 generates (no -march, -mtune=native):
movq $2, 0x80(%rsp)
movsd %xmm0, 0x88(%rsp)
movaps 0x80(%rsp), %xmm6
movaps %xmm6, 0x30(%rsp)
IPC of an executable which uses the above code (perf stat):
GCC 7.4.0 (no -march/mtune):
617.233116 task-clock (msec) # 0.997 CPUs utilized
4,139,124,553 instructions # 1.94 insn per cycle
GCC 7.4.0 (no -march, -mtune=native):
1106.252920 task-clock (msec) # 1.000 CPUs utilized
3,995,268,509 instructions # 1.02 insn per cycle
GCC 8.2.0 (no -march/mtune):
1096.852485 task-clock (msec) # 1.000 CPUs utilized
3,790,839,401 instructions # 0.97 insn per cycle
GCC 8.2.0 (no -march, -mtune=native):
1105.693441 task-clock (msec) # 1.000 CPUs utilized
4,041,957,928 instructions # 1.04 insn per cycle
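To put the perf stat figures above in proportion, the following sketch derives the cycle count implied by each run (instructions divided by IPC) and the run time relative to the GCC 7.4.0 default-tuning build. The struct, array, and function names are made up for this illustration; the numbers are copied from the output above, and the IPC values are the rounded ones perf printed, so the derived cycle counts are approximate.

```cpp
#include <cstdio>

// One run from the perf stat output quoted in the report.
struct PerfRun {
    const char* label;
    double task_clock_ms;  // task-clock (msec)
    double instructions;   // instructions
    double ipc;            // insn per cycle, as rounded by perf
};

// First entry is the baseline (GCC 7.4.0, no -march/-mtune).
constexpr PerfRun kRuns[] = {
    {"GCC 7.4.0, no -march/-mtune", 617.233116, 4139124553.0, 1.94},
    {"GCC 7.4.0, -mtune=native", 1106.252920, 3995268509.0, 1.02},
    {"GCC 8.2.0, no -march/-mtune", 1096.852485, 3790839401.0, 0.97},
    {"GCC 8.2.0, -mtune=native", 1105.693441, 4041957928.0, 1.04},
};

// Cycles implied by the reported figures: cycles = instructions / IPC.
constexpr double implied_cycles(const PerfRun& r) {
    return r.instructions / r.ipc;
}

// Run time of r relative to base (values above 1 mean r is slower).
constexpr double slowdown(const PerfRun& r, const PerfRun& base) {
    return r.task_clock_ms / base.task_clock_ms;
}

// Prints one line per run with the derived values.
void print_summary() {
    for (const PerfRun& r : kRuns) {
        std::printf("%-30s cycles ~ %.3g, time vs baseline: %.2fx\n",
                    r.label, implied_cycles(r), slowdown(r, kRuns[0]));
    }
}
```

Calling `print_summary()` from `main` prints the derived values. The instruction counts differ by less than 10% between the four runs, so nearly all of the run-time difference comes from the lower IPC rather than from extra instructions.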
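Summary: Copying the 16-byte struct with one vector load/store pair (2*movaps, or movdqa + movaps) instead of 4*mov roughly halves IPC on znver1 CPUs (from 1.94 to about 1.0) and increases run time from about 617 ms to about 1100 ms.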