https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |109811
                 CC|                            |mjambor at suse dot cz

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Here is a slightly improved testcase which actually pushes onto the stack and
measures something. The test loops roughly 10000 times and returns. It also
makes the stack a local variable, so race conditions are not a problem.

#include <vector>
typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
        std::vector<std::pair<uint32_t, uint32_t>> stack;
        stack.push_back (pair);
        while (!stack.empty()) {
                std::pair<uint32_t, uint32_t> cur = stack.back();
                stack.pop_back();
                if (!cur.first)
                {
                        cur.second++;
                        stack.push_back (cur);
                }
                if (cur.second > 10000)
                        break;
        }
}
int
main()
{
        for (int i = 0; i < 10000; i++)
          test();
}

Clang code is more than twice as fast:

jan@localhost:/tmp> clang++ -O2 tt.C -fno-exceptions
jan@localhost:/tmp> g++ -O2 tt.C -fno-exceptions -o a.out-gcc
jan@localhost:/tmp> perf stat ./a.out

 Performance counter stats for './a.out':

            434.24 msec task-clock:u              #    0.997 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               129      page-faults:u             #  297.073 /sec
     1,003,191,657      cycles:u                  #    2.310 GHz
            68,927      stalled-cycles-frontend:u #    0.01% frontend cycles idle
       800,792,619      stalled-cycles-backend:u  #   79.82% backend cycles idle
     1,904,682,933      instructions:u            #    1.90  insn per cycle
                                                  #    0.42  stalled cycles per insn
       500,912,196      branches:u                #    1.154 G/sec
            23,144      branch-misses:u           #    0.00% of all branches

       0.435340389 seconds time elapsed

       0.431409000 seconds user
       0.003994000 seconds sys

jan@localhost:/tmp> perf stat ./a.out-gcc

 Performance counter stats for './a.out-gcc':

          1,197.28 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               131      page-faults:u             #  109.415 /sec
     2,903,995,656      cycles:u                  #    2.425 GHz
            86,204      stalled-cycles-frontend:u #    0.00% frontend cycles idle
     2,690,907,052      stalled-cycles-backend:u  #   92.66% backend cycles idle
     2,005,212,311      instructions:u            #    0.69  insn per cycle
                                                  #    1.34  stalled cycles per insn
       401,007,320      branches:u                #  334.932 M/sec
            23,290      branch-misses:u           #    0.01% of all branches

       1.198388186 seconds time elapsed

       1.198450000 seconds user
       0.000000000 seconds sys

The problem seems to be, as in the first example, that we keep updating the
in-memory stack in the main loop.

.L39:
        movl    12(%rsp), %ebx
.L30:
        movq    16(%rsp), %rax
        cmpl    $10000, %ebx
        ja      .L33
.L40:
        movq    24(%rsp), %rdi
        cmpq    %rdi, %rax
        je      .L28
.L34:
        movq    -8(%rdi), %rax
        leaq    -8(%rdi), %rsi
        movq    %rsi, 24(%rsp)
        movq    %rax, 8(%rsp)
        testl   %eax, %eax
        jne     .L39

While clang does:

.LBB0_1:                                #   in Loop: Header=BB0_4 Depth=1
        movq    %rax, %r14
.LBB0_2:                                #   in Loop: Header=BB0_4 Depth=1
        movq    %rbx, %r12
        movq    %r12, %rbx
        cmpl    $10001, %r13d                   # imm = 0x2711
        jae     .LBB0_27
.LBB0_4:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_16 Depth 2
                                        #     Child Loop BB0_21 Depth 2
        cmpq    %r14, %rbx
        je      .LBB0_26
# %bb.5:                                #   in Loop: Header=BB0_4 Depth=1
        leaq    -8(%r14), %rax
        movq    -8(%r14), %rcx
        movq    %rcx, %r13
        shrq    $32, %r13
        testl   %ecx, %ecx
        jne     .LBB0_1


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811
[Bug 109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
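
As a rough illustration of the difference between the two code sequences in
comment #6 (this is not from the bug report and is not a proposed fix), the
loop can be written by hand so that the top-of-stack pointer is a local
variable, which is approximately the shape of the clang output, where the
pointer stays in %r14. The names run_with_vector and run_with_local_top, the
iteration counter, and the one-element buffer are assumptions made only for
this sketch.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using Pair = std::pair<std::uint32_t, std::uint32_t>;

// Same loop as the testcase, written against std::vector.
// Returns the number of loop iterations executed.
std::uint32_t run_with_vector(Pair start)
{
  std::vector<Pair> stack;
  stack.push_back(start);
  std::uint32_t iterations = 0;
  while (!stack.empty())
    {
      Pair cur = stack.back();
      stack.pop_back();
      ++iterations;
      if (!cur.first)
        {
          cur.second++;
          stack.push_back(cur);
        }
      if (cur.second > 10000)
        break;
    }
  return iterations;
}

// Hypothetical hand-written variant: the top pointer is a plain local, so a
// compiler may keep it in a register rather than storing it on every pop.
// One slot is enough because this loop never holds more than one entry.
std::uint32_t run_with_local_top(Pair start)
{
  Pair buffer[1];
  Pair *begin = buffer;
  Pair *top = buffer;
  *top++ = start;
  std::uint32_t iterations = 0;
  while (top != begin)
    {
      Pair cur = *--top;
      ++iterations;
      if (!cur.first)
        {
          cur.second++;
          *top++ = cur;
        }
      if (cur.second > 10000)
        break;
    }
  return iterations;
}
```

Both functions perform the same sequence of pushes and pops, so comparing the
code generated for each under the same flags as above is one way to see
whether the extra stores come from the vector's end pointer living in memory.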