[Bug middle-end/109849] suboptimal code for vector walking loop

hubicka at gcc dot gnu.org via Gcc-bugs Wed, 17 May 2023 07:53:43 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849


Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |109811
                 CC|                            |mjambor at suse dot cz

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Here is a slightly improved testcase which actually pushes onto the stack and
measures something. The loop in test() runs until cur.second exceeds 10000, and
main() calls test() 10000 times. It also makes the stack a local variable so
race conditions are not a problem.

#include <vector>
typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
        std::vector<std::pair<uint32_t, uint32_t>> stack;
        stack.push_back (pair);
        while (!stack.empty()) {
                std::pair<uint32_t, uint32_t> cur = stack.back();
                stack.pop_back();
                if (!cur.first)
                {
                        cur.second++;
                        stack.push_back (cur);
                }
                if (cur.second > 10000)
                        break;
        }
}
int
main()
{
        for (int i = 0; i < 10000; i++)
          test();
}

Clang code is about 2.8 times as fast (0.435s vs 1.198s elapsed):

jan@localhost:/tmp> clang++ -O2 tt.C  -fno-exceptions
jan@localhost:/tmp> g++ -O2 tt.C  -fno-exceptions -o a.out-gcc
jan@localhost:/tmp> perf stat ./a.out

 Performance counter stats for './a.out':

            434.24 msec task-clock:u                     #    0.997 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               129      page-faults:u                    #  297.073 /sec
     1,003,191,657      cycles:u                         #    2.310 GHz
            68,927      stalled-cycles-frontend:u        #    0.01% frontend cycles idle
       800,792,619      stalled-cycles-backend:u         #   79.82% backend cycles idle
     1,904,682,933      instructions:u                   #    1.90  insn per cycle
                                                         #    0.42  stalled cycles per insn
       500,912,196      branches:u                       #    1.154 G/sec
            23,144      branch-misses:u                  #    0.00% of all branches

       0.435340389 seconds time elapsed

       0.431409000 seconds user
       0.003994000 seconds sys


jan@localhost:/tmp> perf stat ./a.out-gcc

 Performance counter stats for './a.out-gcc':

          1,197.28 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               131      page-faults:u                    #  109.415 /sec
     2,903,995,656      cycles:u                         #    2.425 GHz
            86,204      stalled-cycles-frontend:u        #    0.00% frontend cycles idle
     2,690,907,052      stalled-cycles-backend:u         #   92.66% backend cycles idle
     2,005,212,311      instructions:u                   #    0.69  insn per cycle
                                                         #    1.34  stalled cycles per insn
       401,007,320      branches:u                       #  334.932 M/sec
            23,290      branch-misses:u                  #    0.01% of all branches

       1.198388186 seconds time elapsed

       1.198450000 seconds user
       0.000000000 seconds sys


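The problem seems to be, as in the first example, that we keep updating the
in-memory stack in the main loop: GCC reloads the vector's begin and end
pointers from 16(%rsp) and 24(%rsp) and stores the new end pointer back to
24(%rsp) on every iteration.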

.L39:
        movl    12(%rsp), %ebx
.L30:
        movq    16(%rsp), %rax
        cmpl    $10000, %ebx
        ja      .L33
.L40:
        movq    24(%rsp), %rdi
        cmpq    %rdi, %rax
        je      .L28
.L34:
        movq    -8(%rdi), %rax
        leaq    -8(%rdi), %rsi
        movq    %rsi, 24(%rsp)
        movq    %rax, 8(%rsp)
        testl   %eax, %eax
        jne     .L39

Clang, by contrast, keeps the vector's pointers in registers:

.LBB0_1:                                #   in Loop: Header=BB0_4 Depth=1
        movq    %rax, %r14
.LBB0_2:                                #   in Loop: Header=BB0_4 Depth=1
        movq    %rbx, %r12
        movq    %r12, %rbx
        cmpl    $10001, %r13d                   # imm = 0x2711
        jae     .LBB0_27
.LBB0_4:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_16 Depth 2
                                        #     Child Loop BB0_21 Depth 2
        cmpq    %r14, %rbx
        je      .LBB0_26
# %bb.5:                                #   in Loop: Header=BB0_4 Depth=1
        leaq    -8(%r14), %rax
        movq    -8(%r14), %rcx
        movq    %rcx, %r13
        shrq    $32, %r13
        testl   %ecx, %ecx
        jne     .LBB0_1
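
For illustration only, here is a hand-written sketch of the same loop with the
stack pointers held in local variables, which is roughly the shape one would
like the compiler to reach by itself. The function name walk_with_local_stack,
the Pair alias and the fixed-size buffer are all hypothetical and not part of
the testcase; the buffer is a simplifying assumption that only works because
the stack depth never exceeds one element here, so it is not a replacement for
std::vector. The function returns the last value of cur.second so the result
can be checked.

```cpp
#include <cstdint>
#include <utility>

using Pair = std::pair<std::uint32_t, std::uint32_t>;

// Hypothetical variant of test(): same control flow, but the stack lives in a
// small local buffer and its begin/end pointers are plain locals that can stay
// in registers. Assumption: depth never exceeds the buffer size (it is at most
// 1 for this loop).
std::uint32_t
walk_with_local_stack (Pair start)
{
        Pair storage[4];
        Pair *begin = storage;
        Pair *end = storage;
        std::uint32_t last = start.second;

        *end++ = start;                 // push_back (start)
        while (end != begin) {          // !stack.empty ()
                Pair cur = *--end;      // back () followed by pop_back ()
                if (!cur.first)
                {
                        cur.second++;
                        *end++ = cur;   // push_back (cur)
                }
                last = cur.second;
                if (cur.second > 10000)
                        break;
        }
        return last;
}
```

Comparing the code GCC generates for this variant with the std::vector version
may help to tell whether the extra loads and stores come from the vector
abstraction or from the loop itself.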


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811
[Bug 109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
