[Bug tree-optimization/120647] [X86] Sub optimal code generated for counting the number matches between two array elements

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 19 Jun 2025 09:48:56 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120647


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization
   Last reconfirmed|                            |2025-06-19
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |53947
             Target|                            |x86_64-*-*
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Conditional reduction could have a special case for when popcount
on the condition mask is available.  In principle the generated code isn't that
bad, but we unpack the mask from the vec<char> compare in order to perform a
.COND_ADD on vec<int>.  It might be more efficient to unpack a vec<char> of
zeros or ones and add it to four IVs or, in the case of constant niters (48
here), to choose a narrower counting IV (char) and only reduce to an int in the
epilogue.  That would get you the following when there's no popcount.  How
vector masks transfer to GPRs is a bit iffy at the moment (but it would work).

vector_comparison:
.LFB0:
        .cfi_startproc
        vmovdqu8        (%rsi), %ymm1
        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vpcmpeqb        (%rdi), %ymm1, %k1
        vpabsb  %ymm0, %ymm1{%k1}{z}
        vmovdqa %xmm1, %xmm2
        vextracti32x4   $0x1, %ymm1, %xmm1
        vpaddb  %xmm1, %xmm2, %xmm2
        vmovdqu8        32(%rsi), %xmm1
        vpcmpeqb        32(%rdi), %xmm1, %k1
        vpabsb  %xmm0, %xmm1
        vpaddb  %xmm1, %xmm2, %xmm2{%k1}
        vpsrldq $8, %xmm2, %xmm1
        vpaddb  %xmm1, %xmm2, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpsadbw %xmm1, %xmm0, %xmm0
        vpextrb $0, %xmm0, %eax
        movsbl  %al, %eax
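
For orientation, the three strategies discussed above can be sketched in C
roughly as follows.  The testcase itself is not quoted in this message, so the
function names, the unsigned char element type and the 16-lane layout are
assumptions made for illustration; only the trip count of 48 is taken from the
comment.  __builtin_popcountll is the GCC builtin.

```c
#include <stddef.h>
#include <stdint.h>

enum { N = 48 };      /* constant niters mentioned in the comment */
enum { LANES = 16 };  /* assumed: one 128-bit vector of char lanes */

/* Plain conditional reduction (hypothetical shape of the testcase). */
int count_matches(const uint8_t *a, const uint8_t *b)
{
    int count = 0;
    for (size_t i = 0; i < N; i++)
        if (a[i] == b[i])
            count++;
    return count;
}

/* Narrow counting IV: accumulate in char-wide lanes and widen to int
   only in the epilogue.  Each lane sees at most N / LANES = 3
   increments, so an 8-bit counter cannot overflow here. */
int count_matches_narrow(const uint8_t *a, const uint8_t *b)
{
    uint8_t lanes[LANES] = { 0 };
    for (size_t i = 0; i < N; i += LANES)
        for (size_t j = 0; j < LANES; j++)
            lanes[j] += (uint8_t)(a[i + j] == b[i + j]);

    int count = 0;
    for (size_t j = 0; j < LANES; j++)
        count += lanes[j];
    return count;
}

/* Popcount on the condition mask: collect one bit per comparison and
   count the set bits once.  N = 48 fits in a 64-bit mask. */
int count_matches_popcount(const uint8_t *a, const uint8_t *b)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < N; i++)
        mask |= (uint64_t)(a[i] == b[i]) << i;
    return __builtin_popcountll(mask);
}
```

All three functions return the same count; they differ only in how the
per-element comparison results are accumulated, which is the choice the
vectorizer faces.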


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
