https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
On Skylake (Coffee Lake, actually) with the same binary (built for Haswell),
the improvement is down to 5%.

On Haswell, when I replace just the second conditional move like this:

        vmovd  %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
...
        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vmovd   %eax, %xmm10
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)
...

This doesn't make a difference.  Replacing both conditional moves like this:

        movl    -8(%r8,%rcx,4), %esi
        addl    -8(%rdx,%rcx,4), %esi
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vmovd   %eax, %xmm10
        vmovd   %esi, %xmm11
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)

This improves the runtime to within 1% of fixing the regression.
I guess that's the best an insn-localized "fix" (providing an smax pattern for
SImode) would achieve here.  As expected, on Zen this localized "fix" is a loss
(an additional 11% regression, tested with the same Haswell-tuned binary),
while the full SSE variant is also a lot faster there - 35% compared to the
code with two cmovs and 17% compared to the good r272921 code.
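
For reference, a minimal scalar C sketch of the idiom the loop computes
(function and variable names as well as the loop shape are hypothetical,
reconstructed only from the asm above, not taken from the benchmark source);
each ternary is what if-conversion turns into one of the two cmovs and is
what an SImode smax pattern would have to match:

/* Hypothetical sketch: 'acc' plays the role of %eax, 'bias' the
   loop-invariant value kept in %ebx.  */
void
chain_max_scalar (int *out, const int *a, const int *b, int bias, int n)
{
  int acc = 0;
  for (int i = 0; i < n; i++)
    {
      int t = a[i] + b[i];
      acc = t >= acc ? t : acc;       /* cmpl/cmovge */
      acc = acc < bias ? bias : acc;  /* cmpl/cmovl  */
      out[i] = acc;
    }
}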

So it seems important to avoid crossing the GPR/XMM domain here.  When I
go through the stack for all three GPR<->XMM moves, the situation gets even
(much) worse.
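
To make the domain-crossing cost concrete, here is a hedged intrinsics sketch
(not from the PR, plain SSE4.1) of what the insn-localized smax effectively
expands to - every max bounces its operands GPR -> XMM and the result
XMM -> GPR again, so each vpmaxsd comes with vmovd traffic on both sides:

#include <smmintrin.h>          /* SSE4.1: _mm_max_epi32 */

/* Signed 32-bit max done in the XMM domain, with the operands coming
   from and the result going back to GPRs.  */
static inline int
smax_via_xmm (int x, int y)
{
  __m128i vx = _mm_cvtsi32_si128 (x);                  /* vmovd GPR -> XMM */
  __m128i vy = _mm_cvtsi32_si128 (y);                  /* vmovd GPR -> XMM */
  return _mm_cvtsi128_si32 (_mm_max_epi32 (vx, vy));   /* vpmaxsd + vmovd XMM -> GPR */
}

The "through the stack" variant mentioned above replaces each of those vmovd
moves with a store/reload pair.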

Overall, this means that enhancing STV to seed itself on conditional moves
that match max/min _and_ that can avoid GPR <-> XMM moves would be quite
a big win here.
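
As an illustration of the end state such an STV enhancement would aim for,
here is a hedged intrinsics sketch (hypothetical names, plain SSE4.1, not the
benchmark source) where the chained max stays in the XMM domain across the
whole loop and the result is stored directly from the XMM register, so the
accumulator never bounces back to a GPR inside the loop:

#include <smmintrin.h>          /* SSE4.1: _mm_max_epi32 */

/* All-XMM formulation of the scalar sketch above: the running value and
   the loop-invariant lower bound live in XMM registers across
   iterations.  */
void
chain_max_xmm (int *out, const int *a, const int *b, int bias, int n)
{
  __m128i vacc  = _mm_setzero_si128 ();
  __m128i vbias = _mm_cvtsi32_si128 (bias);  /* one vmovd, hoisted out of the loop */
  for (int i = 0; i < n; i++)
    {
      /* One GPR -> XMM move per iteration remains for the fresh summand;
         the add itself could alternatively be done in the XMM domain.  */
      __m128i t = _mm_cvtsi32_si128 (a[i] + b[i]);
      vacc = _mm_max_epi32 (t, vacc);          /* vpmaxsd */
      vacc = _mm_max_epi32 (vacc, vbias);      /* vpmaxsd */
      /* 32-bit store directly from the XMM register (the float punning
         is just to get a movss store for this illustration).  */
      _mm_store_ss ((float *) &out[i], _mm_castsi128_ps (vacc));
    }
}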
