https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
On Skylake (Coffeelake actually) with the same binary (built for Haswell),
the improvement is down to 5%.

On Haswell, when I just replace the second conditional move, like

        vmovd   %ebx, %xmm12
        .p2align 4,,10
        .p2align 3
.L34:
        ...
        cmpl    %eax, %esi
        cmovge  %esi, %eax
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vmovd   %eax, %xmm10
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)
        ...

this doesn't make a difference.  Replacing both, like

        movl    -8(%r8,%rcx,4), %esi
        addl    -8(%rdx,%rcx,4), %esi
#       cmpl    %eax, %esi
#       cmovge  %esi, %eax
        vmovd   %eax, %xmm10
        vmovd   %esi, %xmm11
        vpmaxsd %xmm11, %xmm10, %xmm10
        movl    %ecx, %esi
#       cmpl    %ebx, %eax
#       cmovl   %ebx, %eax
        vpmaxsd %xmm12, %xmm10, %xmm10
        vmovd   %xmm10, %eax
        movl    %eax, -4(%r13,%rcx,4)

makes the runtime improve to within 1% of fixing the regression.  I guess
that's the best an insn-localized "fix" (providing an smax pattern for
SImode) would get to here.

As expected, on Zen this localized "fix" is a loss (an additional 11%
regression, tested with the same Haswell-tuned binary), while the full SSE
variant is also a lot faster - 35% compared to the code with two cmovs and
17% compared to the good r272921 code.  So it seems important to avoid
crossing the GPR/XMM domain here.  When I go through the stack for all three
GPR<->XMM moves the situation gets even (much) worse.

Overall this means that enhancing STV to seed itself on conditional moves
that match max/min _and_ that can avoid GPR<->XMM moves would be quite a big
win here.
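
For reference, here is a minimal C sketch of the kind of running-max
recurrence the assembly above corresponds to.  This is only an illustration,
not the benchmark source from this PR; the function names, array arguments
and the "bias" operand are made up.  The scalar form is what tends to get
compiled to the two-cmov sequence, and the SSE form (via pmaxsd intrinsics,
needs -msse4.1) shows how keeping the accumulator in an XMM register avoids
the GPR<->XMM crossings measured above.

/* Illustrative sketch only, assumptions as noted above.  */
#include <immintrin.h>

/* Scalar form: GCC typically emits cmovge/cmovl for the two maxima.  */
int running_max_scalar (const int *a, const int *b, int *out, int n, int bias)
{
  int acc = 0;
  for (int i = 1; i < n; i++)
    {
      int t = a[i - 1] + b[i - 1];
      if (acc < t)
        acc = t;            /* first max  -> cmovge          */
      if (acc < bias)
        acc = bias;         /* second max -> cmovl           */
      out[i] = acc;
    }
  return acc;
}

/* SSE form: both maxima stay in XMM via pmaxsd, no domain crossing
   inside the max chain (only the load/store touch GPRs/memory).  */
int running_max_sse (const int *a, const int *b, int *out, int n, int bias)
{
  __m128i acc   = _mm_setzero_si128 ();
  __m128i vbias = _mm_cvtsi32_si128 (bias);
  for (int i = 1; i < n; i++)
    {
      __m128i t = _mm_cvtsi32_si128 (a[i - 1] + b[i - 1]);
      acc = _mm_max_epi32 (acc, t);      /* vpmaxsd */
      acc = _mm_max_epi32 (acc, vbias);  /* vpmaxsd */
      out[i] = _mm_cvtsi128_si32 (acc);
    }
  return _mm_cvtsi128_si32 (acc);
}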