On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch tweaks the way that min and max are expanded, so that the > semantics of these operations are visible to the early RTL optimization > passes, until split into explicit comparison and conditional move > instructions. The good news is that i386.md already contains all of > the required logic (many thanks to Richard Biener and Uros Bizjak), > but this is currently only to enable scalar-to-vector (STV) synthesis > of min/max instructions. This change enables this functionality for > all TARGET_CMOVE architectures for SImode, HImode and DImode. > > My first attempt to support "min(reg,reg)" as an instruction revealed > why this functionality isn't already enabled: In PR rtl-optimization 91154 > we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c > which has poor performance on AMD Opteron. My solution to this is to > actually support "min(reg,general_operand)" allowing us to inspect > any immediate constant at the point this operation is split. This > allows us to use "test" instructions for min/max against 0, 1 and -1. > As an added benefit it allows us to CSE large immediate constants, > reducing code size. > > Previously, "int foo(int x) { return max(x,12345); }" would generate: > > foo: cmpl $12345, %edi > movl $12345, %eax > cmovge %edi, %eax > ret > > with this patch we instead generate: > > foo: movl $12345, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > I've also included a peephole2 to avoid the "movl $0,%eax" instructions > being produced by the register allocator. Materializing constants as > late as possible reduces register pressure, but for const0_rtx on x86, > it's preferable to use "xor" by moving this set from between a flags > setting operation and its use. This should also help instruction macro > fusion on some microarchitectures, where test/cmp and the following > instruction can sometimes be combined. > > Previously, "int foo(int x) { return max(x,0); }" would generate: > > foo: testl %edi, %edi > movl $0, %eax > cmovns %edi, %eax > ret > > with this patch we instead generate: > foo: xorl %eax, %eax > testl %edi, %edi > cmovns %edi, %eax > ret > > The benefits of having min/max explicit at the RTL level can be seen > from compiling the following example with "-O2 -fno-tree-reassoc". > > > #define max(a,b) (((a) > (b))? (a) : (b)) > > int foo(int x) > { > int y = max(x,5); > return max(y,10); > } > > where GCC currently produces: > > foo: cmpl $5, %edi > movl $5, %eax > movl $10, %edx > cmovge %edi, %eax > cmpl $10, %eax > cmovl %edx, %eax > ret > > and with this patch it instead now produces: > > foo: movl $10, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > The original motivation was from looking at a performance critical > function in a quantum mechanics code, that performed MIN_EXPR and > MAX_EXPR of the same arguments (effectively a two-element sort), > where GCC was performing the comparison twice. I'd hoped that it > might be possible to fuse these together, perhaps in combine, but > this patch is an intermediate step towards that goal. > > This patch has been tested on x86_64-pc-linux-gnu with a make > bootstrap followed by make -k check with no new regressions. > Ok for mainline? > > > 2020-07-30 Roger Sayle <ro...@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. > > > Many thanks in advance. Almost all of the credit goes to Uros and > Richard for already implementing this feature, I've just copied the > transformations from optab expansion, that allow it to be enabled > without penalty (this late in the compilation).
LGTM, modulo the below part: +;; Avoid clearing a register between a flags setting comparison and its use, +;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax". +(define_peephole2 + [(set (reg FLAGS_REG) (match_operand 0)) + (set (match_operand:SWI48 1 "register_operand") (const_int 0))] + "peep2_regno_dead_p (0, FLAGS_REG) + && !reg_overlap_mentioned_p (operands[1], operands[0])" + [(parallel [(set (match_dup 1) (const_int 0)) + (clobber (reg:CC FLAGS_REG))]) + (set (match_dup 2) (match_dup 0))] +{ + operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG); +}) Please use ix86_expand_clear (operands[1]); in the preparation statement instead of emitting RTX pattern. This will take TARGET_USE_MOV0 and size optimizations into account. Uros.