On Sun, Jun 30, 2024 at 9:09 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> Hi Uros,
>
> > On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle
> > <ro...@nextmovesoftware.com> wrote:
> > > A common idiom for implementing an integer division that rounds
> > > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > > additions to form the numerator can be performed by a single lea
> > > instruction, and indeed gcc currently generates a lea when x and y
> > > are both registers.
> > >
> > > int foo(int x, int y) {
> > >   return (x+y-1)/y;
> > > }
> > >
> > > generates with -O2:
> > >
> > > foo:    leal    -1(%rsi,%rdi), %eax   // 4 bytes
> > >         cltd
> > >         idivl   %esi
> > >         ret
> > >
> > > Oddly, however, if x is a memory operand, gcc currently uses two
> > > instructions:
> > >
> > > int m;
> > > int bar(int y) {
> > >   return (m+y-1)/y;
> > > }
> > >
> > > generates:
> > >
> > > bar:    movl    m(%rip), %eax
> > >         addl    %edi, %eax            // 2 bytes
> > >         subl    $1, %eax              // 3 bytes
> > >         cltd
> > >         idivl   %edi
> > >         ret
> > >
> > > This discrepancy is caused by the late decision (in peephole2) to
> > > split an addition with a memory operand into a load followed by a
> > > reg-reg addition.  This patch improves the situation by adding a
> > > peephole2 to recognize consecutive additions and transform them into
> > > a lea when profitable.
> > >
> > > My first attempt at fixing this was to use a define_insn_and_split:
> > >
> > > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> > >   [(set (match_operand:SWI48 0 "register_operand")
> > >         (plus:SWI48
> > >           (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> > >                       (match_operand:SWI48 2 "memory_operand"))
> > >           (match_operand:SWI48 3 "x86_64_immediate_operand")))]
> > >   "ix86_pre_reload_split ()"
> > >   "#"
> > >   "&& 1"
> > >   [(set (match_dup 4) (match_dup 2))
> > >    (set (match_dup 0)
> > >         (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> > >                     (match_dup 3)))]
> > >   "operands[4] = gen_reg_rtx (<MODE>mode);")
> > >
> > > relying on combine to merge the instructions.
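As an aside on the quoted idiom: (x + y - 1) / y matches ceiling division for non-negative x and positive y. A minimal standalone check, not part of the patch (all names here are illustrative):

```c
#include <assert.h>

/* The idiom from the patch description: round-up integer division.  */
static int foo (int x, int y)
{
  return (x + y - 1) / y;
}

/* Reference: C's truncating division, bumped by one when there is a
   nonzero remainder.  Valid for x >= 0, y > 0.  */
static int ceil_div (int x, int y)
{
  return x / y + (x % y != 0);
}
```

Note the idiom can overflow when x + y - 1 exceeds INT_MAX, which ceil_div avoids; for the address-generation discussion above that caveat is irrelevant.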
> > > Unfortunately, this approach
> > > interferes with (reload's) subtle balance of deciding when to
> > > use/avoid lea, which can be observed as a code size regression in
> > > CSiBE.  The peephole2 approach (proposed here) uniformly improves
> > > CSiBE results.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32},
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2024-06-29  Roger Sayle  <ro...@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >     * config/i386/i386.md (peephole2): Transform two consecutive
> > >     additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> > >
> > > gcc/testsuite/ChangeLog
> > >     * gcc.target/i386/lea-3.c: New test case.
> >
> > Is the assumption that one LEA is always faster than two ADD
> > instructions universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> >
> > Please note the ix86_lea_outperforms predicate and its uses in
> > ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> > ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok().  IMO,
> > !ix86_avoid_lea_for_addr() should be used here, but I didn't check
> > it thoroughly.
> >
> > The function comment of ix86_avoid_lea_for_addr() says:
> >
> > /* Return true if we need to split lea into a sequence of
> >    instructions to avoid AGU stalls during peephole2. */
> >
> > And your peephole tries to reverse the above split.
>
> I completely agree that understanding when/why i386.md converts
> an lea into a sequence of additions (and avoiding reversing this split)
> is vitally important to understanding my patch.  You're quite right that
> the logic governing this ultimately calls ix86_lea_outperforms, but as
> I'll explain below the shape of those APIs (requiring an insn) is not as
> convenient for instruction merging as for splitting.
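The new gcc.target/i386/lea-3.c mentioned in the ChangeLog is not shown in the thread; presumably it has roughly the following shape (a sketch only — the dejagnu directives and the exact scan pattern are guesses, not the committed test):

```c
/* { dg-do compile } */
/* { dg-options "-O2" } */

int m;

/* With the proposed peephole2, the add+sub pair forming the numerator
   should be merged back into a single 3-component lea.  */
int bar (int y)
{
  return (m + y - 1) / y;
}

/* { dg-final { scan-assembler "leal" } } */
```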
>
> The current location in i386.md where it decides whether the
> lea in the foo example above needs to be split is at line 6293:
>
> (define_peephole2
>   [(set (match_operand:SWI48 0 "register_operand")
>         (match_operand:SWI48 1 "address_no_seg_operand"))]
>   "ix86_hardreg_mov_ok (operands[0], operands[1])
>    && peep2_regno_dead_p (0, FLAGS_REG)
>    && ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
>   ...
>
> Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
> returns true, so by symmetry it is not unreasonable to turn
> add+add->lea when ix86_avoid_lea_for_addr would return false.  The
> relevant part of ix86_avoid_lea_for_addr is then around line 15974
> of i386.cc:
>
>   /* Check we need to optimize. */
>   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
>     return false;
>
> which you'll recognize is precisely the condition under which my
> proposed peephole2 fires.  Technically, we also know that this is
> a 3-component lea, "parts.disp" is non-zero, the addition is destructive
> (overwriting one of the input register operands), and that we're not
> using a partial register mode (such as HImode).
>
> I won't argue that my peephole2 is "obviously" correct, but the
> observation that it correctly reduces code size when -Os is specified
> indicates that something is broken/inconsistent/missing in the current
> logic, and that this change is an improvement.
>
> Perhaps the TARGET_AVOID_LEA_FOR_ADDR check in ix86_avoid_lea_for_addr
> is incorrect; if so, please feel free to file a bugzilla PR.
>
> Ok for mainline?
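The patch itself is not quoted in this part of the thread. From the description (two consecutive destructive additions merged into a 3-component lea, gated on the same condition that makes ix86_avoid_lea_for_addr return false early), the new define_peephole2 presumably looks roughly like the sketch below — predicates, operand constraints, and the exact guard in the committed patch may well differ:

```
(define_peephole2
  [(parallel [(set (match_operand:SWI48 0 "register_operand")
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 1 "register_operand")))
              (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 2 "x86_64_immediate_operand")))
              (clobber (reg:CC FLAGS_REG))])]
  ;; Mirror of the early-exit in ix86_avoid_lea_for_addr.
  "!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)"
  [(set (match_dup 0)
        (plus:SWI48 (plus:SWI48 (match_dup 0) (match_dup 1))
                    (match_dup 2)))])
```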
OK.

Thanks,
Uros.