On Sun, Jun 30, 2024 at 9:09 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> Hi Uros,
>
> > On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle
> > <ro...@nextmovesoftware.com> wrote:
> > > A common idiom for implementing an integer division that rounds
> > > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > > additions to form the numerator can be performed by a single lea
> > > instruction, and indeed gcc currently generates a lea when x and y
> > > are both registers.
> > >
> > > int foo(int x, int y) {
> > >   return (x+y-1)/y;
> > > }
> > >
> > > generates with -O2:
> > >
> > > foo:    leal    -1(%rsi,%rdi), %eax   // 4 bytes
> > >         cltd
> > >         idivl   %esi
> > >         ret
> > >
> > > Oddly, however, if x is a memory operand, gcc currently uses two
> > > instructions:
> > >
> > > int m;
> > > int bar(int y) {
> > >   return (m+y-1)/y;
> > > }
> > >
> > > generates:
> > >
> > > bar:    movl    m(%rip), %eax
> > >         addl    %edi, %eax            // 2 bytes
> > >         subl    $1, %eax              // 3 bytes
> > >         cltd
> > >         idivl   %edi
> > >         ret
> > >
> > > This discrepancy is caused by the late decision (in peephole2) to
> > > split an addition with a memory operand into a load followed by a
> > > reg-reg addition.  This patch improves the situation by adding a
> > > peephole2 to recognize consecutive additions and transform them into
> > > a lea when profitable.
> > >
> > > My first attempt at fixing this was to use a define_insn_and_split:
> > >
> > > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> > >   [(set (match_operand:SWI48 0 "register_operand")
> > >         (plus:SWI48
> > >           (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> > >                       (match_operand:SWI48 2 "memory_operand"))
> > >           (match_operand:SWI48 3 "x86_64_immediate_operand")))]
> > >   "ix86_pre_reload_split ()"
> > >   "#"
> > >   "&& 1"
> > >   [(set (match_dup 4) (match_dup 2))
> > >    (set (match_dup 0)
> > >         (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> > >                     (match_dup 3)))]
> > >   "operands[4] = gen_reg_rtx (<MODE>mode);")
> > >
> > > relying on combine to merge the instructions.
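As an aside on the quoted idiom: (x + y - 1) / y matches ceiling division for non-negative x and positive y. A minimal standalone check, not part of the patch (all names here are illustrative):

```c
#include <assert.h>

/* The idiom from the patch description: round-up integer division.  */
static int foo (int x, int y)
{
  return (x + y - 1) / y;
}

/* Reference: C's truncating division, bumped by one when there is a
   nonzero remainder.  Valid for x >= 0, y > 0.  */
static int ceil_div (int x, int y)
{
  return x / y + (x % y != 0);
}
```

Note the idiom can overflow when x + y - 1 exceeds INT_MAX, which ceil_div avoids; for the address-generation discussion above that caveat is irrelevant.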
> > > Unfortunately, this approach
> > > interferes with (reload's) subtle balance of deciding when to
> > > use/avoid lea, which can be observed as a code size regression in
> > > CSiBE.  The peephole2 approach (proposed here) uniformly improves
> > > CSiBE results.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32},
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2024-06-29  Roger Sayle  <ro...@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >     * config/i386/i386.md (peephole2): Transform two consecutive
> > >     additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> > >
> > > gcc/testsuite/ChangeLog
> > >     * gcc.target/i386/lea-3.c: New test case.
> >
> > Is the assumption that one LEA is always faster than two ADD
> > instructions universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> >
> > Please note the ix86_lea_outperforms predicate and its uses in
> > ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> > ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok().  IMO,
> > !ix86_avoid_lea_for_addr() should be used here, but I didn't check
> > it thoroughly.
> >
> > The function comment of ix86_avoid_lea_for_addr() says:
> >
> > /* Return true if we need to split lea into a sequence of
> >    instructions to avoid AGU stalls during peephole2. */
> >
> > And your peephole tries to reverse the above split.
>
> I completely agree that understanding when/why i386.md converts
> an lea into a sequence of additions (and avoiding reversing this split)
> is vitally important to understanding my patch.  You're quite right that
> the logic governing this ultimately calls ix86_lea_outperforms, but as
> I'll explain below the shape of those APIs (requiring an insn) is not as
> convenient for instruction merging as for splitting.
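The new gcc.target/i386/lea-3.c mentioned in the ChangeLog is not shown in the thread; presumably it has roughly the following shape (a sketch only — the dejagnu directives and the exact scan pattern are guesses, not the committed test):

```c
/* { dg-do compile } */
/* { dg-options "-O2" } */

int m;

/* With the proposed peephole2, the add+sub pair forming the numerator
   should be merged back into a single 3-component lea.  */
int bar (int y)
{
  return (m + y - 1) / y;
}

/* { dg-final { scan-assembler "leal" } } */
```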
>
> The current location in i386.md where it decides whether the
> lea in the foo example above needs to be split is at line 6293:
>
> (define_peephole2
>   [(set (match_operand:SWI48 0 "register_operand")
>         (match_operand:SWI48 1 "address_no_seg_operand"))]
>   "ix86_hardreg_mov_ok (operands[0], operands[1])
>    && peep2_regno_dead_p (0, FLAGS_REG)
>    && ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
>   ...
>
> Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
> returns true, so by symmetry it is not unreasonable to turn
> add+add->lea when ix86_avoid_lea_for_addr would return false.  The
> relevant part of ix86_avoid_lea_for_addr is then around line 15974
> of i386.cc:
>
>   /* Check we need to optimize. */
>   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
>     return false;
>
> which you'll recognize is precisely the condition under which my
> proposed peephole2 fires.  Technically, we also know that this is
> a 3-component lea, "parts.disp" is non-zero, the addition is destructive
> (overwriting one of the input register operands), and that we're not
> using a partial register mode (such as HImode).
>
> I won't argue that my peephole2 is "obviously" correct, but the
> observation that it correctly reduces code size when -Os is specified
> indicates that something is broken/inconsistent/missing in the current
> logic, and that this change is an improvement.
>
> Perhaps the TARGET_AVOID_LEA_FOR_ADDR check in ix86_avoid_lea_for_addr
> is incorrect; if so, please feel free to file a bugzilla PR.
>
> Ok for mainline?
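The patch itself is not quoted in this part of the thread. From the description (two consecutive destructive additions merged into a 3-component lea, gated on the same condition that makes ix86_avoid_lea_for_addr return false early), the new define_peephole2 presumably looks roughly like the sketch below — predicates, operand constraints, and the exact guard in the committed patch may well differ:

```
(define_peephole2
  [(parallel [(set (match_operand:SWI48 0 "register_operand")
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 1 "register_operand")))
              (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 2 "x86_64_immediate_operand")))
              (clobber (reg:CC FLAGS_REG))])]
  ;; Mirror of the early-exit in ix86_avoid_lea_for_addr.
  "!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)"
  [(set (match_dup 0)
        (plus:SWI48 (plus:SWI48 (match_dup 0) (match_dup 1))
                    (match_dup 2)))])
```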
OK.

Thanks,
Uros.