Hi Claudiu,
Thanks for the answers to my technical questions.
If you'd prefer to update arc.md's add3 pattern first, I'm happy to
update/revise my patch based on this and your feedback, for example,
preferring add over asl_s (or controlling this choice with -Os).
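
For illustration, a hedged sketch of how that choice might be keyed on size
optimization inside a splitter; gen_ashlsi3_cnt1 matches the pattern renamed
by the patch, while the condition and the variable names here are
assumptions, not the actual implementation:

/* Hypothetical split logic for a shift by one: use the compact
   "asl_s" when optimizing for size, otherwise the "add" form that
   is slightly beneficial on older ARCv1 cores.  */
if (optimize_function_for_size_p (cfun))
  emit_insn (gen_ashlsi3_cnt1 (dst, src));   /* asl_s dst,src */
else
  emit_insn (gen_addsi3 (dst, src, src));    /* add dst,src,src */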

Thanks again.
Roger
--

> -----Original Message-----
> From: Claudiu Zissulescu <claudiu.zissule...@synopsys.com>
> Sent: 03 October 2023 15:26
> To: Roger Sayle <ro...@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> Hi Roger,
> 
> It was nice to meet you too.
> 
> Thank you for looking into the ARC's non-Barrel Shifter configurations.  I
> will dive into your patch ASAP, but before starting, here are a few of my
> comments:
> 
> -----Original Message-----
> From: Roger Sayle <ro...@nextmovesoftware.com>
> Sent: Thursday, September 28, 2023 2:27 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Claudiu Zissulescu <claz...@synopsys.com>
> Subject: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> 
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the GNU
> Tools Cauldron in Cambridge.
> 
> This patch is the first in a series to improve SImode and DImode shifts and
> rotates in the ARC backend.  This first piece splits SImode shifts, for
> !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the
> split1 pass, as suggested by the FIXME comment above output_shift in
> arc.cc.  To do this I've copied the implementation of the
> x86_pre_reload_split function from the i386 backend, and renamed it
> arc_pre_reload_split.
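> 
> For reference, a minimal sketch of what arc_pre_reload_split would look
> like, assuming it copies the i386 helper verbatim with only the name
> changed:
> 
> /* Return true if new pseudos can still be created, i.e. the split1
>    pass (which sets PROP_rtl_split_insns) has not yet run.  */
> bool
> arc_pre_reload_split (void)
> {
>   return (can_create_pseudo_p ()
>           && !(cfun->curr_properties & PROP_rtl_split_insns));
> }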
> 
> Although the actual implementations of shifts remain the same (as in
> output_shift), having them as explicit instructions in the RTL stream
> allows better scheduling and use of compact forms when available.  The
> benefits can be seen in two short examples below.
> 
> For the function:
> unsigned int foo(unsigned int x, unsigned int y) {
>   return y << 2;
> }
> 
> GCC with -O2 -mcpu=em would previously generate:
> foo:    add r1,r1,r1
>         add r1,r1,r1
>         j_s.d   [blink]
>         mov_s   r0,r1   ;4
> 
> [CZI] Indeed, the move shouldn't be generated.  The use of ADDs is
> slightly beneficial for older ARCv1 arches.
> 
> and with this patch now generates:
> foo:    asl_s r0,r1
>         j_s.d   [blink]
>         asl_s r0,r0
> 
> [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 CPUs.
> 
> Notice the original (from shift_si3's output_shift) requires the shift
> sequence to be monolithic with the same destination register as the source
> (requiring an extra mov_s).  The new version can eliminate this move, and
> schedule the second asl in the branch delay slot of the return.
> 
> For the function:
> int x,y,z;
> 
> void bar()
> {
>   x <<= 3;
>   y <<= 3;
>   z <<= 3;
> }
> 
> GCC with -O2 -mcpu=em currently generates:
> bar:    push_s  r13
>         ld.as   r12,[gp,@x@sda] ;23
>         ld.as   r3,[gp,@y@sda]  ;23
>         mov r2,0
>         add3 r12,r2,r12
>         mov r2,0
>         add3 r3,r2,r3
>         ld.as   r2,[gp,@z@sda]  ;23
>         st.as   r12,[gp,@x@sda] ;26
>         mov r13,0
>         add3 r2,r13,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s.d   [blink]
>         pop_s   r13
> 
> where each shift by 3 uses ARC's add3 instruction, which is similar to
> x86's lea implementing x = (y<<3) + z, but requires the value zero to be
> placed in a temporary register "z".  Splitting this before reload allows
> these pseudos to be shared/reused.  With this patch, we get
> 
> bar:    ld.as   r2,[gp,@x@sda]  ;23
>         mov_s   r3,0    ;3
>         add3    r2,r3,r2
>         ld.as   r3,[gp,@y@sda]  ;23
>         st.as   r2,[gp,@x@sda]  ;26
>         ld.as   r2,[gp,@z@sda]  ;23
>         mov_s   r12,0   ;3
>         add3    r3,r12,r3
>         add3    r2,r12,r2
>         st.as   r3,[gp,@y@sda]  ;26
>         st.as   r2,[gp,@z@sda]  ;26
>         j_s     [blink]
> 
> [CZI] Looks great, but it also shows that I forgot to add the Ra,LIMM,RC
> variant to the ADD3 instruction.  With that variant, instead of
>         mov_s   r3,0    ;3
>         add3    r2,r3,r2
> we would get just "add3 r2,0,r2".  Indeed it is a longer instruction, but
> faster.
> 
> Unfortunately, register allocation means that we only share two of the
> three "mov_s z,0", but this is sufficient to reduce register pressure
> enough to avoid spilling r13 in the prologue/epilogue.
> 
> This patch also contains a (latent?) bug fix.  The implementation of the
> default insn "length" attribute assumes instructions of type "shift" have
> two input operands and accesses operands[2]; hence specializations of
> shifts that don't have an operands[2] need to be categorized as type
> "unary" (which results in the correct length).
> 
> [CZI] The ARC types need an upgrade too.
> 
> This patch has been tested on a cross-compiler to arc-elf (hosted on
> x86_64-pc-linux-gnu).  Because I have an incomplete toolchain, many of the
> regression tests fail, but there are no new failures with the new test
> cases added below.  If you can confirm that there are no issues from
> additional testing, is this OK for mainline?
> 
> Finally, a quick technical question.  ARC's zero-overhead loops require at
> least two instructions in the loop, so currently the backend's
> implementation of lshr20 pads the loop body with a "nop":
> 
> lshr20: mov.f lp_count, 20
>         lpnz    2f
>         lsr r0,r0
>         nop
> 2:      # end single insn loop
>         j_s     [blink]
> 
> 
> [CZI] The ZOLs (LP instructions) are not great when dealing with short
> loop blocks.  Hence the NOP instruction.  Personally, I don't fancy using
> the LP instruction in this case, as it prohibits LP usage for a true
> for-loop.
> 
> Could this be more efficiently implemented as:
> 
> lshr20: mov lp_count, 10
>         lp 2f
>         lsr_s r0,r0
>         lsr_s r0,r0
> 2:      # end single insn loop
>         j_s     [blink]
> 
> i.e. half the number of iterations, but doing twice as much useful work in
> each iteration?  Or might the nop be free on advanced microarchitectures,
> and/or the consecutive dependent shifts cause a pipeline stall?  It would
> be nice to fuse loops to implement rotations, such that rotr16 (aka swap)
> would look like:
> 
> rot16:  mov_s r1,r0
>         mov lp_count, 16
>         lp 2f
>         asl_s r0,r0
>         lsr_s r1,r1
> 2:      # end single insn loop
>         j_s.d    [blink]
>         or_s r0,r0,r1
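> 
> In C terms, that fused loop computes the standard rotate idiom (a sketch;
> the function name is just illustrative):
> 
> unsigned int rot16(unsigned int x)
> {
>   /* r0 accumulates x << 16 and r1 accumulates x >> 16; the final
>      or_s merges the two halves, i.e. a rotate by 16.  */
>   return (x << 16) | (x >> 16);
> }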
> 
> Thanks in advance,
> Roger
> 
> 
> 2023-09-28  Roger Sayle  <ro...@nextmovesoftware.com>
> 
> gcc/ChangeLog
>         * config/arc/arc-protos.h (emit_shift): Delete prototype.
>         (arc_pre_reload_split): New function prototype.
>         * config/arc/arc.cc (emit_shift): Delete function.
>         (arc_pre_reload_split): New predicate function, copied from i386,
>         to schedule define_insn_and_split splitters to the split1 pass.
>         * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
>         (ashrsi3): Likewise.
>         (lshrsi3): Likewise.
>         (shift_si3): Move after other shift patterns, and disable when
>         operands[2] is one (which is handled by its own define_insn).
>         Use shiftr4_operator, instead of shift4_operator, as this is no
>         longer used for left shifts.
>         (shift_si3_loop): Likewise.  Additionally remove match_scratch.
>         (*ashlsi3_nobs): New pre-reload define_insn_and_split.
>         (*ashrsi3_nobs): Likewise.
>         (*lshrsi3_nobs): Likewise.
>         (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
>         (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         Change length attribute to "*,4" to allow compact representation.
>         (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (add_shift): Rename define_insn from *add_shift.
>         * config/arc/predicates.md (shiftl4_operator): Delete.
>         (shift4_operator): Delete.
> 
> gcc/testsuite/ChangeLog
>         * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-3.c: Likewise.
>         * gcc.target/arc/ashrsi-4.c: Likewise.
>         * gcc.target/arc/ashrsi-5.c: Likewise.
>         * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-3.c: Likewise.
>         * gcc.target/arc/lshrsi-4.c: Likewise.
>         * gcc.target/arc/lshrsi-5.c: Likewise.
>         * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-3.c: Likewise.
>         * gcc.target/arc/shlsi-4.c: Likewise.
>         * gcc.target/arc/shlsi-5.c: Likewise.

