Hi Claudiu,

Thanks for the answers to my technical questions.  If you'd prefer to update
arc.md's add3 pattern first, I'm happy to update/revise my patch based on this
and your feedback, for example preferring add over asl_s (or controlling this
choice with -Os).
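
Just to spell out the equivalence behind that choice: a shift left by one is
the same as adding a value to itself, so either sequence is correct and the
difference is purely one of cost/size.  A minimal C sketch (the function names
are only illustrative, and the commented instructions are the candidates
discussed above rather than guaranteed output):

/* Both functions compute the same value; picking add versus asl_s for a
   shift by one is a cost/size decision for the backend.  */
unsigned int shl1_add (unsigned int x) { return x + x;  }  /* e.g. add   rX,rX,rX */
unsigned int shl1_asl (unsigned int x) { return x << 1; }  /* e.g. asl_s rX,rX    */
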
Thanks again.
Roger
--

> -----Original Message-----
> From: Claudiu Zissulescu <claudiu.zissule...@synopsys.com>
> Sent: 03 October 2023 15:26
> To: Roger Sayle <ro...@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
>
> Hi Roger,
>
> It was nice to meet you too.
>
> Thank you for looking into the ARC's non-barrel-shifter configurations.
> I will dive into your patch asap, but before starting here are a few of
> my comments:
>
> -----Original Message-----
> From: Roger Sayle <ro...@nextmovesoftware.com>
> Sent: Thursday, September 28, 2023 2:27 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Claudiu Zissulescu <claz...@synopsys.com>
> Subject: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
>
>
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the GNU
> tools Cauldron in Cambridge.
>
> This patch is the first in a series to improve SImode and DImode shifts
> and rotates in the ARC backend.  This first piece splits SImode shifts,
> for !TARGET_BARREL_SHIFTER targets, after combine and before reload, in
> the split1 pass, as suggested by the FIXME comment above output_shift
> in arc.cc.  To do this I've copied the implementation of the
> x86_pre_reload_split function from the i386 backend, and renamed it
> arc_pre_reload_split.
>
> Although the actual implementations of shifts remain the same (as in
> output_shift), having them as explicit instructions in the RTL stream
> allows better scheduling and use of compact forms when available.  The
> benefits can be seen in two short examples below.
>
> For the function:
> unsigned int foo(unsigned int x, unsigned int y) {
>   return y << 2;
> }
>
> GCC with -O2 -mcpu=em would previously generate:
> foo:    add r1,r1,r1
>         add r1,r1,r1
>         j_s.d [blink]
>         mov_s r0,r1     ;4
>
> [CZI] The move shouldn't be generated indeed.  The use of ADDs is
> slightly beneficial for older ARCv1 arches.
>
> and with this patch now generates:
> foo:    asl_s r0,r1
>         j_s.d [blink]
>         asl_s r0,r0
>
> [CZI] Nice.  This new sequence is as fast as we can get for our ARCv2
> CPUs.
>
> Notice the original (from shift_si3's output_shift) requires the shift
> sequence to be monolithic, with the same destination register as the
> source (requiring an extra mov_s).  The new version can eliminate this
> move, and schedule the second asl in the branch delay slot of the
> return.
>
> For the function:
> int x,y,z;
>
> void bar()
> {
>   x <<= 3;
>   y <<= 3;
>   z <<= 3;
> }
>
> GCC -O2 -mcpu=em currently generates:
> bar:    push_s r13
>         ld.as r12,[gp,@x@sda]   ;23
>         ld.as r3,[gp,@y@sda]    ;23
>         mov r2,0
>         add3 r12,r2,r12
>         mov r2,0
>         add3 r3,r2,r3
>         ld.as r2,[gp,@z@sda]    ;23
>         st.as r12,[gp,@x@sda]   ;26
>         mov r13,0
>         add3 r2,r13,r2
>         st.as r3,[gp,@y@sda]    ;26
>         st.as r2,[gp,@z@sda]    ;26
>         j_s.d [blink]
>         pop_s r13
>
> where each shift by 3 uses ARC's add3 instruction, which is similar to
> x86's lea, implementing x = (y<<3) + z, but requires the value zero to
> be placed in a temporary register "z".  Splitting this before reload
> allows these pseudos to be shared/reused.
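
As an aside, the add3 semantics above can be spelled out in C; the operand
order below is just my reading of the generated code, i.e. dst = src1 +
(src2 << 3), so treat this as a sketch rather than a reference:

/* Models "add3 dst,src1,src2": dst = src1 + (src2 << 3).  */
unsigned int add3_model (unsigned int src1, unsigned int src2)
{
  return src1 + (src2 << 3);
}

/* A shift left by 3 therefore needs a register holding zero for src1.  */
unsigned int shl3 (unsigned int x)
{
  return add3_model (0, x);
}
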
> With this patch, we get
>
> bar:    ld.as r2,[gp,@x@sda]    ;23
>         mov_s r3,0      ;3
>         add3 r2,r3,r2
>         ld.as r3,[gp,@y@sda]    ;23
>         st.as r2,[gp,@x@sda]    ;26
>         ld.as r2,[gp,@z@sda]    ;23
>         mov_s r12,0     ;3
>         add3 r3,r12,r3
>         add3 r2,r12,r2
>         st.as r3,[gp,@y@sda]    ;26
>         st.as r2,[gp,@z@sda]    ;26
>         j_s [blink]
>
> [CZI] Looks great, but it also shows that I forgot to add the Ra,LIMM,RC
> variant to the ADD3 instruction, which would let us have, instead of
>         mov_s r3,0      ;3
>         add3 r2,r3,r2
> only this: add3 r2,0,r2.  Indeed it is a longer instruction, but faster.
>
> Unfortunately, register allocation means that we only share two of the
> three "mov_s z,0", but this is sufficient to reduce register pressure
> enough to avoid spilling r13 in the prologue/epilogue.
>
> This patch also contains a (latent?) bug fix.  The implementation of the
> default insn "length" attribute assumes instructions of type "shift"
> have two input operands and accesses operands[2], hence specializations
> of shifts that don't have an operands[2] need to be categorized as type
> "unary" (which results in the correct length).
>
> [CZI] The ARC types need an upgrade too.
>
> This patch has been tested on a cross-compiler to arc-elf (hosted on
> x86_64-pc-linux-gnu), but because I've an incomplete tool chain many of
> the regression tests fail; there are no new failures with the new test
> cases added below.  If you can confirm that there are no issues from
> additional testing, is this OK for mainline?
>
> Finally, a quick technical question.  ARC's zero overhead loops require
> at least two instructions in the loop, so currently the backend's
> implementation of shr20 pads the loop body with a "nop".
>
> lshr20: mov.f lp_count, 20
>         lpnz 2f
>         lsr r0,r0
>         nop
> 2:      # end single insn loop
>         j_s [blink]
>
> [CZI] The ZOLs (LP instructions) are not great when dealing with short
> loop blocks, hence the NOP instruction.  Personally, I don't fancy using
> the LP instruction in this case, as it prohibits LP usage for a true
> for-loop.
>
> Could this be more efficiently implemented as:
>
> lshr20: mov lp_count, 10
>         lp 2f
>         lsr_s r0,r0
>         lsr_s r0,r0
> 2:      # end single insn loop
>         j_s [blink]
>
> i.e. half the number of iterations, but doing twice as much useful work
> in each iteration?  Or might the nop be free on advanced
> microarchitectures, and/or might the consecutive dependent shifts cause
> a pipeline stall?  It would be nice to fuse loops to implement
> rotations, such that rotr16 (aka swap) would look like:
>
> rot16:  mov_s r1,r0
>         mov lp_count, 16
>         lp 2f
>         asl_s r0,r0
>         lsr_s r1,r1
> 2:      # end single insn loop
>         j_s.d [blink]
>         or_s r0,r0,r1
>
> Thanks in advance,
> Roger
>
>
> 2023-09-28  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/arc/arc-protos.h (emit_shift): Delete prototype.
>         (arc_pre_reload_split): New function prototype.
>         * config/arc/arc.cc (emit_shift): Delete function.
>         (arc_pre_reload_split): New predicate function, copied from i386,
>         to schedule define_insn_and_split splitters to the split1 pass.
>         * config/arc/arc.md (ashlsi3): Expand RTL template unconditionally.
>         (ashrsi3): Likewise.
>         (lshrsi3): Likewise.
>         (shift_si3): Move after other shift patterns, and disable when
>         operands[2] is one (which is handled by its own define_insn).
>         Use shiftr4_operator, instead of shift4_operator, as this is no
>         longer used for left shifts.
>         (shift_si3_loop): Likewise.  Additionally remove match_scratch.
>         (*ashlsi3_nobs): New pre-reload define_insn_and_split.
>         (*ashrsi3_nobs): Likewise.
>         (*lshrsi3_nobs): Likewise.
>         (rotrsi3_cnt1): Rename define_insn from *rotrsi3_cnt1.
>         (ashlsi3_cnt1): Rename define_insn from *ashlsi2_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         Change length attribute to "*,4" to allow compact representation.
>         (lshrsi3_cnt1): Rename define_insn from *lshrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (ashrsi3_cnt1): Rename define_insn from *ashrsi3_cnt1.  Change
>         insn type attribute to "unary", as this doesn't have operands[2].
>         (add_shift): Rename define_insn from *add_shift.
>         * config/arc/predicates.md (shiftl4_operator): Delete.
>         (shift4_operator): Delete.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/arc/ashrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/ashrsi-3.c: Likewise.
>         * gcc.target/arc/ashrsi-4.c: Likewise.
>         * gcc.target/arc/ashrsi-5.c: Likewise.
>         * gcc.target/arc/lshrsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/lshrsi-3.c: Likewise.
>         * gcc.target/arc/lshrsi-4.c: Likewise.
>         * gcc.target/arc/lshrsi-5.c: Likewise.
>         * gcc.target/arc/shlsi-1.c: New TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-2.c: New !TARGET_BARREL_SHIFTER test case.
>         * gcc.target/arc/shlsi-3.c: Likewise.
>         * gcc.target/arc/shlsi-4.c: Likewise.
>         * gcc.target/arc/shlsi-5.c: Likewise.