Thanks. I was looking at bfin. MT's implementation looks similar but simpler.
> -----Original Message-----
> From: Ramana Radhakrishnan [mailto:[EMAIL PROTECTED]
> Sent: 16 July 2008 19:17
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Question about doloop_end pattern
>
> Hi Bingfeng,
>
> > Hello,
> > I tried to use doloop_end pattern to reduce loop overhead for our target
> > processor, which features a dedicated loop instruction. Somehow even a
> > simple loop just cannot pass the test of doloop_condition_get, which
> > requires following canonical pattern.
>
> I checked this on our private port of GCC . This is based off 4.3
> branch which is off what we are working on right now . We do use the
> doloop pattern to generate out these cases in our port and I can
> confirm that for our case we generate the following bit of code. Our
> tree does have a few other tweaks that we maintain that we'd like to
> contribute once the copyright assignments are in place.
>
> Unroll:
>         c2c $c5,$c2
>         i2cs $c4,63
> .L2:
>         ldw $c2,($c5)+=1
>         add $c2,$c1,$c2
>         stw ($c3)+=1,$c2
>         brinzdec $c4,.L2
>         brz $zero,$link
>
> You probably want to see the mt backend for some example as to how to
> do it . It looks similar to how we do it in ours.
>
> cheers
> Ramana
>
> ----
> Ramana Radhakrishnan
> Icera Semiconductor
>
> On Wed, Jul 16, 2008 at 12:05 PM, Bingfeng Mei
> <[EMAIL PROTECTED]> wrote:
> > Hello,
> > I tried to use doloop_end pattern to reduce loop overhead for our target
> > processor, which features a dedicated loop instruction. Somehow even a
> > simple loop just cannot pass the test of doloop_condition_get, which
> > requires following canonical pattern.
> >
> > /* The canonical doloop pattern we expect has one of the following
> >    forms:
> >
> >    1) (parallel [(set (pc) (if_then_else (condition)
> >                                          (label_ref (label))
> >                                          (pc)))
> >                  (set (reg) (plus (reg) (const_int -1)))
> >                  (additional clobbers and uses)])
> >
> >    The branch must be the first entry of the parallel (also required
> >    by jump.c), and the second entry of the parallel must be a set of
> >    the loop counter register.  Some targets (IA-64) wrap the set of
> >    the loop counter in an if_then_else too.
> >
> >    2) (set (reg) (plus (reg) (const_int -1))
> >       (set (pc) (if_then_else (reg != 0)
> >                               (label_ref (label))
> >                               (pc))).  */
> >
> >
> > Here is a simple function I used, it should meet all doloop optimization
> > requirements.
> > void Unroll( short s, int * restrict b_inout, int *restrict out, int N)
> > {
> >   int i;
> >   for (i=0; i<64; i++)
> >   {
> >     out[i] = b_inout[i] + s;
> >   }
> > }
> >
> >
> > In tree ivcanon pass, it is converted to
> > ;; Function Unroll (Unroll)
> >
> > Unroll (short int s, int * restrict b_inout, int * restrict out, int N)
> > {
> >   unsigned int ivtmp.14;
> >   int pretmp.9;
> >   long unsigned int pretmp.8;
> >   int storetmp.6;
> >   int i;
> >   int D.1459;
> >   int D.1458;
> >   int D.1457;
> >   int * D.1456;
> >   int * D.1455;
> >   long unsigned int D.1454;
> >   long unsigned int D.1453;
> >
> > <bb 2>:
> >   pretmp.9_8 = (int) s_12(D);
> >
> > <bb 3>:
> >   # ivtmp.14_13 = PHI <ivtmp.14_21(4), 64(2)>
> >   # i_19 = PHI <i_15(4), 0(2)>
> >   D.1453_3 = (long unsigned int) i_19;
> >   D.1454_4 = D.1453_3 * 4;
> >   D.1455_6 = out_5(D) + D.1454_4;
> >   D.1456_10 = b_inout_9(D) + D.1454_4;
> >   D.1457_11 = *D.1456_10;
> >   D.1459_14 = pretmp.9_8 + D.1457_11;
> >   *D.1455_6 = D.1459_14;
> >   i_15 = i_19 + 1;
> >   ivtmp.14_21 = ivtmp.14_13 - 1;
> >   if (ivtmp.14_21 != 0)
> >     goto <bb 4>;
> >   else
> >     goto <bb 5>;
> >
> > <bb 4>:
> >   goto <bb 3>;
> >
> > <bb 5>:
> >   return;
> >
> > }
> >
> >
> > This should match requirements of doloop_condition_get.
> > But after ivopts pass, the code is transformed to:
> >
> > ;; Function Unroll (Unroll)
> >
> > Unroll (short int s, int * restrict b_inout, int * restrict out, int N)
> > {
> >   long unsigned int ivtmp.21;
> >   unsigned int ivtmp.14;
> >   int pretmp.9;
> >   long unsigned int pretmp.8;
> >   int storetmp.6;
> >   int i;
> >   int D.1459;
> >   int D.1458;
> >   int D.1457;
> >   int * D.1456;
> >   int * D.1455;
> >   long unsigned int D.1454;
> >   long unsigned int D.1453;
> >
> > <bb 2>:
> >   pretmp.9_8 = (int) s_12(D);
> >
> > <bb 3>:
> >   # ivtmp.21_7 = PHI <ivtmp.21_16(4), 0(2)>
> >   D.1457_11 = MEM[base: b_inout_9(D), index: ivtmp.21_7];
> >   D.1459_14 = pretmp.9_8 + D.1457_11;
> >   MEM[base: out_5(D), index: ivtmp.21_7] = D.1459_14;
> >   ivtmp.21_16 = ivtmp.21_7 + 4;
> >   if (ivtmp.21_16 != 256)
> >     goto <bb 4>;
> >   else
> >     goto <bb 5>;
> >
> > <bb 4>:
> >   goto <bb 3>;
> >
> > <bb 5>:
> >   return;
> >
> > }
> >
> >
> > It is not required canonical form anymore. And later RTL level
> > optimizations cannot convert it back. Since it doesn't pass the
> > doloop_condition_get test, modulo scheduling pass doesn't work too. Do
> > I miss something here? Any hint is greatly appreciated.
> >
> > Cheers,
> > Bingfeng Mei
> >
>
> --
> Ramana Radhakrishnan
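
For anyone following the thread, the three loop shapes discussed above can be written out by hand in C. This is only a sketch to make the comparison concrete, not compiler output: the function names (unroll_source, unroll_countdown, unroll_byte_offset) and the TRIP_COUNT macro are made up for illustration, and sizeof(int) stands in for the constant 4 that appears in the dumps (so the bound 256 in the ivopts dump corresponds to TRIP_COUNT * sizeof(int) when int is 4 bytes).

```c
#include <stddef.h>

#define TRIP_COUNT 64  /* hypothetical name; the loop bound from the example */

/* Form 1: the loop as written in the source, counting i up from 0. */
void unroll_source(short s, const int *restrict in, int *restrict out)
{
    for (int i = 0; i < TRIP_COUNT; i++)
        out[i] = in[i] + s;
}

/* Form 2: roughly the shape shown after ivcanon. A separate counter
   starts at 64, is decremented once per iteration, and the exit test
   is "counter != 0". */
void unroll_countdown(short s, const int *restrict in, int *restrict out)
{
    unsigned int count = TRIP_COUNT;
    int i = 0;
    do {
        out[i] = in[i] + s;
        i++;
        count--;
    } while (count != 0);
}

/* Form 3: roughly the shape shown after ivopts. A single byte-offset
   induction variable steps by sizeof(int) and the exit test compares
   it against the total size in bytes (256 in the dump). */
void unroll_byte_offset(short s, const int *restrict in, int *restrict out)
{
    size_t off = 0;
    do {
        *(int *)((char *)out + off) =
            *(const int *)((const char *)in + off) + s;
        off += sizeof(int);
    } while (off != TRIP_COUNT * sizeof(int));
}
```

All three compute the same result for 64 elements. Only form 2 ends with a "decrement, then compare against zero" test like the `(reg != 0)` condition in the canonical pattern quoted above; form 3 counts up and compares against a non-zero bound, which is the shape the original message reports as no longer matching.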