On Mon, 6 Jan 2020, Kewen.Lin wrote:

> Hi all,
>
> Recently I'm investigating on an issue related to use D-form/X-form vector
> memory access, it's the same as what the patch
> https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01879.html
> was intended to deal with.  Power9 introduces DQ-form instructions for vector
> memory access, we prefer to use DQ-form when unrolling loop.  As the example
> in the above link, it can save number of ADDI and GPR for indexing.
>
> Or for below example:
>
>   extern void dummy (double, unsigned n);
>
>   void
>   func (double *x, double *y, unsigned m, unsigned n)
>   {
>     double sacc;
>     for (unsigned j = 1; j < m; j++)
>     {
>       sacc = 0.0;
>       for (unsigned i = 1; i < n; i++)
>         sacc = sacc + x[i] * y[i];
>       dummy (sacc, n);
>     }
>   }
>
> Core loop with X-form (lxvx):
>
>     mtctr     r10
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r10,r9,16
>     addi      r9,r9,32
>     xvmaddadp vs32,vs12,vs0
>     lxvx      vs12,r31,r10
>     lxvx      vs0,r30,r10
>     xvmaddadp vs11,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,32
>     xvmaddadp vs32,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,48
>     xvmaddadp vs11,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,64
>     xvmaddadp vs32,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,80
>     xvmaddadp vs11,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,96
>     xvmaddadp vs32,vs12,vs0
>     lxvx      vs12,r31,r9
>     lxvx      vs0,r30,r9
>     addi      r9,r10,112
>     xvmaddadp vs11,vs12,vs0
>     bdnz      190 <func+0x190>
>
> vs.
>
> Core loop with D-form (lxv)
>     mtctr     r8
>     lxv       vs12,0(r9)
>     lxv       vs0,0(r10)
>     addi      r7,r9,16    // r7, r8 can be eliminated further with r9, r10
>     addi      r8,r10,16   // 2 or 4 addi vs. 8 addi above
>     addi      r9,r9,128
>     addi      r10,r10,128
>     xvmaddadp vs32,vs12,vs0
>     lxv       vs12,-112(r9)
>     lxv       vs0,-112(r10)
>     xvmaddadp vs11,vs12,vs0
>     lxv       vs12,16(r7)
>     lxv       vs0,16(r8)
>     xvmaddadp vs32,vs12,vs0
>     lxv       vs12,32(r7)
>     lxv       vs0,32(r8)
>     xvmaddadp vs11,vs12,vs0
>     lxv       vs12,48(r7)
>     lxv       vs0,48(r8)
>     xvmaddadp vs32,vs12,vs0
>     lxv       vs12,64(r7)
>     lxv       vs0,64(r8)
>     xvmaddadp vs11,vs12,vs0
>     lxv       vs12,80(r7)
>     lxv       vs0,80(r8)
>     xvmaddadp vs32,vs12,vs0
>     lxv       vs12,96(r7)
>     lxv       vs0,96(r8)
>     xvmaddadp vs11,vs12,vs0
>     bdnz      1b0 <func+0x1b0>
>
> We are thinking whether it can be handled in IVOPTs instead of one RTL pass.
>
> During IVOPTs selecting IV cands, it doesn't know the loop will be unrolled so
> it doesn't count the possible step cost in with X-form.  If we can teach it to
> consider the case, the IV cands which plays with D-form can be preferred.
> Currently unrolling (incomplete) happens in RTL, it looks we have to predict
> the loop whether unroll in IVOPTs.  Since there is some parameter checks on RTL
> insn counts and target hooks, it seems not easy to get that.  Besides, we need
> to check the step is valid to put into D-form field (eg: DQ-form requires divide
> 16 exactly), to ensure no extra ADDIs needed.
>
> I'm not sure whether it's a good idea to implement in IVOPTs, but I did some
> changes in IVOPTs to prove it's doable to get expected codes, the patch is
> attached.
>
> Any comments/suggestions are highly appreciated!
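
For reference, the DQ-form constraint mentioned above can be sketched in a few
lines of C.  This is only an illustration, not code from the attached patch or
from GCC: both function names are hypothetical, and the only facts assumed are
that the lxv displacement is a signed 16-bit value with its low four bits zero
(so it must be a multiple of 16 in the range -32768..32752) and that unrolled
copy k would address memory at base + k * step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a GCC function): can OFFSET be encoded as the
   displacement of a Power9 DQ-form access such as lxv?  The displacement
   is a signed 16-bit value whose low four bits must be zero.  */
static bool
dq_form_offset_ok (int64_t offset)
{
  return offset >= -32768 && offset <= 32752 && offset % 16 == 0;
}

/* Hypothetical check: if a loop whose IV advances by STEP bytes per
   iteration is unrolled UNROLL times, can every unrolled copy address
   memory as base + k * STEP without needing an extra addi?  */
static bool
unrolled_step_fits_dq_form (int64_t step, unsigned unroll)
{
  for (unsigned k = 0; k < unroll; k++)
    if (!dq_form_offset_ok ((int64_t) k * step))
      return false;
  return true;
}
```

With a 16-byte step and an unroll factor of 8, as in the listings above, the
offsets 0..112 all pass, so one base update per pointer would be enough.  A
step that is not a multiple of 16 fails the check.  Whether the loop will be
unrolled at all is the part that is hard to know at IVOPTs time, and this
sketch does not attempt to model it.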

Is the unrolled code better than the not unrolled code
(assuming optimal IV choice)?  Then IMHO IVOPTs should drive the
unrolling, either by actually doing it or by forcing it via the
loop->unroll setting.  I don't think second-guessing the RTL unroller
at this point is going to work.  Alternatively turn X-form into D-form
during RTL unrolling?

Thanks,
Richard.

> BR,
> Kewen
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)