On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > On Mon, 9 Sep 2013, Marc Glisse wrote: > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > Hello, > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > could take advantage of the SIMD instructions that allows indexed element > > > as an operand thus reducing the need for duplication and possibly improve > > > reuse of previously loaded data. > > > > > > Appreciate your opinion on this. > > > > > > --- > > > > > > A phrase like this: > > > > > > for(i=0;i<4;i++) > > > a[i] = b[i] <op> c[2]; > > > > > > is usually vectorized as: > > > > > > va:V4SI = a[0:3] > > > vb:V4SI = b[0:3] > > > t = c[2] > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > vec_init > > > ... > > > va:V4SI = vb:V4SI <op> vc:V4SI > > > > > > But this could be simplified further if a target has instructions that > > > support > > > indexed element as a parameter. For example an instruction like this: > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > can perform multiplication of each element of v2.4s with the third element > > > of > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > elements of v0.4s. > > > > > > For this to happen, vectorizer needs to understand this idiom and treat > > > the > > > operand c[2] specially (and by taking in to consideration if the machine > > > supports indexed element as an operand for <op> through a target hook or > > > macro) > > > and consider this as vectorizable statement without having to duplicate > > > the > > > elements explicitly. > > > > > > There are fews ways this could be represented at gimple: > > > > > > ... > > > va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > ... > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand > > > in a > > > vectorizable statement: > > > > Might as well allow any scalar then... > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > would necessarily be two extra separate statements and thus subject > to CSE obfuscating it enough for RTL expansion to no longer notice it.
I also thought about having a specialized expression like VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index> to mean: arg0 = arg1 <op> arg2[index] and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR is handled in expr.c. But I dropped this idea since we may need to introduce many such nodes. > > That said, allowing mixed scalar/vector ops isn't very nice and > your scheme can be simplified by just using > > vc:V4SI = VEC_DUPLICATE_EXPR <...> > va:V4SI = vb:V4SI <op> vc:V4SI > > where the expander only has to see that vc:V4SI is defined by > a duplicate. I did try out something like this quickly before I posted this RFC, though I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) for: for(i=0;i<8;i++) a[i] = b[2] * c[i]; I could generate: ... <bb 8>: _88 = prolog_loop_adjusted_niters.6_60 * 4; vectp_c.13_87 = c_10(D) + _88; vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; <<<<<<<< vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); <<<<<<<< _96 = prolog_loop_adjusted_niters.6_60 * 4; vectp_a.19_95 = a_6(D) + _96; vect__12.14_115 = MEM[(int *)vectp_c.13_87]; vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; <<<<<<<< MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; <<<<<<<< vectp_c.12_118 = vectp_c.13_87 + 16; vectp_a.18_119 = vectp_a.19_95 + 16; ivtmp_120 = 1; if (ivtmp_120 < bnd.8_62) goto <bb 9>; else goto <bb 11>; <bb 9>: # vectp_c.12_89 = PHI <vectp_c.12_118(8)> # vectp_a.18_97 = PHI <vectp_a.18_119(8)> # ivtmp_14 = PHI <ivtmp_120(8)> vect__12.14_91 = MEM[(int *)vectp_c.12_89]; <<<<<<<< vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; <<<<<<<< MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94; ... It's a crude implementation so VEC_DUP is printed as: (vect_ldidx_.16_92) <<< ??? >>> (0); > > > ... > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2) > > > ... > > > > > > For the sake of explanation, the above two representations assumes that > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] > > > is > > > the > > > only use of 'c' then it may be safer to just load one element and use it > > > like > > > this: > > > > > > vc:V4SI[0] = c[2] > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > This could also mean that expressions involving scalar could be treated > > > similarly. For example, > > > > > > for(i=0;i<4;i++) > > > a[i] = b[i] <op> c > > > > > > could be vectorized as: > > > > > > vc:V4SI[0] = c > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > Such a change would also require new standard pattern names to be defined > > > for > > > each <op>. > > > > > > Alternatively, having something like this: > > > > > > ... > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > va:V4SI = vb:V4SI <op> vt:V4SI > > > ... > > > > > > would remove the need to introduce several new standard pattern names but > > > have > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will > > > expect > > > the target to have combiner patterns. > > > > The cost estimation wouldn't be very good, but aren't combine patterns > > enough > > for the whole thing? Don't you model your mul instruction as: > > > > (mult:V4SI > > (match_operand:V4SI) > > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI)))) > > > > anyway? Seems that combine should be able to handle it. What currently > > happens > > that we fail to generate the right instruction? > > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for > > vec_duplicate, adding new nodes is always painful. > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But yes, > we have it that way at the moment and indeed adding new nodes is always > painful. > > > > This enhancement could possibly help further optimizing larger scenarios > > > such > > > as linear systems. > > Given that the vectorizer already handles all cases you quote but > just the expansion doesn't use the targets special abilities - can't > you just teach the expander to lookup the definition of the > vectors and see if it is an uniform CONSTRUCTOR? > > Richard. > I did think of handling this as a part of expanding CONSTRUCTOR but I thought it may not a good idea if we want to enhance this support in the future to handle larger cases like this one (hypothetical example): for i = 0 to 3 for j = 0 to 3 a[j] += b[j] * c[i] to a[0:3] += b[0:3] + c[0] a[0:3] += b[0:3] + c[1] a[0:3] += b[0:3] + c[2] a[0:3] += b[0:3] + c[3] Secondly, I am not sure if introducing a single lane load at the time of expansion and removing or expecting the existing scalar load to be removed later as it is unused, is a good idea. Please advise. Cheers VP