On Tue, 24 Sep 2013, Vidya Praveen wrote:

> On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > 
> > > Hello,
> > >
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > >
> > > Appreciate your opinion on this.
> > >
> > > ---
> > >
> > > A phrase like this:
> > >
> > > for(i=0;i<4;i++)
> > >   a[i] = b[i] <op> c[2];
> > >
> > > is usually vectorized as:
> > >
> > >  va:V4SI = a[0:3]
> > >  vb:V4SI = b[0:3]
> > >  t = c[2]
> > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > vec_init
> > >  ...
> > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > >
> > > But this could be simplified further if a target has instructions that 
> > > support
> > > indexed element as a parameter. For example an instruction like this:
> > >
> > >  mul v0.4s, v1.4s, v2.4s[2]
> > >
> > > can perform multiplication of each element of v2.4s with the third 
> > > element of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > >
> > > For this to happen, vectorizer needs to understand this idiom and treat 
> > > the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for <op> through a target hook or 
> > > macro)
> > > and consider this as vectorizable statement without having to duplicate 
> > > the
> > > elements explicitly.
> > >
> > > There are fews ways this could be represented at gimple:
> > >
> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  ...
> > >
> > > or by allowing a vectorizer treat an indexed element as a valid operand 
> > > in a
> > > vectorizable statement:
> > 
> > Might as well allow any scalar then...
> 
> Yes, I had given an example below.
> 
> > 
> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > >  ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] 
> > > is the
> > > only use of 'c' then it may be safer to just load one element and use it 
> > > like
> > > this:
> > >
> > >  vc:V4SI[0] = c[2]
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > >
> > >  for(i=0;i<4;i++)
> > >    a[i] = b[i] <op> c
> > >
> > > could be vectorized as:
> > >
> > >  vc:V4SI[0] = c
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > Such a change would also require new standard pattern names to be defined 
> > > for
> > > each <op>.
> > >
> > > Alternatively, having something like this:
> > >
> > >  ...
> > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > >  ...
> > >
> > > would remove the need to introduce several new standard pattern names but 
> > > have
> > > just one to represent vec_duplicate(vec_select()) but ofcourse this will 
> > > expect
> > > the target to have combiner patterns.
> > 
> > The cost estimation wouldn't be very good, but aren't combine patterns 
> > enough for the whole thing? Don't you model your mul instruction as:
> > 
> > (mult:V4SI
> >    (match_operand:V4SI)
> >    (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > 
> > anyway? Seems that combine should be able to handle it. What currently 
> > happens that we fail to generate the right instruction?
> 
> At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> I can't really insist on the single lane load.. something like:
> 
> vc:V4SI[0] = c
> vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> va:V4SI = vb:V4SI <op> vt:V4SI
> 
> Or is there any other way to do this?
Can you elaborate on "I can't really insist on the single lane load"?
What's the single lane load in your example?  I'd expect the instruction
pattern as quoted to just work (and I hope we expand an uniform
constructor { a, a, a, a } properly using vec_duplicate).

Richard.

> Cheers
> VP
> 
> > 
> > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> > for vec_duplicate, adding new nodes is always painful.
> > 
> > > This enhancement could possibly help further optimizing larger scenarios 
> > > such
> > > as linear systems.
> > >
> > > Regards
> > > VP
> > 
> > -- 
> > Marc Glisse
> >
> 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend

Reply via email to