Richard Biener <richard.guent...@gmail.com> writes:
> On Mon, Oct 21, 2019 at 12:08 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Richard Biener <richard.guent...@gmail.com> writes:
>> > On October 20, 2019 2:54:48 PM GMT+02:00, Richard Sandiford 
>> > <richard.sandif...@arm.com> wrote:
>> >>Richard Biener <richard.guent...@gmail.com> writes:
>> >>> On October 19, 2019 5:06:40 PM GMT+02:00, Richard Sandiford
>> >><richard.sandif...@arm.com> wrote:
>> >>>>After the previous patch, it seems more natural to apply the
>> >>>>PARAM_SLP_MAX_INSNS_IN_BB threshold as soon as we know what
>> >>>>the region is, rather than delaying it to vect_slp_analyze_bb_1.
>> >>>>(But rather than carve out the biggest region possible and then
>> >>>>reject it, wouldn't it be better to stop when the region gets
>> >>>>too big, so we at least have a chance of vectorising something?)
>> >>>>
>> >>>>It also seems more natural for vect_slp_bb_region to create the
>> >>>>bb_vec_info itself rather than (a) having to pass bits of data down
>> >>>>for the initialisation and (b) forcing vect_slp_analyze_bb_1 to free
>> >>>>on every failure return.
>> >>>>
>> >>>>Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>>
>> >>> Ok.  But I wonder what the reason was for this limit?  Dependence
>> >>analysis was greatly simplified, being no longer quadratic here.  Can
>> >>you check where the limit originally came from?  But indeed splitting
>> >>the region makes more sense then, at dataref group boundaries I'd
>> >>say.
>> >>
>> >>Yeah, looks like it was the complexity of dependence analysis:
>> >>
>> >>  https://gcc.gnu.org/ml/gcc-patches/2009-05/msg01303.html
>> >
>> > OK.  We no longer compute dependences between all memory refs but only
>> > verify we can do those code motions we need to do.  That's of course
>> > much harder to check a bound on upfront (it's still quadratic in the
>> > worst case).  I'm also not sure this is ever a problem, but we might
>> > instead count the number of stmts involving memory?
>>
>> OK, here's what I tried.  It exposes quadratic behaviour for
>> gcc.dg/pr87985.c on x86_64 though.  To make things worse, this is
>> all in the name of "vectorising" using V1DI. :-)
>>
>> In that test we have N data references in which the even elements appear
>> to be vectorisable but the odd elements don't.  First time round, we hit:
>>
>>   /* For basic block SLP, try to break the group up into multiples of the
>>      vector size.  */
>>   unsigned HOST_WIDE_INT const_nunits;
>>   if (is_a <bb_vec_info> (vinfo)
>>       && STMT_VINFO_GROUPED_ACCESS (stmt_info)
>>       && DR_GROUP_FIRST_ELEMENT (stmt_info)
>>       && nunits.is_constant (&const_nunits))
>>
>> This breaks off the leading even element and recurses, discards the
>> leading odd element, and then recurses on the rest.  We then come round
>> to the same code when recursing on the rest.
>>
>> It looks like matches is valid for all elements, so I guess we should
>> divide the group based on all false elements, not just the first?
>
> What do you mean by "valid for all elements"?  That matches[i] is
> true or false for every element?  If it is true for all elements we
> shouldn't split at all.

Initially I'd wondered whether the elements were undefined after the
first "false", i.e. if we broke early at that point.  But it looks like
true and false elements after the first false are still meaningful.

> But indeed the recursion is bad if we split out one vector per step;
> we should split the whole group in "half" instead.
>
> Still the code should ensure we have a "half" that works OK
> and one that possibly doesn't.
>
> OK, I see we have {true, true, false <repeats ... times> } for
> matches but the 2nd iteration gets {true, false, ... } and doesn't
> split for me.  Of course if you have V1DI then you'll notice
> that matches[0] is _always_ true ... I guess we should simply
> never try splitting for const_nunits == 1?  Or even never
> do BB vectorization with such a vector type.

I don't think it's specific to const_nunits == 1.  E.g. if we have
matches == { true, true, false, true } repeating for const_nunits == 2,
we'll have the same problem of trying:

  N
  2, N-4
     2, N-8
        2, N-12
           2, N-16
              ....

which is still quadratic.
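Just to illustrate (a toy cost model, not the real tree-vect-slp.cc code):
if each recursive call pays a cost proportional to the remaining group size
and then peels off a constant four elements, as in the sequence above, the
total work grows quadratically:

```cpp
#include <cstddef>

/* Toy cost model, not GCC code: each call re-analyses the remaining
   group (cost == its length), then peels off a fixed four elements
   (one good half-vector plus the discarded ones, as in the repeating
   { true, true, false, true } pattern) and continues on the rest.  */
static size_t
step_split_cost (size_t group_size)
{
  size_t cost = 0;
  while (group_size > 4)
    {
      cost += group_size;   /* Re-analyse the current group.  */
      group_size -= 4;      /* N -> N-4 -> N-8 -> ...  */
    }
  return cost;
}
```

Doubling the group size roughly quadruples the total cost, which is the
quadratic behaviour seen on gcc.dg/pr87985.c.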

Dividing genuinely into half would fix that, so would dividing the
whole group based on the group1_size.  Or we could try using the
whole matches array to divide the group up into chunks that are
worth recursing on, with each chunk having at most half the original
group size and with the matches elements for each chunk being all-true
or all-false.
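A sketch of the divide-in-half alternative (again only a toy model, under
the same assumed cost function): even when both halves need further
splitting, halving bounds the recursion depth at log2(N), so the
worst-case work drops from O(N^2) to O(N log N):

```cpp
#include <cstddef>

/* Toy model of splitting the whole group in half each time:
   analysing a group costs its length, and in the worst case both
   halves need further splitting.  */
static size_t
half_split_cost (size_t group_size)
{
  if (group_size <= 2)
    return group_size;
  size_t half = group_size / 2;
  return group_size + half_split_cost (half)
         + half_split_cost (group_size - half);
}
```

For a power-of-two group of N elements this works out to N * log2(N)
units of work (10240 for N == 1024), versus a quadratic number for the
one-vector-at-a-time scheme.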

Thanks,
Richard
