Andrew Stubbs <a...@codesourcery.com> writes:
> On 17/09/18 20:28, Richard Sandiford wrote:
>>> This patch simply disables the cache so that it must ask the backend
>>> for the preferred mode for every type.
>>
>> TBH I'm surprised this works. Obviously it does, otherwise you wouldn't
>> have posted it, but it seems like an accident. Various parts of the
>> vectoriser query current_vector_size and expect it to be stable for
>> the current choice of vector size.
>
> Indeed, this is why this remains only a half-baked patch: I wasn't
> confident it was the correct or whole solution.
>
> It works in so much as it fixes the immediate problem that I saw -- "no
> vector type" -- and makes a bunch of vect.exp testcases happy.
>
> It's quite possible that something else is unhappy with this.
>
>> The underlying problem also affects (at least) base AArch64, SVE and
>> x86_64. We try to choose vector types on the fly based only on the
>> type of a given scalar value, but in reality, the type we want for a
>> 32-bit element (say) often depends on whether the vectorisation region
>> also has smaller or larger elements. And in general we only know that
>> after vect_mark_stmts_to_be_vectorized, but we want to know the vector
>> types earlier, such as in pattern recognition and while building SLP
>> trees. It's a bit of a chicken-and-egg problem...
>
> I don't understand why the number of bits in a vector is the key
> information here?

Arguably it shouldn't be, and it's really just a proxy for the vector
(sub)architecture. But this is "should be" vs. "is" :-)

> It would make sense if you were to say that the number of elements has
> to be fixed in a given region, because obviously that's tied to loop
> strides and such, but why the size?
>
> It seems like there is an architecture where you don't want to mix
> instruction types (SSE vs. AVX?) and that makes sense for that
> architecture, but if that's the case then we need to be able to turn it
> off for other architectures.

It's not about trying to avoid mixing vector sizes: from what Jakub
said earlier in the year, even x86 wants to do that (but can't yet).
The idea is instead to try the available possibilities. E.g. for
AArch64 we want to try SVE, 128-bit Advanced SIMD and 64-bit Advanced
SIMD.

With something like:

    int *ip;
    short *sp;
    for (int i = 0; i < n; ++i)
      ip[i] = sp[i];

there are three valid choices for Advanced SIMD:

(1) use 1 128-bit vector of sp and 2 128-bit vectors of ip
(2) use 1 64-bit vector of sp and 2 64-bit vectors of ip
(3) use 1 64-bit vector of sp and 1 128-bit vector of ip

At the moment we only try (1) and (2), but in practice, (3) should be
better than (2) in most cases. I guess in some ways trying all three
would be best, but if we only try two, trying (1) and (3) is better
than trying (1) and (2).
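To make the difference concrete, here is a rough hand-written sketch
of what (1) and (3) could look like in terms of Advanced SIMD
intrinsics. It's only an illustration, not what the vectoriser
actually emits: the function names are made up, the loop tail is
ignored, and n is assumed to be a multiple of the step.

    #include <arm_neon.h>

    /* Choice (1): 1 128-bit vector of sp and 2 128-bit vectors of ip
       per iteration.  Assumes n is a multiple of 8.  */
    void
    copy_choice_1 (int *ip, short *sp, int n)
    {
      for (int i = 0; i < n; i += 8)
        {
          int16x8_t s = vld1q_s16 (sp + i);   /* 8 shorts, 128 bits.  */
          /* Unpack into two 128-bit vectors of ints and store both.  */
          vst1q_s32 (ip + i, vmovl_s16 (vget_low_s16 (s)));
          vst1q_s32 (ip + i + 4, vmovl_s16 (vget_high_s16 (s)));
        }
    }

    /* Choice (3): 1 64-bit vector of sp and 1 128-bit vector of ip
       per iteration.  Assumes n is a multiple of 4.  */
    void
    copy_choice_3 (int *ip, short *sp, int n)
    {
      for (int i = 0; i < n; i += 4)
        {
          int16x4_t s = vld1_s16 (sp + i);    /* 4 shorts, 64 bits.  */
          vst1q_s32 (ip + i, vmovl_s16 (s));  /* Widen to 4 ints.  */
        }
    }

In both cases the ip stores use full 128-bit vectors; the difference
is whether sp is loaded 128 bits at a time and then unpacked, or
64 bits at a time and widened directly.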
For:

    for (int i = 0; i < n; ++i)
      ip[i] += 1;

there are two valid choices for Advanced SIMD:

(4) use 1 128-bit vector of ip
(5) use 1 64-bit vector of ip

The problem for the current autovec set-up is that the ip type for
64-bit Advanced SIMD varies between (3) and (5): for (3) it's a
128-bit vector type and for (5) it's a 64-bit vector type. So the
type we want for a given vector subarchitecture is partly determined
by the other types in the region: it isn't simply a function of the
subarchitecture and the element type. This is why the current autovec
code only supports (1), (2), (4) and (5). And I think this is
essentially the same limitation that you're hitting.

> For GCN, vectors are fully maskable, so we almost want such
> considerations to be completely ignored. We basically want it to act
> like it can have any size vector it likes, up to 64 elements.

SVE is similar. But even for SVE there's an equivalent trade-off
between (1) and (3):

(1') use 1 fully-populated vector for sp and 2 fully-populated
     vectors for ip
(3') use 1 half-populated vector for sp and 1 fully-populated
     vector for ip

Which is best for more complicated examples depends on the balance
between ip-based work and sp-based work. The packing and unpacking in
(1') has a cost, but it would pay off if there was much more sp work
than ip work, since in that case (3') would spend most of its time
operating on partially-populated vectors.

Would the same be useful for GCN, or do you basically always want a
VF of 64?

None of this is a fundamental restriction in theory. It's just
something that needs to be fixed. One approach would be to get the
loop vectoriser to iterate over the number of lanes the target
supports instead of all possible vector sizes. The problem is that on
its own this would mean trying 4 lane counts even on targets with a
single supported vector size. So we'd need to do something a bit
smarter...

Richard