Andrew Stubbs <a...@codesourcery.com> writes:
> On 17/09/18 20:28, Richard Sandiford wrote:
>>> This patch simply disables the cache so that it must ask the backend
>>> for the preferred mode for every type.
>>
>> TBH I'm surprised this works. Obviously it does, otherwise you wouldn't
>> have posted it, but it seems like an accident. Various parts of the
>> vectoriser query current_vector_size and expect it to be stable for
>> the current choice of vector size.
>
> Indeed, this is why this remains only a half-baked patch: I wasn't
> confident it was the correct or whole solution.
>
> It works in so much as it fixes the immediate problem that I saw -- "no
> vector type" -- and makes a bunch of vect.exp testcases happy.
>
> It's quite possible that something else is unhappy with this.
>
>> The underlying problem also affects (at least) base AArch64, SVE and
>> x86_64. We try to choose vector types on the fly based only on the
>> type of a given scalar value, but in reality, the type we want for a
>> 32-bit element (say) often depends on whether the vectorisation region
>> also has smaller or larger elements. And in general we only know that
>> after vect_mark_stmts_to_be_vectorized, but we want to know the vector
>> types earlier, such as in pattern recognition and while building SLP
>> trees. It's a bit of a chicken-and-egg problem...
>
> I don't understand why the number of bits in a vector is the key
> information here?

Arguably it shouldn't be, and it's really just a proxy for the vector
(sub)architecture. But this is "should be" vs. "is" :-)

> It would make sense if you were to say that the number of elements has
> to be fixed in a given region, because obviously that's tied to loop
> strides and such, but why the size?
>
> It seems like there is an architecture where you don't want to mix
> instruction types (SSE vs. AVX?) and that makes sense for that
> architecture, but if that's the case then we need to be able to turn it
> off for other architectures.

It's not about trying to avoid mixing vector sizes: from what Jakub
said earlier in the year, even x86 wants to do that (but can't yet).
The idea is instead to try the available possibilities. E.g. for
AArch64 we want to try SVE, 128-bit Advanced SIMD and 64-bit Advanced
SIMD.

With something like:

    int *ip;
    short *sp;
    for (int i = 0; i < n; ++i)
      ip[i] = sp[i];

there are three valid choices for Advanced SIMD:

(1) use 1 128-bit vector of sp and 2 128-bit vectors of ip
(2) use 1 64-bit vector of sp and 2 64-bit vectors of ip
(3) use 1 64-bit vector of sp and 1 128-bit vector of ip

At the moment we only try (1) and (2), but in practice, (3) should be
better than (2) in most cases. I guess in some ways trying all three
would be best, but if we only try two, trying (1) and (3) is better
than trying (1) and (2).
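To make the difference concrete, here is a rough hand-written sketch
of what (1) and (3) could look like in terms of Advanced SIMD
intrinsics. It's only an illustration, not what the vectoriser
actually emits: the function names are made up, the loop tail is
ignored, and n is assumed to be a multiple of the step.

    #include <arm_neon.h>

    /* Choice (1): 1 128-bit vector of sp and 2 128-bit vectors of ip
       per iteration.  Assumes n is a multiple of 8.  */
    void
    copy_choice_1 (int *ip, short *sp, int n)
    {
      for (int i = 0; i < n; i += 8)
        {
          int16x8_t s = vld1q_s16 (sp + i);   /* 8 shorts, 128 bits.  */
          /* Unpack into two 128-bit vectors of ints and store both.  */
          vst1q_s32 (ip + i, vmovl_s16 (vget_low_s16 (s)));
          vst1q_s32 (ip + i + 4, vmovl_s16 (vget_high_s16 (s)));
        }
    }

    /* Choice (3): 1 64-bit vector of sp and 1 128-bit vector of ip
       per iteration.  Assumes n is a multiple of 4.  */
    void
    copy_choice_3 (int *ip, short *sp, int n)
    {
      for (int i = 0; i < n; i += 4)
        {
          int16x4_t s = vld1_s16 (sp + i);    /* 4 shorts, 64 bits.  */
          vst1q_s32 (ip + i, vmovl_s16 (s));  /* Widen to 4 ints.  */
        }
    }

In both cases the ip stores use full 128-bit vectors; the difference
is whether sp is loaded 128 bits at a time and then unpacked, or
64 bits at a time and widened directly.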
For:

    for (int i = 0; i < n; ++i)
      ip[i] += 1;

there are two valid choices for Advanced SIMD:

(4) use 1 128-bit vector of ip
(5) use 1 64-bit vector of ip

The problem for the current autovec set-up is that the ip type for
64-bit Advanced SIMD varies between (3) and (5): for (3) it's a
128-bit vector type and for (5) it's a 64-bit vector type. So the
type we want for a given vector subarchitecture is partly determined
by the other types in the region: it isn't simply a function of the
subarchitecture and the element type. This is why the current autovec
code only supports (1), (2), (4) and (5). And I think this is
essentially the same limitation that you're hitting.

> For GCN, vectors are fully maskable, so we almost want such
> considerations to be completely ignored. We basically want it to act
> like it can have any size vector it likes, up to 64 elements.

SVE is similar. But even for SVE there's an equivalent trade-off
between (1) and (3):

(1') use 1 fully-populated vector for sp and 2 fully-populated
     vectors for ip
(3') use 1 half-populated vector for sp and 1 fully-populated
     vector for ip

Which is best for more complicated examples depends on the balance
between ip-based work and sp-based work. The packing and unpacking in
(1') has a cost, but it would pay off if there was much more sp work
than ip work, since in that case (3') would spend most of its time
operating on partially-populated vectors.

Would the same be useful for GCN, or do you basically always want a
VF of 64?

None of this is a fundamental restriction in theory. It's just
something that needs to be fixed. One approach would be to get the
loop vectoriser to iterate over the number of lanes the target
supports instead of all possible vector sizes. The problem is that on
its own this would mean trying 4 lane counts even on targets with a
single supported vector size. So we'd need to do something a bit
smarter...

Richard