...
>
> >
> > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> > the form:
> >
> >   VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
> >
> > which would be exactly equal to
> >
> >   (vec_select
> >     (vec_concat vlow vhigh)
> >     vperm)
> >
> > at the rtl level. I.e. vperm is an integral vector of the same number
> > of elements as vlow.
> >
> > Truly variable permutation is something that's only supported by ppc and
> > spu.
>
> Also Altivec and SPU support byte permutation (and not only element
> permutation), however, the vectorizer does not make use of this at present.
>

Yes. I was trying to think whether it would be useful to express
byte-permutations instead of element-permutations, but the only two useful
cases that came to mind are things we have covered by other, probably more
appropriate, idioms. [One is realignment (for which we use the
builtin_mask_for_load + REALIGN_LOAD). The other is the VEC_PACK_TRUNC idiom
(where the number of elements in 'vperm' would be twice the number of
elements in 'vlow'), but other VEC_PACK variants are a little more than just
a special case of permute.] So (unless we want VEC_PERMUTE to cover these
cases, which I think we don't), element-wise permutations should suffice, so
this sounds like a good suggestion to me.

> > Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> > elements can be rearranged but only within a 128-bit subvector.
> > So if you're working with 128-bit vectors, it's fully variable, but if
> > you're working with 256-bit vectors, it's like doing 2 128-bit permute
> > operations in parallel. Intel before AVX has no variable permute.
> >
> > HOWEVER! Most of the useful permutations that I can think of for the
> > optimizers to generate are actually constant. And these can be
> > implemented everywhere (with varying degrees of efficiency).
> >

That's true for the moment, but there are cases where a variable permute
would be useful for vectorization, e.g. where vectors are used as a lookup
table. One example I know of is finding delimiters (e.g. for XML
processing) - a lookup table of 256 bits holds one bit per ASCII character
to indicate whether a character is a delimiter or not, and the scalar code
looks something like this:

  table[256]={1,0,0,....};
  for (i...)
    if (table[data[i]] == 1)
      {found delimiter}

...and this is vectorized with 2 vector registers that hold the lookup table
and a shift on the input data vector to create the permutation mask used to
access the table. I think there should be other examples of lookup tables
like that being used for vectorization.
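
To make the lookup-table case concrete, here is a minimal scalar sketch in C of what an element-wise two-input permute would compute, and how the delimiter test maps onto it. Everything named here (vec_permute_model, pack_table, find_delims_model, the 16 byte lanes, the bit layout of the table, and the modulo treatment of out-of-range indices) is an assumption made for illustration; it is not GCC code and not any target's intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* assumed: 128-bit registers holding 16 byte elements */

/* Scalar model of VEC_PERMUTE_EXPR (vlow, vhigh, vperm):
   out[i] = concat (vlow, vhigh)[vperm[i]].
   Indices are reduced modulo 2*LANES; that is an assumption of this
   sketch, not something the proposal specifies.  */
static void
vec_permute_model (const uint8_t *vlow, const uint8_t *vhigh,
                   const uint8_t *vperm, uint8_t *out)
{
  for (size_t i = 0; i < LANES; i++)
    {
      unsigned idx = vperm[i] % (2 * LANES);
      out[i] = idx < LANES ? vlow[idx] : vhigh[idx - LANES];
    }
}

/* Pack a 256-entry 0/1 table into 256 bits spread over two 16-byte
   "registers": character c lives in byte c >> 3, bit c & 7.  */
static void
pack_table (const uint8_t table[256], uint8_t *vlow, uint8_t *vhigh)
{
  for (size_t i = 0; i < LANES; i++)
    vlow[i] = vhigh[i] = 0;
  for (unsigned c = 0; c < 256; c++)
    if (table[c])
      {
        uint8_t *half = (c >> 3) < LANES ? vlow : vhigh;
        half[(c >> 3) % LANES] |= (uint8_t) (1u << (c & 7));
      }
}

/* Classify 16 input bytes at once: flags[i] is 1 if data[i] is a
   delimiter.  The shift on the input builds the permutation mask, the
   permute fetches the table byte, and a second shift picks the bit.  */
static void
find_delims_model (const uint8_t *vlow, const uint8_t *vhigh,
                   const uint8_t *data, uint8_t *flags)
{
  uint8_t vperm[LANES], bytes[LANES];

  for (size_t i = 0; i < LANES; i++)
    vperm[i] = data[i] >> 3;          /* byte index 0..31 into the table */
  vec_permute_model (vlow, vhigh, vperm, bytes);
  for (size_t i = 0; i < LANES; i++)
    flags[i] = (bytes[i] >> (data[i] & 7)) & 1;
}
```

The point of the sketch is that vperm is computed from the data at run time, so a constant-mask-only permute could not express this loop.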

I also saw variable permutes used for sorting
(http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf).
Indeed there are some serious challenges to overcome in order to do all that
automatically in the compiler... but some pattern-matching based
vectorization approach could conceptually do this. Also, if one day someone
were to introduce platform-independent vector intrinsics, then such a
generic permute would let programmers take advantage of it, even in cases
that would otherwise be too complicated for the compiler to auto-vectorize.

So I think it would be nice to allow the more general form, but since it
will probably take a while before we actually make use of it, it's probably
not critical for the short term...

> > Anyway, I'm thinking that it might be better to add such a general
> > operation instead of continuing to add things like
> >
> >   VEC_EXTRACT_EVEN_EXPR,
> >   VEC_EXTRACT_ODD_EXPR,
> >   VEC_INTERLEAVE_HIGH_EXPR,
> >   VEC_INTERLEAVE_LOW_EXPR,
> >
> > and other obvious patterns like broadcast, duplicate even to odd,
> > duplicate odd to even, etc.
>

agreed

> If the back end will be able to identify specific masks, e.g., {0,2,4,6} as
> extract even operation, then we can certainly remove those codes.
>

agreed

dorit

> >
> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern. For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask. This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
> >
> > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> > uses of builtin_vec_perm within the vectorizer at present?
>
> Yes.
>
> Ira
>
> >
> >
> > r~
>