On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Richard Biener <richard.guent...@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <l...@redhat.com> wrote:
>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>> ...so that we can use them for variable-length vectors.  For now
>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>> vec_perm_const optab even for cases that the new optabs could
>>>> handle.
>>>>
>>>> The vector optabs are inconsistent about whether there should be
>>>> an underscore before the mode part of the name, but the other lo/hi
>>>> optabs have one.
>>>>
>>>> Doing this means that we're able to optimise some SLP tests using
>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>> patches.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>
>>>> Richard
>>>>
>>>>
>>>> 2017-11-09  Richard Sandiford  <richard.sandif...@linaro.org>
>>>>             Alan Hayward  <alan.hayw...@arm.com>
>>>>             David Sherwood  <david.sherw...@arm.com>
>>>>
>>>> gcc/
>>>>     * doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
>>>>     (vec_extract_even, vec_extract_odd): Document new optabs.
>>>>     * internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
>>>>     (VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
>>>>     functions.
>>>>     * optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
>>>>     (vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
>>>>     New optabs.
>>>>     * tree-vect-data-refs.c: Include internal-fn.h.
>>>>     (vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
>>>>     (vect_permute_store_chain): Use them here too.
>>>>     (vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
>>>>     (vect_permute_load_chain): Use them here too.
>>>>     * tree-vect-stmts.c (can_reverse_vector_p): New function.
>>>>     (get_negative_load_store_type): Use it.
>>>>     (reverse_vector): New function.
>>>>     (vectorizable_store, vectorizable_load): Use it.
>>>>     * config/aarch64/iterators.md (perm_optab): New iterator.
>>>>     * config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
>>>>     (vec_reverse_<mode>): Likewise.
>>>>
>>>> gcc/testsuite/
>>>>     * gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
>>>>     * gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
>>>>     * gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
>>>>     * gcc.dg/vect/pr68445.c: Likewise.
>>>>     * gcc.dg/vect/slp-12a.c: Likewise.
>>>>     * gcc.dg/vect/slp-13-big-array.c: Likewise.
>>>>     * gcc.dg/vect/slp-13.c: Likewise.
>>>>     * gcc.dg/vect/slp-14.c: Likewise.
>>>>     * gcc.dg/vect/slp-15.c: Likewise.
>>>>     * gcc.dg/vect/slp-42.c: Likewise.
>>>>     * gcc.dg/vect/slp-multitypes-2.c: Likewise.
>>>>     * gcc.dg/vect/slp-multitypes-4.c: Likewise.
>>>>     * gcc.dg/vect/slp-multitypes-5.c: Likewise.
>>>>     * gcc.dg/vect/slp-reduc-4.c: Likewise.
>>>>     * gcc.dg/vect/slp-reduc-7.c: Likewise.
>>>>     * gcc.target/aarch64/sve_vec_perm_2.c: New test.
>>>>     * gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
>>>>     * gcc.target/aarch64/sve_vec_perm_3.c: New test.
>>>>     * gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
>>>>     * gcc.target/aarch64/sve_vec_perm_4.c: New test.
>>>>     * gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.
>>> OK.
>>
>> It's really a step backwards - we had those optabs and a tree code in
>> the past, and canonicalizing things to VEC_PERM_EXPR made things simpler.
>>
>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> work?
>
> The problems with that are:
>
> - It doesn't work for vectors with 256 QImode elements, because the
>   indices wrap round.
That's a general issue that would need to be addressed for larger
vectors (GCN?).  I presume the requirement that the permutation vector
have the same element size would need to be relaxed.

> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>   special cases would be hard, especially since v256hi isn't a normal
>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>   then have to worry about that special case.

I think it's not really a special case - any code here should just
expect the same number of vector elements rather than a particular
size.  You already dealt with using a char[] vector for permutations,
I think.

> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>   I guess it's possible to represent that using a combination of
>   shifts, masks, and additions, but then:
>
>   1) when generating them, we'd need to make sure that we cost the
>      operation as a single permute, rather than costing all the
>      shifts, masks and additions
>
>   2) we'd need to make sure that all gimple optimisations that run
>      afterwards don't perturb the sequence, otherwise we'll end up
>      with something that's very expensive.
>
>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>      optimisations, and it wouldn't be trivial to add it, since we'd
>      need to re-recognise the sequence first.
>
>   4) expand would need to re-recognise the sequence and use the
>      optab anyway.

Well, the answer is of course that you just need a more powerful
VEC_SERIES_CST that can handle INTERLEAVE_HI/LO.  It seems to me that
SVE can generate such masks relatively cheaply -- do a 0, 1, 2, 3, ...
sequence and then do an INTERLEAVE_HI/LO on it.  So it makes sense
that we can directly specify it.

Suggested fix: add an "interleaved" bit to VEC_SERIES_CST.

At least I'd like to see it used for the cases it can already handle.
VEC_PERM_EXPR is supposed to be the only permutation operation; if it
cannot handle some cases, it needs to be fixed / its constraints
relaxed (like the v256qi case).

> Using an internal function seems much simpler :-)
>
> I think VEC_PERM_EXPR is useful because it represents the same
> operation as __builtin_shuffle, and we want to optimise that as best
> we can.  But these internal functions are only used by the vectoriser,
> which should always see what the final form of the permute should be.

You hope so.  We have several cases where later unrolling and
CSE/forwprop optimize permutations away.

Richard.

> Thanks,
> Richard
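
[Editor's illustration, not part of the thread or the patch: the following
standalone C sketch prints the selector patterns being discussed, for an
arbitrarily chosen selector length N.  It shows why REVERSE, EXTRACT_EVEN
and EXTRACT_ODD fit a plain linear series (base + i * step), while the
INTERLEAVE_LO pattern { 0, N/2, 1, N/2+1, ... } additionally needs a
shift, a mask and an addition per element.]

    /* Standalone illustration (not GCC code): print the permutation
       selector patterns discussed above for an example selector length N.
       REVERSE, EXTRACT_EVEN and EXTRACT_ODD are linear series;
       INTERLEAVE_LO also needs a shift, a mask and an addition.  */

    #include <stdio.h>

    #define N 8   /* Example only; for SVE this would be a runtime value.  */

    int
    main (void)
    {
      for (unsigned int i = 0; i < N; i++)
        {
          unsigned int even = 2 * i;                         /* { 0, 2, 4, ... }      */
          unsigned int odd  = 2 * i + 1;                     /* { 1, 3, 5, ... }      */
          unsigned int rev  = (N - 1) - i;                   /* { N-1, N-2, ..., 0 }  */
          unsigned int lo   = (i >> 1) + (i & 1) * (N / 2);  /* { 0, N/2, 1, N/2+1 }  */
          printf ("%u: even=%u odd=%u reverse=%u interleave_lo=%u\n",
                  i, even, odd, rev, lo);
        }
      return 0;
    }

[The even/odd/reverse rows are exactly what a { base, base + step,
base + 2*step, ... } series constant can describe directly, which is why
only the interleave patterns need something extra such as the
"interleaved" bit suggested above.]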