On June 19, 2019 2:46:15 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>Richard Biener <rguent...@suse.de> writes:
>> On June 19, 2019 11:05:42 AM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>Richard Biener <rguent...@suse.de> writes:
>>>> On June 19, 2019 10:55:16 AM GMT+02:00, Jakub Jelinek <ja...@redhat.com> wrote:
>>>>>Hi!
>>>>>
>>>>>When VEC_[LR]SHIFT_EXPR was replaced with VEC_PERM_EXPR, vec_shl_optab
>>>>>was removed as unused, because we only used vec_shr_optab for the
>>>>>reductions.
>>>>>Without this patch the vect-simd-*.c tests can be vectorized just fine
>>>>>for SSE4 and above, but can't be with SSE2.  As the comment in
>>>>>tree-vect-stmts.c tries to explain, for the inclusive scan operation we
>>>>>want (when using V8SImode vectors):
>>>>>   _30 = MEM <vector(8) int> [(int *)&D.2043];
>>>>>   _31 = MEM <vector(8) int> [(int *)&D.2042];
>>>>>   _32 = VEC_PERM_EXPR <_31, _40, { 8, 0, 1, 2, 3, 4, 5, 6 }>;
>>>>>   _33 = _31 + _32;
>>>>>   // _33 = { _31[0], _31[0]+_31[1], _31[1]+_31[2], ..., _31[6]+_31[7] };
>>>>>   _34 = VEC_PERM_EXPR <_33, _40, { 8, 9, 0, 1, 2, 3, 4, 5 }>;
>>>>>   _35 = _33 + _34;
>>>>>   // _35 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
>>>>>   //         _31[1]+.._31[4], ... _31[4]+.._31[7] };
>>>>>   _36 = VEC_PERM_EXPR <_35, _40, { 8, 9, 10, 11, 0, 1, 2, 3 }>;
>>>>>   _37 = _35 + _36;
>>>>>   // _37 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
>>>>>   //         _31[0]+.._31[4], ... _31[0]+.._31[7] };
>>>>>   _38 = _30 + _37;
>>>>>   _39 = VEC_PERM_EXPR <_38, _38, { 7, 7, 7, 7, 7, 7, 7, 7 }>;
>>>>>   MEM <vector(8) int> [(int *)&D.2043] = _39;
>>>>>   MEM <vector(8) int> [(int *)&D.2042] = _38;  */
>>>>>For V4SImode vectors that would be VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>,
>>>>>VEC_PERM_EXPR <x2, init, { 4, 5, 0, 1 }> and
>>>>>VEC_PERM_EXPR <x3, init, { 3, 3, 3, 3 }> etc.
>>>>>Unfortunately, SSE2 can't do the VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>
>>>>>permutation (the other two it can do).  Well, to be precise, it can do
>>>>>it using the vector left shift that has been removed as unused, provided
>>>>>that init is initializer_zerop (shifting all zeros in from the left).
>>>>>init usually is all zeros, as that is the neutral element of additive
>>>>>reductions and a couple of others too; in the unlikely case that some
>>>>>other reduction is used with scan (multiplication, minimum, maximum,
>>>>>bitwise and), we can use a VEC_COND_EXPR with constant first argument,
>>>>>i.e. a blend or and/or.
>>>>>
>>>>>So, this patch reintroduces vec_shl_optab (most backends actually have
>>>>>those patterns already) and handles its expansion and vector generic
>>>>>lowering similarly to vec_shr_optab - i.e. it is a VEC_PERM_EXPR whose
>>>>>first operand is initializer_zerop and whose third operand starts with
>>>>>a few numbers smaller than the number of elements (it doesn't matter
>>>>>which ones, as all those elements are the same - zero), followed by
>>>>>nelts, nelts+1, nelts+2, ...
>>>>>Unlike vec_shr_optab, which has zero as the second operand, this one
>>>>>has it as the first operand, because VEC_PERM_EXPR canonicalization
>>>>>wants the first selector element to be smaller than the number of
>>>>>elements.  And unlike vec_shr_optab, where we also have a fallback in
>>>>>have_whole_vector_shift using normal permutations, this one doesn't
>>>>>need it; that "fallback" is tried first, before vec_shl_optab.
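To make the shape of the sequence above concrete, here is a minimal scalar model of the log-step inclusive scan (an illustrative sketch, not code from the patch; shift_in_zeros is a hypothetical helper modelling the VEC_PERM_EXPR with an all-zero operand):

  /* Scalar model of the log-step inclusive scan, assuming an additive
     reduction whose neutral element is 0.  Each whole-vector shift left
     moves elements up by OFF lanes and fills the vacated low lanes with
     zeros - which is why the shifted-in value must be the neutral
     element of the reduction.  */

  #include <stdio.h>

  #define NELTS 8

  static void
  shift_in_zeros (const int *src, int *dst, int off)
  {
    /* Models VEC_PERM_EXPR <zero, src, { z.., nelts, nelts+1, ... }>:
       lane I reads src[I - OFF], or 0 when that would underflow.  */
    for (int i = 0; i < NELTS; i++)
      dst[i] = i < off ? 0 : src[i - off];
  }

  int
  main (void)
  {
    int v[NELTS] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    /* Three log-step rounds for 8 lanes: shift by 1, 2 and 4.  */
    for (int off = 1; off < NELTS; off *= 2)
      {
        int t[NELTS];
        shift_in_zeros (v, t, off);
        for (int i = 0; i < NELTS; i++)
          v[i] += t[i];
      }

    /* v is now the inclusive prefix sum { 1, 3, 6, 10, 15, 21, 28, 36 }.  */
    for (int i = 0; i < NELTS; i++)
      printf ("%d ", v[i]);
    printf ("\n");
    return 0;
  }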
>>>>>
>>>>>For the vec_shl_optab checks, it tests only vectors with a constant
>>>>>number of elements; I'm not really sure whether our VECTOR_CST encoding
>>>>>can express the left shifts in any way, nor whether SVE supports those
>>>>>(I see aarch64 has vec_shl_insert, but that is just a fixed shift by
>>>>>one element's worth of bits, and it shifts in a scalar rather than
>>>>>zeros).
>>>>>
>>>>>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>>>>
>>>> Ok.
>>>
>>>I think it would be worth instead telling evpc that the second permute
>>>vector is zero.  Permutes with a second vector of zero are somewhat
>>>special for SVE too, and could be in other cases for other targets.
>>>
>>>I thought the direction of travel was not to have optabs for specific
>>>kinds of permute any more.  E.g. zip, unzip and blend are common
>>>permutes too, but we decided not to commonise those.
>>
>> The issue is that vec_perm_const_ok doesn't have enough info here
>> (second operand is all zeros).  So the optab is needed to query target
>> capabilities, not so much for the actual expansion, which could go via
>> vec_perm_const.
>
>Not at the moment, sure.  But my point was that we could add that
>information instead of doing what the patch does, since the information
>could be useful in other cases too.
>
>These days all permute queries go through the same target hook as the
>expansion, targetm.vec_perm_const, so the target interface itself
>should adapt naturally.
>
>For can_vec_perm_const_p we could either add zeroness information
>to vec_perm_indices or provide it separately (e.g. with tree inputs).
>In the latter case the zeroness could be relayed to
>targetm.vec_perm_const by passing zero rtxes even for queries.
>
>> I thought of having special permute vector entries to denote zero or
>> don't-care, which would make it possible to have this info and allow
>> these permutes to be single vector permutes.  But then encoding this
>> might be awkward.
>
>ISTM this is similar to the way that we already pass down whether the
>two vector inputs are equal.
>
>The patch is adding code to places that would not be patched in the
>same way if we already passed down information about zeroness.  So it
>feels like we're adding back code that we already want to take out
>again at some point, unless we change policy and allow specific optabs
>for common permutes.  (I'd be fine with that FWIW.)
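To illustrate what the selector alone can and cannot tell a target, here is a hedged sketch of recognizing the shift-left pattern Jakub describes (a hypothetical standalone helper, not GCC internals): the recognition only works because the caller separately knows that the first operand is all zeros, which is exactly the zeroness information the hook interface is missing today.

  #include <stdbool.h>
  #include <stdio.h>

  /* Return true if SEL (of length NELTS, indexing into the 2*NELTS-lane
     concatenation of the two operands) describes a whole-vector shift
     left by *SHIFT lanes, assuming the caller knows operand 0 is all
     zeros.  */
  static bool
  vec_shl_selector_p (const unsigned *sel, unsigned nelts, unsigned *shift)
  {
    unsigned i = 0;

    /* The leading lanes must come from the zero operand; which of its
       lanes they pick doesn't matter, since all of them are zero.  */
    while (i < nelts && sel[i] < nelts)
      i++;
    if (i == 0 || i == nelts)
      return false;

    /* The remaining lanes must be nelts, nelts+1, ... in order, i.e.
       the low lanes of operand 1 shifted up by I positions.  */
    for (unsigned j = i; j < nelts; j++)
      if (sel[j] != nelts + (j - i))
        return false;

    *shift = i;
    return true;
  }

  int
  main (void)
  {
    /* The V4SImode case from the patch description, canonicalized with
       the zero vector as operand 0: { z, 4, 5, 6 } with any z < 4.  */
    unsigned sel[4] = { 0, 4, 5, 6 };
    unsigned shift;
    if (vec_shl_selector_p (sel, 4, &shift))
      printf ("whole-vector shift left by %u lane(s)\n", shift);
    return 0;
  }

Without the zeroness guarantee the same selector is just an arbitrary two-operand permute, which is why passing that information down (in vec_perm_indices or via zero rtxes) would let targets make this decision inside targetm.vec_perm_const rather than via a dedicated optab.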
I don't care too much about this implementation detail, but in the end I'd
like to have better costing for permutes, where I think the code would be
simpler if we just had a single interface to the target.  IIRC we kept the
right-shift variant only because the vectorizer relied on it and we were
too lazy to adjust the targets.

Richard.

>Thanks,
>Richard