On June 19, 2019 2:46:15 PM GMT+02:00, Richard Sandiford 
<richard.sandif...@arm.com> wrote:
>Richard Biener <rguent...@suse.de> writes:
>> On June 19, 2019 11:05:42 AM GMT+02:00, Richard Sandiford
><richard.sandif...@arm.com> wrote:
>>>Richard Biener <rguent...@suse.de> writes:
>>>> On June 19, 2019 10:55:16 AM GMT+02:00, Jakub Jelinek
>>><ja...@redhat.com> wrote:
>>>>>Hi!
>>>>>
>>>>>When VEC_[LR]SHIFT_EXPR has been replaced with VEC_PERM_EXPR,
>>>>>vec_shl_optab
>>>>>has been removed as unused, because we only used vec_shr_optab for
>>>the
>>>>>reductions.
>>>>>Without this patch the vect-simd-*.c tests can be vectorized just
>>>fine
>>>>>for SSE4 and above, but can't be with SSE2.  As the comment in
>>>>>tree-vect-stmts.c tries to explain, for the inclusive scan
>operation
>>>we
>>>>>want (when using V8SImode vectors):
>>>>>       _30 = MEM <vector(8) int> [(int *)&D.2043];
>>>>>       _31 = MEM <vector(8) int> [(int *)&D.2042];
>>>>>       _32 = VEC_PERM_EXPR <_31, _40, { 8, 0, 1, 2, 3, 4, 5, 6 }>;
>>>>>       _33 = _31 + _32;
>>>>> // _33 = { _31[0], _31[0]+_31[1], _31[1]+_31[2], ...,
>_31[6]+_31[7]
>>>};
>>>>>       _34 = VEC_PERM_EXPR <_33, _40, { 8, 9, 0, 1, 2, 3, 4, 5 }>;
>>>>>       _35 = _33 + _34;
>>>>>    // _35 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2],
>>>_31[0]+.._31[3],
>>>>>       //         _31[1]+.._31[4], ... _31[4]+.._31[7] };
>>>>>       _36 = VEC_PERM_EXPR <_35, _40, { 8, 9, 10, 11, 0, 1, 2, 3
>}>;
>>>>>       _37 = _35 + _36;
>>>>>    // _37 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2],
>>>_31[0]+.._31[3],
>>>>>       //         _31[0]+.._31[4], ... _31[0]+.._31[7] };
>>>>>       _38 = _30 + _37;
>>>>>       _39 = VEC_PERM_EXPR <_38, _38, { 7, 7, 7, 7, 7, 7, 7, 7 }>;
>>>>>       MEM <vector(8) int> [(int *)&D.2043] = _39;
>>>>>       MEM <vector(8) int> [(int *)&D.2042] = _38;  */
>>>>>For V4SImode vectors that would be VEC_PERM_EXPR <x, init, { 4, 0,
>1,
>>>2
>>>>>}>,
>>>>>VEC_PERM_EXPR <x2, init, { 4, 5, 0, 1 }> and
>>>>>VEC_PERM_EXPR <x3, init, { 3, 3, 3, 3 }> etc.
>>>>>Unfortunately, SSE2 can't do the VEC_PERM_EXPR <x, init, { 4, 0, 1,
>2
>>>>>}>
>>>>>permutation (the other two it can do).  Well, to be precise, it can
>>>do
>>>>>it
>>>>>using the vector left shift which has been removed as unused,
>>>provided
>>>>>that init is initializer_zerop (shifting all zeros from the left).
>>>>>init usually is all zeros, that is the neutral element of additive
>>>>>reductions and couple of others too, in the unlikely case that some
>>>>>other
>>>>>reduction is used with scan (multiplication, minimum, maximum,
>>>bitwise
>>>>>and),
>>>>>we can use a VEC_COND_EXPR with constant first argument, i.e. a
>blend
>>>>>or
>>>>>and/or.
>>>>>
>>>>>So, this patch reintroduces vec_shl_optab (most backends actually
>>>have
>>>>>those
>>>>>patterns already) and handles its expansion and vector generic
>>>lowering
>>>>>similarly to vec_shr_optab - i.e. it is a VEC_PERM_EXPR where the
>>>first
>>>>>operand is initializer_zerop and third operand starts with a few
>>>>>numbers
>>>>>smaller than number of elements (doesn't matter which one, as all
>>>>>elements
>>>>>are same - zero) followed by nelts, nelts+1, nelts+2, ...
>>>>>Unlike vec_shr_optab which has zero as the second operand, this one
>>>has
>>>>>it
>>>>>as first operand, because VEC_PERM_EXPR canonicalization wants to
>>>have
>>>>>first element selector smaller than number of elements.  And unlike
>>>>>vec_shr_optab, where we also have a fallback in
>>>have_whole_vector_shift
>>>>>using normal permutations, this one doesn't need it, that
>"fallback"
>>>is
>>>>>tried
>>>>>first before vec_shl_optab.
>>>>>
>>>>>For the vec_shl_optab checks, it tests only for constant number of
>>>>>elements
>>>>>vectors, not really sure if our VECTOR_CST encoding can express the
>>>>>left
>>>>>shifts in any way nor whether SVE supports those (I see aarch64 has
>>>>>vec_shl_insert but that is just a fixed shift by element bits and
>>>>>shifts in
>>>>>a scalar rather than zeros).
>>>>>
>>>>>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for
>trunk?
>>>>
>>>> Ok. 
>>>
>>>I think it would be worth instead telling evpc that the second
>permute
>>>vector is zero.  Permutes with a second vector of zero are somewhat
>>>special for SVE too, and could be in other cases for other targets.
>>>
>>>I thought the direction of travel was not to have optabs for specific
>>>kinds of permute any more.  E.g. zip, unzip and blend are common
>>>permutes too, but we decided not to commonise those.
>>
>> The issue is that vec_perm_const_ok doesn't have enough info here
>> (second operand is all zeros). So the optab is needed to query target
>> capabilities, not so much for the actual expansion which could go via
>> vec_perm_const.
>
>Not at the moment, sure.  But my point was that we could add that
>information instead of doing what the patch does, since the information
>could be useful in other cases too.
>
>These days all permute queries go through the same target hook as the
>expansion: targetm.vec_perm_const,  So the target interface itself
>should adapt naturally.
>
>For can_vec_perm_const_p we could either add zeroness information
>to vec_perm_indices or provide it separately (e.g. with tree inputs).
>In the latter case the zeroness could be relayed to
>targetm.vec_perm_const by passing zero rtxes even for queries.
>
>> I thought of having special permute vector entries to denote zero or
>> don't - care which would make it possible to Have this info and allow
>> these permutes to be single vector permutes. But then encoding this
>> might be awkward.
>
>ISTM similar to the way that we already pass down whether the two
>vector
>inputs are equal.
>
>The patch is adding code to places that would not be patched in the
>same
>way if we already passed down information about zeroness.  So it feels
>like we're adding back code that we already want to take out again at
>some point, unless we change policy and allow specific optabs for
>common permutes.  (I'd be fine with that FWIW.)

I don't care too much about this implementation detail but in the end I'd like 
to have better costing for permutes where I think code would be simpler if we 
just have a single interface to the target. 
IIRC we kept the right shift variant only because the vectorizer relied on it 
and we were too lazy to adjust the targets.

Richard. 

>Thanks,
>Richard

Reply via email to