On Wed, Sep 2, 2020 at 11:26 AM luoxhu <luo...@linux.ibm.com> wrote:
>
> Hi,
>
> On 2020/9/1 21:07, Richard Biener wrote:
> > On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> On 2020/9/1 01:04, Segher Boessenkool wrote:
> >>> Hi!
> >>>
> >>> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
> >>>> vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the
> >>>> value to be inserted, and arg2 is the position at which to insert arg1
> >>>> into arg0.  This patch adds
> >>>> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert so
> >>>> that it is not expanded too early in the gimple stage when arg2 is a
> >>>> variable, to avoid generating store-hit-load instructions.
> >>>>
> >>>> For Power9 V4SI:
> >>>>       addi 9,1,-16
> >>>>       rldic 6,6,2,60
> >>>>       stxv 34,-16(1)
> >>>>       stwx 5,9,6
> >>>>       lxv 34,-16(1)
> >>>> =>
> >>>>       addis 9,2,.LC0@toc@ha
> >>>>       addi 9,9,.LC0@toc@l
> >>>>       mtvsrwz 33,5
> >>>>       lxv 32,0(9)
> >>>>       sradi 9,6,2
> >>>>       addze 9,9
> >>>>       sldi 9,9,2
> >>>>       subf 9,9,6
> >>>>       subfic 9,9,3
> >>>>       sldi 9,9,2
> >>>>       subfic 9,9,20
> >>>>       lvsl 13,0,9
> >>>>       xxperm 33,33,45
> >>>>       xxperm 32,32,45
> >>>>       xxsel 34,34,33,32
> >>>
> >>> For v a V4SI, x a SI, j some int,  what do we generate for
> >>>     v[j&3] = x;
> >>> ?
> >>> This should be exactly the same as we generate for
> >>>     vec_insert(x, v, j);
> >>> (the builtin does a modulo 4 automatically).
> >>
> >> No, even with my patch, "stxv 34,-16(1); stwx 5,9,6; lxv 34,-16(1)"
> >> is currently generated.
> >> Is it feasible and acceptable to expand some kind of pattern directly
> >> in the expander, without the builtin transition?
> >>
> >> I borrowed some of the implementation from vec_extract.  For
> >> vec_extract, the issue also exists:
> >>
> >>    source                      gimple                                   expand                                           asm
> >> 1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);  =>  {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}  =>  slwi 9,6,2; vextuwrx 3,9,2
> >> 2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);  =>  {r120:SI=vec_select(r118:V4SI,parallel)...}  =>  li 9,12; vextuwrx 3,9,2
> >> 3) i = v[n%4];              =>  _1 = n & 3; i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>  ...  =>  stxv 34,-16(1); addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
> >> 4) i = v[3];                =>  i = BIT_FIELD_REF <v, 32, 96>;      =>  {r120:SI=vec_select(r118:V4SI,parallel)...}  =>  li 9,12; vextuwrx 3,9,2
> >
> > Why are 1) and 2) handled differently than 3)/4)?
>
> 1) and 2) call the builtin function vec_extract, which is defined as
> __builtin_vec_extract and resolved to ALTIVEC_BUILTIN_VEC_EXTRACT by
> resolve_overloaded_builtin, generating a call to __builtin_vec_ext_v4si
> that is expanded only at RTL.
> 3) accesses variable v as an array type via VIEW_CONVERT_EXPR; I guess
> we should also generate a builtin call instead of calling
> convert_vector_to_array_for_subscript to generate a VIEW_CONVERT_EXPR
> expression for this kind of usage.
> 4) is translated to a BIT_FIELD_REF with constant bitstart and bitsize;
> variable v could also be accessed via a register instead of the stack,
> so optabs could match rs6000_expand_vector_insert to generate the
> expected instruction through extract_bit_field.
>
> >
> >> Case 3) also couldn't handle the similar usage, and case 4) doesn't
> >> generate a builtin as expected; it just expands to vec_select by
> >> coincidence.  So does this mean that vec_insert, vec_extract, and all
> >> other similar vector builtins should use an IFN, as suggested by
> >> Richard Biener, to match the pattern in gimple and expand both
> >> constant and variable indices in the expander?  Would this also be
> >> beneficial for targets other than power?  Or should we do that
> >> gradually after this patch is approved, since it seems to be an
> >> independent issue?  Thanks:)
> >
> > If the code generated for 3)/4) isn't optimal you have to figure out
> > why by tracing the RTL expansion code and looking for missing optabs.
> >
> > Consider the amount of backend code you need to write if ever using
> > those in a constexpr context ...
>
> It seems too complicated to expand "i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];"
> or "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);" to
> rs6000_expand_vector_insert/rs6000_expand_vector_extract in RTL, as:
> 1) vector v is stored to the stack with array type, so extra load and
> store operations are needed;
> 2) a fair amount of code is required to decompose the VIEW_CONVERT_EXPR
> to extract the vector and index, and then call
> rs6000_expand_vector_insert/rs6000_expand_vector_extract.
>
> which means replacing the following instructions #9~#12 with a new
> instruction sequence:
>     1: NOTE_INSN_DELETED
>     6: NOTE_INSN_BASIC_BLOCK 2
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>
>     9: r122:DI=r121:DI&0x3
>    10: r123:DI=r122:DI<<0x2
>    11: r124:DI=r112:DI+r123:DI
>    12: [r124:DI]=r120:DI#0
>
>    13: r118:V4SI=[r112:DI]
>    17: %2:V4SI=r118:V4SI
>    18: use %2:V4SI
>
> =>
>
>     1: NOTE_INSN_DELETED
>     6: NOTE_INSN_BASIC_BLOCK 2
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>
> r130:V4SI=[r112:DI]
> rs6000_expand_vector_insert (r130,  r121:DI&0x3, r120:DI#0)
> [r112:DI]=r130:V4SI
>
>    13: r118:V4SI=[r112:DI]
>    17: %2:V4SI=r118:V4SI
>    18: use %2:V4SI
>
> so maybe bypassing convert_vector_to_array_for_subscript for special
> circumstances like "i = v[n%4]" or "v[n&3] = i", to generate a
> vec_extract or vec_insert builtin call, is a relatively simpler method?

I think you have it backward.  You need to work with what
convert_vector_to_array_for_subscript gives and deal with it during RTL
expansion / optimization to generate more optimal code for power.  The
goal is to have as few target-specific builtins as possible during the
GIMPLE optimization phase (because we cannot work out their semantics in
the optimizers).

Richard.

>
> Xionghu
