On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luo...@linux.ibm.com> wrote:
>
>
>
> On 2020/9/14 17:47, Richard Biener wrote:
> > On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luo...@linux.ibm.com> wrote:
>
> >> Not sure whether this reflects the issues you discussed above.
> >> I constructed below test cases and tested with and without this patch,
> >> only if "a+c"(which means store only), the performance is getting bad with
> >> this patch due to extra load/store(about 50% slower).
> >> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so 
> >> this
> >> patch will always get big benefit.
> >> But for "v[n % 4] = i;" usage, it depends on whether "v" is used 
> >> immediately
> >> inside the function or out of the function soon.  Does this mean unify the 
> >> two
> >> usage to same gimple code not a good idea sometimes?  Or is it possible to
> >> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
> >> used not far away inside or out side of the function?
> >>
> >>
> >> #define TYPE int
> >>
> >> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> >> vector TYPE s = {4, 3, 2, 1};
> >>
> >> __attribute__ ((noinline))
> >> vector TYPE
> >> test (vector TYPE u, TYPE i, size_t n)
> >> {
> >>    v[n % 4] = i;
> >
> > ^^^
> >
> > this should be
> >
> >     u[n % 4] = i;
> >
> > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert 
> (i, u, n);"
> is a bit different with "u[n % 4] = i;" since it returns a copy of u instead 
> of modify
> u. The adjust in lower __builtin_vec_insert_xxx gimple will make a copy of u 
> first to
> meet the requirements.
>
> >
> > If you tested with the above error you probably need to re-measure.
>
> No I did test for u as local instead of global before.  If u is not used very
> soon, the performance is almost the same for generating single store or 
> IFN_SET
> of inserting with variable index.
>
> source:
> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   u[n % 4] = i;
>   vector TYPE r = {0};
>   for (long k = 0; k < 100; k++)
>     {
>       r += v;
>     }
>   return u+r;
> }
>
> => store hit load is relieved due to long distance.
>
> ASM:
> 0:      addis 2,12,.TOC.-.LCF0@ha
>         addi 2,2,.TOC.-.LCF0@l
>         .localentry     test,.-test
>         addis 10,2,.LANCHOR0@toc@ha
>         li 8,50
>         xxspltib 32,0
>         addi 9,1,-16
>         rldic 6,6,2,60
>         stxv 34,-16(1)
>         addi 10,10,.LANCHOR0@toc@l
>         mtctr 8
>         xxlor 33,32,32
>         stwx 5,9,6      // short store
>         lxv 45,0(10)
>         .p2align 4,,15
> .L2:
>         vadduwm 0,0,13
>         vadduwm 1,1,13
>         bdnz .L2
>         vadduwm 0,0,1
>         lxv 33,-16(1)   // wide load
>         vadduwm 2,0,1
>         blr
>
>
> Then I intend to use "v" there for global memory insert test to separate
> the store and load into different function ("v" is stored in function,
> but loaded out of function will get different performance for single store
> or lxv+xxperm+xxsel+stxv, the inner function doesn't know the distance
> between load and store across functions).
>
> Do you mean that we don't need generate IFN_SET if "v" is global memory?

Yes.

> I only see VAR_DECL and PARM_DECL, is there any function to check the tree
> variable is global? I added DECL_REGISTER, but the RTL still expands to stack:

is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
IFN_SET only
makes sense if there's the chance we can promote the variable to a
register.  But it
would be an incorrect transform (it stores the whole vector) if the
vector storage
could "escape" to another thread - which means you probably have to check
!TREE_ADDRESSABLE as well.

>
> gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, 
> EXPAND_WRITE);
>
> (gdb) p view_op0
> $584 = <var_decl 0x7ffff7f705a0>
> (gdb) p DECL_REGISTER(view_op0)
> $585 = 1
> (gdb) p to_rtx
> $586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 
> S16 A128])
>
>
> >
> > On gimple the above function (after fixing it) looks like
> >
> >    VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >    vector_register_2 = u;
> >    vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >    u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.  That might hint
> > at that we eventually want to extend BIT_INSERT_EXPR to handle
> > a non-constant bit position but for experiments using an alternate
> > internal function is certainly easier.
> >
>
> My current implementation does:
>
> 1)  v = vec_insert (i, u, n);
>
> =>gimple:
> {
>   register __vector signed int D.3190;
>   D.3190 = u;            // *new decl and copy u first.*
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of 
> VIEW_CONVERT_EXPR*
>   _2 = D.3190;
>   ...
> }
>
> =>isel:
> {
>   register __vector signed int D.3190;
>   D.3190 = u_4(D);
>   _1 = n_6(D) & 3;
>   .VEC_SET (&D.3190, i_7(D), _1);

why are you passing the address of D.3190 to .VEC_SET?  That will not
make D.3190
be expanded to a pseudo.   You really need to have GIMPLE registers
here (SSA name)
and thus a return value, leaving the argument unmodified

  D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);

note this is why I asked about the actual CPU instruction - as I read
Seghers mail
the instruction modifies a vector register, not memory.

>   _2 = D.3190;
>   ...
> }
>
>
> 2) u[n % 4] = i;
>
> =>gimple:
> {
>   __vector signed int D.3191;
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>   D.3191 = u;
>  ...
> }
>
> =>isel:
> {
>   D.3190 = u_4(D);
>   _1 = n_6(D) & 3;
>   .VEC_SET (&D.3191, i_7(D), _1);
>   _2 = D.3190;
>   v = _2;
> }
>
> The IFN ".VEC_SET" behavior quite like the other IFN STORE functions and 
> doesn't
> require a dest operand to be set?  Both 1) and 2) are modifying operand 0 of
> VIEW_CONVERT_EXPR just like vec_set optab.
>
> Attached the IFN vec_set patch part, the expand part is moved from 
> expr.c:expand_assignment
> to internal-fn.c:expand_vec_set_optab_fn now.
>
>
> Thanks,
> Xionghu

Reply via email to