On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luo...@linux.ibm.com> wrote: > > > > On 2020/9/14 17:47, Richard Biener wrote: > > On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luo...@linux.ibm.com> wrote: > > >> Not sure whether this reflects the issues you discussed above. > >> I constructed below test cases and tested with and without this patch, > >> only if "a+c"(which means store only), the performance is getting bad with > >> this patch due to extra load/store(about 50% slower). > >> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so > >> this > >> patch will always get big benefit. > >> But for "v[n % 4] = i;" usage, it depends on whether "v" is used > >> immediately > >> inside the function or out of the function soon. Does this mean unify the > >> two > >> usage to same gimple code not a good idea sometimes? Or is it possible to > >> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is > >> used not far away inside or out side of the function? > >> > >> > >> #define TYPE int > >> > >> vector TYPE v = {1, 2, 3, 4}; // a. global vector. > >> vector TYPE s = {4, 3, 2, 1}; > >> > >> __attribute__ ((noinline)) > >> vector TYPE > >> test (vector TYPE u, TYPE i, size_t n) > >> { > >> v[n % 4] = i; > > > > ^^^ > > > > this should be > > > > u[n % 4] = i; > > > > I guess. Is the % 4 mandated by the vec_insert semantics btw? > > Yes. Segher pasted the builtin description in his reply. "v = vec_insert > (i, u, n);" > is a bit different with "u[n % 4] = i;" since it returns a copy of u instead > of modify > u. The adjust in lower __builtin_vec_insert_xxx gimple will make a copy of u > first to > meet the requirements. > > > > > If you tested with the above error you probably need to re-measure. > > No I did test for u as local instead of global before. If u is not used very > soon, the performance is almost the same for generating single store or > IFN_SET > of inserting with variable index. > > source: > __attribute__ ((noinline)) vector TYPE > test (vector TYPE u, TYPE i, size_t n) > { > u[n % 4] = i; > vector TYPE r = {0}; > for (long k = 0; k < 100; k++) > { > r += v; > } > return u+r; > } > > => store hit load is relieved due to long distance. > > ASM: > 0: addis 2,12,.TOC.-.LCF0@ha > addi 2,2,.TOC.-.LCF0@l > .localentry test,.-test > addis 10,2,.LANCHOR0@toc@ha > li 8,50 > xxspltib 32,0 > addi 9,1,-16 > rldic 6,6,2,60 > stxv 34,-16(1) > addi 10,10,.LANCHOR0@toc@l > mtctr 8 > xxlor 33,32,32 > stwx 5,9,6 // short store > lxv 45,0(10) > .p2align 4,,15 > .L2: > vadduwm 0,0,13 > vadduwm 1,1,13 > bdnz .L2 > vadduwm 0,0,1 > lxv 33,-16(1) // wide load > vadduwm 2,0,1 > blr > > > Then I intend to use "v" there for global memory insert test to separate > the store and load into different function ("v" is stored in function, > but loaded out of function will get different performance for single store > or lxv+xxperm+xxsel+stxv, the inner function doesn't know the distance > between load and store across functions). > > Do you mean that we don't need generate IFN_SET if "v" is global memory?
Yes. > I only see VAR_DECL and PARM_DECL, is there any function to check the tree > variable is global? I added DECL_REGISTER, but the RTL still expands to stack: is_global_var () or alternatively !auto_var_in_fn_p (), I think doing IFN_SET only makes sense if there's the chance we can promote the variable to a register. But it would be an incorrect transform (it stores the whole vector) if the vector storage could "escape" to another thread - which means you probably have to check !TREE_ADDRESSABLE as well. > > gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, > EXPAND_WRITE); > > (gdb) p view_op0 > $584 = <var_decl 0x7ffff7f705a0> > (gdb) p DECL_REGISTER(view_op0) > $585 = 1 > (gdb) p to_rtx > $586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 > S16 A128]) > > > > > > On gimple the above function (after fixing it) looks like > > > > VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D); > > > > and the IFN idea I had would - for non-global memory 'u' only - transform > > this to > > > > vector_register_2 = u; > > vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D)); > > u = vector_register_3; > > > > if vec_set can handle variable indexes. This then becomes a > > vec_set on a register and if that was the only variable indexing of 'u' > > will also cause 'u' to be expanded as register rather than stack memory. > > > > Note we can't use the direct-optab method here since the vec_set optab > > modifies operand 0 which isn't possible in SSA form. That might hint > > at that we eventually want to extend BIT_INSERT_EXPR to handle > > a non-constant bit position but for experiments using an alternate > > internal function is certainly easier. > > > > My current implementation does: > > 1) v = vec_insert (i, u, n); > > =>gimple: > { > register __vector signed int D.3190; > D.3190 = u; // *new decl and copy u first.* > _1 = n & 3; > VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i; // *update op0 of > VIEW_CONVERT_EXPR* > _2 = D.3190; > ... > } > > =>isel: > { > register __vector signed int D.3190; > D.3190 = u_4(D); > _1 = n_6(D) & 3; > .VEC_SET (&D.3190, i_7(D), _1); why are you passing the address of D.3190 to .VEC_SET? That will not make D.3190 be expanded to a pseudo. You really need to have GIMPLE registers here (SSA name) and thus a return value, leaving the argument unmodified D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1); note this is why I asked about the actual CPU instruction - as I read Seghers mail the instruction modifies a vector register, not memory. > _2 = D.3190; > ... > } > > > 2) u[n % 4] = i; > > =>gimple: > { > __vector signed int D.3191; > _1 = n & 3; > VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i; // *update op0 of VIEW_CONVERT_EXPR* > D.3191 = u; > ... > } > > =>isel: > { > D.3190 = u_4(D); > _1 = n_6(D) & 3; > .VEC_SET (&D.3191, i_7(D), _1); > _2 = D.3190; > v = _2; > } > > The IFN ".VEC_SET" behavior quite like the other IFN STORE functions and > doesn't > require a dest operand to be set? Both 1) and 2) are modifying operand 0 of > VIEW_CONVERT_EXPR just like vec_set optab. > > Attached the IFN vec_set patch part, the expand part is moved from > expr.c:expand_assignment > to internal-fn.c:expand_vec_set_optab_fn now. > > > Thanks, > Xionghu