Hi Richard, That was an really interesting analysis, thanks for the details!
Would you be submitting the patch you proposed at the end as a fix? Thanks, Tamar > -----Original Message----- > From: Richard Biener [mailto:rguent...@suse.de] > Sent: 05 September 2017 10:38 > To: Andrew Pinski > Cc: Tamar Christina; Andreas Schwab; Jon Beniston; gcc- > patc...@gcc.gnu.org; nd > Subject: Re: [RFC, vectorizer] Allow single element vector types for vector > reduction operations > > On Mon, 4 Sep 2017, Andrew Pinski wrote: > > > On Mon, Sep 4, 2017 at 7:28 AM, Tamar Christina > <tamar.christ...@arm.com> wrote: > > >> > vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned > > >> intD.11>(vect__4.21_65); > > >> > vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned > > >> intD.11>(vect__4.22_63); > > >> > vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned > > >> intD.11>(vect__4.23_61); > > >> > vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned > > >> > intD.11>(vect__4.24_59); > > >> > > > >> > I suspect this patch will be quite bad for us performance wise as > > >> > it thinks it's as cheap to do all our integer operations on the > > >> > vector side with > > >> vectors of 1 element. But I'm still waiting for the perf numbers to > confirm. > > >> > > >> Looks like the backend advertises that it can do POPCOUNT on V1DI. > > >> So SLP vectorization decides it can vectorize this without unrolling. > > > > > > We don't, POPCOUNT is only defined for vector modes V8QI and V16QI, > > > we also don't define support For V1DI anywhere in the backend, we do > > > however say we support V1DF, but removing That doesn't cause the ICE > to go away. > > > > > >> Vectorization with V2DI is rejected: > > >> > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: function > > >> is not vectorizable. > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: not > vectorized: > > >> relevant stmt not supported: _8 = __builtin_popcountl (_5); ... > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: ***** > > >> Re-trying analysis with vector size 8 > > >> > > >> and that now succeeds (it probably didn't succeed before the patch). > > > > > > In the .optimized file, I see it vectorised it to > > > > > > vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned > intD.11>(vect__4.21_65); > > > vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned > intD.11>(vect__4.22_63); > > > vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned > intD.11>(vect__4.23_61); > > > vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned > intD.11>(vect__4.24_59); > > > _54 = POPCOUNT (vect__5.25_58); > > > _53 = POPCOUNT (vect__5.25_57); > > > > > > Which is something we just don't have a pattern for. Before this patch, it > was rejecting "long unsigned int" > > > With this patch is somehow thinks we support an integer vector of 1 > > > element, even though 1) we don't have an optab Defined for this > operation for POPCOUNT (or at all in aarch64 as far as I can tell), and 2) we > don't have it in our supported list of vector modes. > > > > Here are the two popcount optab aarch64 has: > > (define_mode_iterator GPI [SI DI]) > > (define_expand "popcount<mode>2" > > [(match_operand:GPI 0 "register_operand") > > (match_operand:GPI 1 "register_operand")] > > > > > > (define_insn "popcount<mode>2" > > (define_mode_iterator VB [V8QI V16QI]) > > [(set (match_operand:VB 0 "register_operand" "=w") > > (popcount:VB (match_operand:VB 1 "register_operand" "w")))] > > > > As you can see we only define popcount optab for SI, DI, V8QI and > > V16QI. (note SI and DI uses the V8QI and V16QI during the expansion > > but that is a different story). > > > > Maybe somehow the vectorizer is thinking V1DI and DI are > > interchangeable in some places. > > We ask > > Breakpoint 5, vectorizable_internal_function > (cfn=CFN_BUILT_IN_POPCOUNTL, > fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>, > vectype_out=<vector_type 0x7ffff69789d8>, > vectype_in=<vector_type 0x7ffff697a690>) > at /tmp/trunk/gcc/tree-vect-stmts.c:1666 > 1666 if (internal_fn_p (cfn)) > (gdb) p debug_generic_expr (vectype_out) > vector(2) int > $10 = void > (gdb) p debug_generic_expr (vectype_in) > vector(1) long unsigned int > $11 = void > (gdb) fin > Run till exit from #0 vectorizable_internal_function ( > cfn=CFN_BUILT_IN_POPCOUNTL, > fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>, > vectype_out=<vector_type 0x7ffff69789d8>, > vectype_in=<vector_type 0x7ffff697a690>) > at /tmp/trunk/gcc/tree-vect-stmts.c:1666 > 0x0000000001206afc in vectorizable_call (gs=<gimple_call 0x7ffff7fee7e0>, > gsi=0x0, vec_stmt=0x0, slp_node=0x2451290) > at /tmp/trunk/gcc/tree-vect-stmts.c:2762 > 2762 ifn = vectorizable_internal_function (cfn, callee, > vectype_out, > Value returned is $12 = IFN_POPCOUNT > > so somehow direct_internal_fn_supported_p says true for a POPCOUNTL > V1DI -> V2SI. I'd argue the question is odd already but the ultimative answer > is cleary wrong ;) > > We have > > (gdb) p info > $22 = (const direct_internal_fn_info &) @0x19b8e0c: {type0 = 0, type1 = 0, > vectorizable = 1} > > so require vectype_in as vectype_out which means we ask for V1DI -> V1DI > (and the vectorizer compensates for the demotion). But the mode of V1DI is > actually DI (because the target doesn't support V1DI). > Thus we end up with asking for popcount with DImode which is available. > > Note that this V1DI vector type having DImode is desirable for Jons case as > far as I understand. DImode gets used because aarch64 advertises generic > vectorization support with word_mode. > > Things may get tricky later when we look for VEC_PACK_TRUNC of V1DI, > V1DI -> V2SI but the vec_pack_trunc_optab advertises DImode support via > the VDN iterator (maybe expecting this only in case no V2SI support is > available). > > So in the end the vectorizer works as expected ;) > > What we don't seem to handle is a single-element vector typed, DImode > constructor with a single DImode element. > > We call store_bit_field with fieldmode DI but a value with V2SI, I guess > things > go downhill from there. The SSA exp we store was DImode. Ah. > > <ssa_name 0x7ffff66c0438 > type <integer_type 0x7ffff68a07e0 long unsigned int public unsigned DI > size <integer_cst 0x7ffff6891a98 constant 64> > unit-size <integer_cst 0x7ffff6891ab0 constant 8> > align:64 warn_if_not_align:0 symtab:0 alias-set 3 canonical-type > 0x7ffff68a07e0 precision:64 min <integer_cst 0x7ffff6891d68 0> max > <integer_cst 0x7ffff6893540 18446744073709551615> > pointer_to_this <pointer_type 0x7ffff68b1c78>> > visited > def_stmt _62 = VEC_PACK_TRUNC_EXPR <_67, _64>; > > I suppose _that_'s already somewhat bogus. vector lowering creates it for > some reason: > > - vect__8.26_52 = VEC_PACK_TRUNC_EXPR <_54, _53>; > + _67 = BIT_FIELD_REF <_54, 64, 0>; > + _64 = BIT_FIELD_REF <_53, 64, 0>; > + _62 = VEC_PACK_TRUNC_EXPR <_67, _64>; > + _60 = {_62}; > + _48 = VIEW_CONVERT_EXPR<vector(2) int>(_60); > + vect__8.26_52 = _48; > > which happens because > > static tree > get_compute_type (enum tree_code code, optab op, tree type) { > /* For very wide vectors, try using a smaller vector mode. */ > tree compute_type = type; > if (op > && (!VECTOR_MODE_P (TYPE_MODE (type)) > || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)) > > and/or > > if (compute_type == NULL_TREE) > compute_type = get_compute_type (code, op, type); > if (compute_type == type) > return; > > note that lowering sth like VEC_PACK_TRUNC_EXPR into scalars is pointless - > at least in the way the lowering code does. > > Index: gcc/tree-vect-generic.c > ========================================================== > ========= > --- gcc/tree-vect-generic.c (revision 251642) > +++ gcc/tree-vect-generic.c (working copy) > @@ -1439,7 +1439,7 @@ get_compute_type (enum tree_code code, o > /* For very wide vectors, try using a smaller vector mode. */ > tree compute_type = type; > if (op > - && (!VECTOR_MODE_P (TYPE_MODE (type)) > + && (TYPE_MODE (type) == BLKmode > || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)) > { > tree vector_compute_type > @@ -1459,7 +1459,7 @@ get_compute_type (enum tree_code code, o > if (compute_type == type) > { > machine_mode compute_mode = TYPE_MODE (compute_type); > - if (VECTOR_MODE_P (compute_mode)) > + if (compute_mode != BLKmode) > { > if (op && optab_handler (op, compute_mode) != > CODE_FOR_nothing) > return compute_type; > > fixes this and we pass the testcase again. Note we vectorize that way even > with the cost model enabled so it might show us that the cost model fails to > account for some costs: > > /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: Cost model > analysis: > Vector inside of loop cost: 8 > Vector prologue cost: 2 > Vector epilogue cost: 0 > Scalar iteration cost: 16 > Scalar outside cost: 0 > Vector outside cost: 2 > prologue iterations: 0 > epilogue iterations: 0 > Calculated minimum iters for profitability: 1 > > SLP costing seems to fail to account for promotion/demotion costs it seems > accounting for that still doesn't push it over the edge. > > Richard. > > > Thanks, > > Andrew > > > > > > > > > > So I don't quite understand how we end up with this expression. > > > > > > Regards, > > > Tamar > > > > > >> > > >> Richard. > > >> > > >> > ________________________________________ > > >> > From: gcc-patches-ow...@gcc.gnu.org > > >> > <gcc-patches-ow...@gcc.gnu.org> > > >> on > > >> > behalf of Andreas Schwab <sch...@linux-m68k.org> > > >> > Sent: Saturday, September 2, 2017 10:58 PM > > >> > To: Jon Beniston > > >> > Cc: 'Richard Biener'; gcc-patches@gcc.gnu.org > > >> > Subject: Re: [RFC, vectorizer] Allow single element vector types > > >> > for vector reduction operations > > >> > > > >> > On Aug 30 2017, "Jon Beniston" <j...@beniston.com> wrote: > > >> > > > >> > > gcc/ > > >> > > 2017-08-30 Jon Beniston <j...@beniston.com> > > >> > > Richard Biener <rguent...@suse.de> > > >> > > > > >> > > diff --git a/gcc/tree-vect-patterns.c > > >> > > b/gcc/tree-vect-patterns.c index cfdb72c..5ebeac2 100644 > > >> > > --- a/gcc/tree-vect-patterns.c > > >> > > +++ b/gcc/tree-vect-patterns.c > > >> > > @@ -4150,7 +4150,7 @@ vect_pattern_recog_1 (vect_recog_func > > >> *recog_func, > > >> > > loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info); > > >> > > > > >> > > if (VECTOR_BOOLEAN_TYPE_P (type_in) > > >> > > - || VECTOR_MODE_P (TYPE_MODE (type_in))) > > >> > > + || VECTOR_TYPE_P (type_in)) > > >> > > { > > >> > > /* No need to check target support (already checked by the > pattern > > >> > > recognition function). */ diff --git > > >> > > a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index > > >> > > 013fb1f..fc62efb 100644 > > >> > > --- a/gcc/tree-vect-stmts.c > > >> > > +++ b/gcc/tree-vect-stmts.c > > >> > > @@ -9098,7 +9098,8 @@ get_vectype_for_scalar_type_and_size > > >> > > (tree scalar_type, unsigned size) > > >> > > else > > >> > > simd_mode = mode_for_vector (inner_mode, size / nbytes); > > >> > > nunits = GET_MODE_SIZE (simd_mode) / nbytes; > > >> > > - if (nunits <= 1) > > >> > > + /* NOTE: nunits == 1 is allowed to support single element vector > types. > > >> > > */ > > >> > > + if (nunits < 1) > > >> > > return NULL_TREE; > > >> > > > > >> > > vectype = build_vector_type (scalar_type, nunits); > > >> > > > > >> > > > > >> > > > > >> > > > >> > That breaks vect/pr68577.c on aarch64. > > >> > > > >> > during RTL pass: expand > > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c: In > > >> > function > > >> 'slp_test': > > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c:20:12: > > >> > internal compiler error: in simplify_subreg, at > > >> > simplify-rtx.c:6050 > > >> > 0xb55983 simplify_subreg(machine_mode, rtx_def*, machine_mode, > > >> unsigned int) > > >> > ../../gcc/simplify-rtx.c:6049 > > >> > 0xb595f7 simplify_gen_subreg(machine_mode, rtx_def*, > > >> > machine_mode, > > >> unsigned int) > > >> > ../../gcc/simplify-rtx.c:6278 > > >> > 0x81d277 store_bit_field_1 > > >> > ../../gcc/expmed.c:798 > > >> > 0x81d55f store_bit_field(rtx_def*, unsigned long, unsigned long, > > >> > unsigned > > >> long, unsigned long, machine_mode, rtx_def*, bool) > > >> > ../../gcc/expmed.c:1133 > > >> > 0x840bf7 store_field > > >> > ../../gcc/expr.c:6950 > > >> > 0x84792f store_constructor_field > > >> > ../../gcc/expr.c:6142 > > >> > 0x83edbf store_constructor > > >> > ../../gcc/expr.c:6726 > > >> > 0x840443 expand_constructor > > >> > ../../gcc/expr.c:8027 > > >> > 0x82d5bf expand_expr_real_1(tree_node*, rtx_def*, > machine_mode, > > >> expand_modifier, rtx_def**, bool) > > >> > ../../gcc/expr.c:10133 > > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*, > machine_mode, > > >> expand_modifier, rtx_def**, bool) > > >> > ../../gcc/expr.c:9819 > > >> > 0x82dadf expand_expr_real_1(tree_node*, rtx_def*, > machine_mode, > > >> expand_modifier, rtx_def**, bool) > > >> > ../../gcc/expr.c:10942 > > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*, > machine_mode, > > >> expand_modifier, rtx_def**, bool) > > >> > ../../gcc/expr.c:9819 > > >> > 0x83d197 expand_expr > > >> > ../../gcc/expr.h:276 > > >> > 0x83d197 expand_assignment(tree_node*, tree_node*, bool) > > >> > ../../gcc/expr.c:4971 > > >> > 0x71e2f3 expand_gimple_stmt_1 > > >> > ../../gcc/cfgexpand.c:3653 > > >> > 0x71e2f3 expand_gimple_stmt > > >> > ../../gcc/cfgexpand.c:3751 0x721cdb > > >> > expand_gimple_basic_block > > >> > ../../gcc/cfgexpand.c:5750 > > >> > 0x726b07 execute > > >> > ../../gcc/cfgexpand.c:6357 > > >> > > > >> > Andreas. > > >> > > > >> > -- > > >> > Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA > > >> > 54C7 6D53 942B 1756 01D3 44D5 214B 8276 > > >> > 4ED5 "And now for something completely different." > > >> > > > >> > > > >> > > >> -- > > >> Richard Biener <rguent...@suse.de> > > >> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham > > >> Norton, HRB 21284 (AG Nuernberg) > > > > > > -- > Richard Biener <rguent...@suse.de> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, > HRB 21284 (AG Nuernberg)