RE: [RFC, vectorizer] Allow single element vector types for vector reduction operations

Tamar Christina Tue, 05 Sep 2017 03:25:26 -0700

Hi Richard,

That was an really interesting analysis, thanks for the details!


Would you be submitting the patch you proposed at the end as a fix?

Thanks,
Tamar

> -----Original Message-----
> From: Richard Biener [mailto:rguent...@suse.de]
> Sent: 05 September 2017 10:38
> To: Andrew Pinski
> Cc: Tamar Christina; Andreas Schwab; Jon Beniston; gcc-
> patc...@gcc.gnu.org; nd
> Subject: Re: [RFC, vectorizer] Allow single element vector types for vector
> reduction operations
> 
> On Mon, 4 Sep 2017, Andrew Pinski wrote:
> 
> > On Mon, Sep 4, 2017 at 7:28 AM, Tamar Christina
> <tamar.christ...@arm.com> wrote:
> > >> >   vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > >> intD.11>(vect__4.21_65);
> > >> >   vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > >> intD.11>(vect__4.22_63);
> > >> >   vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > >> intD.11>(vect__4.23_61);
> > >> >   vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > >> > intD.11>(vect__4.24_59);
> > >> >
> > >> > I suspect this patch will be quite bad for us performance wise as
> > >> > it thinks it's as cheap to do all our integer operations on the
> > >> > vector side with
> > >> vectors of 1 element. But I'm still waiting for the perf numbers to
> confirm.
> > >>
> > >> Looks like the backend advertises that it can do POPCOUNT on V1DI.
> > >> So SLP vectorization decides it can vectorize this without unrolling.
> > >
> > > We don't, POPCOUNT is only defined for vector modes V8QI and V16QI,
> > > we also don't define support For V1DI anywhere in the backend, we do
> > > however say we support V1DF, but removing That doesn't cause the ICE
> to go away.
> > >
> > >> Vectorization with V2DI is rejected:
> > >>
> > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: function
> > >> is not vectorizable.
> > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: not
> vectorized:
> > >> relevant stmt not supported: _8 = __builtin_popcountl (_5); ...
> > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: *****
> > >> Re-trying analysis with vector size 8
> > >>
> > >> and that now succeeds (it probably didn't succeed before the patch).
> > >
> > > In the .optimized file, I see it vectorised it to
> > >
> > >   vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> intD.11>(vect__4.21_65);
> > >   vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> intD.11>(vect__4.22_63);
> > >   vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> intD.11>(vect__4.23_61);
> > >   vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> intD.11>(vect__4.24_59);
> > >   _54 = POPCOUNT (vect__5.25_58);
> > >   _53 = POPCOUNT (vect__5.25_57);
> > >
> > > Which is something we just don't have a pattern for. Before this patch, it
> was rejecting "long unsigned int"
> > > With this patch is somehow thinks we support an integer vector of 1
> > > element, even though 1) we don't have an optab Defined for this
> operation for POPCOUNT (or at all in aarch64 as far as I can tell), and 2) we
> don't have it in our supported list of vector modes.
> >
> > Here are the two popcount optab aarch64 has:
> > (define_mode_iterator GPI [SI DI])
> > (define_expand "popcount<mode>2"
> >   [(match_operand:GPI 0 "register_operand")
> >    (match_operand:GPI 1 "register_operand")]
> >
> >
> > (define_insn "popcount<mode>2"
> > (define_mode_iterator VB [V8QI V16QI])
> >   [(set (match_operand:VB 0 "register_operand" "=w")
> >         (popcount:VB (match_operand:VB 1 "register_operand" "w")))]
> >
> > As you can see we only define popcount optab for SI, DI, V8QI and
> > V16QI.  (note SI and DI uses the V8QI and V16QI during the expansion
> > but that is a different story).
> >
> > Maybe somehow the vectorizer is thinking V1DI and DI are
> > interchangeable in some places.
> 
> We ask
> 
> Breakpoint 5, vectorizable_internal_function
> (cfn=CFN_BUILT_IN_POPCOUNTL,
>     fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>,
>     vectype_out=<vector_type 0x7ffff69789d8>,
>     vectype_in=<vector_type 0x7ffff697a690>)
>     at /tmp/trunk/gcc/tree-vect-stmts.c:1666
> 1666      if (internal_fn_p (cfn))
> (gdb) p debug_generic_expr (vectype_out)
> vector(2) int
> $10 = void
> (gdb) p debug_generic_expr (vectype_in)
> vector(1) long unsigned int
> $11 = void
> (gdb) fin
> Run till exit from #0  vectorizable_internal_function (
>     cfn=CFN_BUILT_IN_POPCOUNTL,
>     fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>,
>     vectype_out=<vector_type 0x7ffff69789d8>,
>     vectype_in=<vector_type 0x7ffff697a690>)
>     at /tmp/trunk/gcc/tree-vect-stmts.c:1666
> 0x0000000001206afc in vectorizable_call (gs=<gimple_call 0x7ffff7fee7e0>,
>     gsi=0x0, vec_stmt=0x0, slp_node=0x2451290)
>     at /tmp/trunk/gcc/tree-vect-stmts.c:2762
> 2762        ifn = vectorizable_internal_function (cfn, callee,
> vectype_out,
> Value returned is $12 = IFN_POPCOUNT
> 
> so somehow direct_internal_fn_supported_p says true for a POPCOUNTL
> V1DI -> V2SI.  I'd argue the question is odd already but the ultimative answer
> is cleary wrong ;)
> 
> We have
> 
> (gdb) p info
> $22 = (const direct_internal_fn_info &) @0x19b8e0c: {type0 = 0, type1 = 0,
>   vectorizable = 1}
> 
> so require vectype_in as vectype_out which means we ask for V1DI -> V1DI
> (and the vectorizer compensates for the demotion).  But the mode of V1DI is
> actually DI (because the target doesn't support V1DI).
> Thus we end up with asking for popcount with DImode which is available.
> 
> Note that this V1DI vector type having DImode is desirable for Jons case as
> far as I understand.  DImode gets used because aarch64 advertises generic
> vectorization support with word_mode.
> 
> Things may get tricky later when we look for VEC_PACK_TRUNC of V1DI,
> V1DI -> V2SI but the vec_pack_trunc_optab advertises DImode support via
> the VDN iterator (maybe expecting this only in case no V2SI support is
> available).
> 
> So in the end the vectorizer works as expected ;)
> 
> What we don't seem to handle is a single-element vector typed, DImode
> constructor with a single DImode element.
> 
> We call store_bit_field with fieldmode DI but a value with V2SI, I guess 
> things
> go downhill from there.  The SSA exp we store was DImode.  Ah.
> 
>  <ssa_name 0x7ffff66c0438
>     type <integer_type 0x7ffff68a07e0 long unsigned int public unsigned DI
>         size <integer_cst 0x7ffff6891a98 constant 64>
>         unit-size <integer_cst 0x7ffff6891ab0 constant 8>
>         align:64 warn_if_not_align:0 symtab:0 alias-set 3 canonical-type
> 0x7ffff68a07e0 precision:64 min <integer_cst 0x7ffff6891d68 0> max
> <integer_cst 0x7ffff6893540 18446744073709551615>
>         pointer_to_this <pointer_type 0x7ffff68b1c78>>
>     visited
>     def_stmt _62 = VEC_PACK_TRUNC_EXPR <_67, _64>;
> 
> I suppose _that_'s already somewhat bogus.  vector lowering creates it for
> some reason:
> 
> -  vect__8.26_52 = VEC_PACK_TRUNC_EXPR <_54, _53>;
> +  _67 = BIT_FIELD_REF <_54, 64, 0>;
> +  _64 = BIT_FIELD_REF <_53, 64, 0>;
> +  _62 = VEC_PACK_TRUNC_EXPR <_67, _64>;
> +  _60 = {_62};
> +  _48 = VIEW_CONVERT_EXPR<vector(2) int>(_60);
> +  vect__8.26_52 = _48;
> 
> which happens because
> 
> static tree
> get_compute_type (enum tree_code code, optab op, tree type) {
>   /* For very wide vectors, try using a smaller vector mode.  */
>   tree compute_type = type;
>   if (op
>       && (!VECTOR_MODE_P (TYPE_MODE (type))
>           || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing))
> 
> and/or
> 
>   if (compute_type == NULL_TREE)
>     compute_type = get_compute_type (code, op, type);
>   if (compute_type == type)
>     return;
> 
> note that lowering sth like VEC_PACK_TRUNC_EXPR into scalars is pointless -
> at least in the way the lowering code does.
> 
> Index: gcc/tree-vect-generic.c
> ==========================================================
> =========
> --- gcc/tree-vect-generic.c   (revision 251642)
> +++ gcc/tree-vect-generic.c   (working copy)
> @@ -1439,7 +1439,7 @@ get_compute_type (enum tree_code code, o
>    /* For very wide vectors, try using a smaller vector mode.  */
>    tree compute_type = type;
>    if (op
> -      && (!VECTOR_MODE_P (TYPE_MODE (type))
> +      && (TYPE_MODE (type) == BLKmode
>         || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing))
>      {
>        tree vector_compute_type
> @@ -1459,7 +1459,7 @@ get_compute_type (enum tree_code code, o
>    if (compute_type == type)
>      {
>        machine_mode compute_mode = TYPE_MODE (compute_type);
> -      if (VECTOR_MODE_P (compute_mode))
> +      if (compute_mode != BLKmode)
>       {
>         if (op && optab_handler (op, compute_mode) !=
> CODE_FOR_nothing)
>           return compute_type;
> 
> fixes this and we pass the testcase again.  Note we vectorize that way even
> with the cost model enabled so it might show us that the cost model fails to
> account for some costs:
> 
> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: Cost model
> analysis:
>   Vector inside of loop cost: 8
>   Vector prologue cost: 2
>   Vector epilogue cost: 0
>   Scalar iteration cost: 16
>   Scalar outside cost: 0
>   Vector outside cost: 2
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 1
> 
> SLP costing seems to fail to account for promotion/demotion costs it seems
> accounting for that still doesn't push it over the edge.
> 
> Richard.
> 
> > Thanks,
> > Andrew
> >
> >
> > >
> > > So I don't quite understand how we end up with this expression.
> > >
> > > Regards,
> > > Tamar
> > >
> > >>
> > >> Richard.
> > >>
> > >> > ________________________________________
> > >> > From: gcc-patches-ow...@gcc.gnu.org
> > >> > <gcc-patches-ow...@gcc.gnu.org>
> > >> on
> > >> > behalf of Andreas Schwab <sch...@linux-m68k.org>
> > >> > Sent: Saturday, September 2, 2017 10:58 PM
> > >> > To: Jon Beniston
> > >> > Cc: 'Richard Biener'; gcc-patches@gcc.gnu.org
> > >> > Subject: Re: [RFC, vectorizer] Allow single element vector types
> > >> > for vector reduction operations
> > >> >
> > >> > On Aug 30 2017, "Jon Beniston" <j...@beniston.com> wrote:
> > >> >
> > >> > > gcc/
> > >> > > 2017-08-30  Jon Beniston  <j...@beniston.com>
> > >> > >             Richard Biener  <rguent...@suse.de>
> > >> > >
> > >> > > diff --git a/gcc/tree-vect-patterns.c
> > >> > > b/gcc/tree-vect-patterns.c index cfdb72c..5ebeac2 100644
> > >> > > --- a/gcc/tree-vect-patterns.c
> > >> > > +++ b/gcc/tree-vect-patterns.c
> > >> > > @@ -4150,7 +4150,7 @@ vect_pattern_recog_1 (vect_recog_func
> > >> *recog_func,
> > >> > >    loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> > >> > >
> > >> > >    if (VECTOR_BOOLEAN_TYPE_P (type_in)
> > >> > > -      || VECTOR_MODE_P (TYPE_MODE (type_in)))
> > >> > > +      || VECTOR_TYPE_P (type_in))
> > >> > >      {
> > >> > >        /* No need to check target support (already checked by the
> pattern
> > >> > >           recognition function).  */ diff --git
> > >> > > a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index
> > >> > > 013fb1f..fc62efb 100644
> > >> > > --- a/gcc/tree-vect-stmts.c
> > >> > > +++ b/gcc/tree-vect-stmts.c
> > >> > > @@ -9098,7 +9098,8 @@ get_vectype_for_scalar_type_and_size
> > >> > > (tree scalar_type, unsigned size)
> > >> > >    else
> > >> > >      simd_mode = mode_for_vector (inner_mode, size / nbytes);
> > >> > >    nunits = GET_MODE_SIZE (simd_mode) / nbytes;
> > >> > > -  if (nunits <= 1)
> > >> > > +  /* NOTE: nunits == 1 is allowed to support single element vector
> types.
> > >> > > */
> > >> > > +  if (nunits < 1)
> > >> > >      return NULL_TREE;
> > >> > >
> > >> > >    vectype = build_vector_type (scalar_type, nunits);
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >> > That breaks vect/pr68577.c on aarch64.
> > >> >
> > >> > during RTL pass: expand
> > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c: In
> > >> > function
> > >> 'slp_test':
> > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c:20:12:
> > >> > internal compiler error: in simplify_subreg, at
> > >> > simplify-rtx.c:6050
> > >> > 0xb55983 simplify_subreg(machine_mode, rtx_def*, machine_mode,
> > >> unsigned int)
> > >> >         ../../gcc/simplify-rtx.c:6049
> > >> > 0xb595f7 simplify_gen_subreg(machine_mode, rtx_def*,
> > >> > machine_mode,
> > >> unsigned int)
> > >> >         ../../gcc/simplify-rtx.c:6278
> > >> > 0x81d277 store_bit_field_1
> > >> >         ../../gcc/expmed.c:798
> > >> > 0x81d55f store_bit_field(rtx_def*, unsigned long, unsigned long,
> > >> > unsigned
> > >> long, unsigned long, machine_mode, rtx_def*, bool)
> > >> >         ../../gcc/expmed.c:1133
> > >> > 0x840bf7 store_field
> > >> >         ../../gcc/expr.c:6950
> > >> > 0x84792f store_constructor_field
> > >> >         ../../gcc/expr.c:6142
> > >> > 0x83edbf store_constructor
> > >> >         ../../gcc/expr.c:6726
> > >> > 0x840443 expand_constructor
> > >> >         ../../gcc/expr.c:8027
> > >> > 0x82d5bf expand_expr_real_1(tree_node*, rtx_def*,
> machine_mode,
> > >> expand_modifier, rtx_def**, bool)
> > >> >         ../../gcc/expr.c:10133
> > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*,
> machine_mode,
> > >> expand_modifier, rtx_def**, bool)
> > >> >         ../../gcc/expr.c:9819
> > >> > 0x82dadf expand_expr_real_1(tree_node*, rtx_def*,
> machine_mode,
> > >> expand_modifier, rtx_def**, bool)
> > >> >         ../../gcc/expr.c:10942
> > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*,
> machine_mode,
> > >> expand_modifier, rtx_def**, bool)
> > >> >         ../../gcc/expr.c:9819
> > >> > 0x83d197 expand_expr
> > >> >         ../../gcc/expr.h:276
> > >> > 0x83d197 expand_assignment(tree_node*, tree_node*, bool)
> > >> >         ../../gcc/expr.c:4971
> > >> > 0x71e2f3 expand_gimple_stmt_1
> > >> >         ../../gcc/cfgexpand.c:3653
> > >> > 0x71e2f3 expand_gimple_stmt
> > >> >         ../../gcc/cfgexpand.c:3751 0x721cdb
> > >> > expand_gimple_basic_block
> > >> >         ../../gcc/cfgexpand.c:5750
> > >> > 0x726b07 execute
> > >> >         ../../gcc/cfgexpand.c:6357
> > >> >
> > >> > Andreas.
> > >> >
> > >> > --
> > >> > Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA
> > >> > 54C7 6D53 942B 1756  01D3 44D5 214B 8276
> > >> > 4ED5 "And now for something completely different."
> > >> >
> > >> >
> > >>
> > >> --
> > >> Richard Biener <rguent...@suse.de>
> > >> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> > >> Norton, HRB 21284 (AG Nuernberg)
> >
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nuernberg)

RE: [RFC, vectorizer] Allow single element vector types for vector reduction operations

Reply via email to