On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, October 15, 2024 12:13 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > VEC_PERM_EXPR
> > 
> > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > 
> > > Hi,
> > >
> > > Thanks for the look,
> > >
> > > The 10/15/2024 09:54, Richard Biener wrote:
> > > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch series adds support for a target to do a direct conversion
> > > > > for zero extends using permutes.
> > > > >
> > > > > To do this it uses a target hook use_permute_for_promotion which must
> > > > > be implemented by targets.  This hook is used to indicate:
> > > > >
> > > > >  1. can a target do this for the given modes.
> > > >
> > > > can_vec_perm_const_p?
> > > >
> > > > >  3. can the target convert between various vector modes with a VIEW_CONVERT.
> > > >
> > > > We have modes_tieable_p for this I think.
> > > >
> > >
> > > Yes, though the reason I didn't use either of them is that they report
> > > a capability of the backend.  In which case the hook, which is already
> > > backend specific, should answer these two.
> > >
> > > I initially had these checks there, but they didn't seem to add value;
> > > for promotions the masks only depend on the input and output modes, so
> > > they really don't change.
> > >
> > > When you have, say, a loop that does lots of conversions from char to
> > > int, it seemed like a waste to retest the same permute constants over
> > > and over again.
> > >
> > > I can add them back in if you prefer...
> > >
> > > > >  2. is it profitable for the target to do it.
> > > >
> > > > So you say the target can do both ways but both zip and tbl are
> > > > permute instructions so I really fail to see the point and why
> > > > the target itself doesn't choose to use tbl for unpack.
> > > >
> > > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > > >
> > >
> > > Yes, and this happens quite often, e.g. load permutes or lane shuffles
> > > etc.  The reason for exposing them as VEC_PERM was to trigger further
> > > optimizations.
> > >
> > > If you remember the ticket about LOAD_LANES: with this optimization and
> > > an open encoding of LOAD_LANES we stop using it in cases where there's
> > > a zero extend after the LOAD_LANES, because then you're doing
> > > effectively two permutes and the LOAD_LANES is no longer beneficial.
> > > There are other examples, load and replicate etc.
> > >
> > > > That said, I'm not against supporting VEC_PERM code gen from
> > > > unsigned promotion but I don't see why we should do this when
> > > > the target advertises VEC_UNPACK_* support or direct conversion
> > > > support?
> > > >
> > > > Esp. with adding a "local" cost related hook which cannot take
> > > > into account context.
> > > >
> > >
> > > To summarize a long story:
> > >
> > >   yes I open encode zero extends as permutes to allow further
> > >   optimizations.  One could convert vec_unpacks to convert optabs and
> > >   use that, but that is an opaque value that can't be further
> > >   optimized.
> > >
> > >   The hook isn't really a costing thing in the general sense.  It's
> > >   literally just "do you want permutes, yes or no".  The reason it gets
> > >   the modes is simply that I don't think a single-level extend is worth
> > >   it, but I can just change it to never try to do this on more than one
> > >   level.
> > 
> > When you mention LOAD_LANES: we do not expose "permutes" in them on
> > GIMPLE either, so why should we for VEC_UNPACK_*?
> 
> I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I
> hope to correct in GCC 16, or at least the point where we pick LOAD_LANES
> is too early.  So I don't think pointing to this is a convincing argument.
> It's only VLA that I think needs the IL, because you have to mask the
> group of operations and it may be hard to reconcile that later on.
> 
> > At what level are the simplifications you see happening then?
> 
> Well, they currently happen outside of the vectorizer passes themselves,
> more specifically in this case because VN runs match simplifications.

But match doesn't simplify permutes against .LOAD_LANES?  So it's about
"other" permutes (from loads) that get simplified?

> If the concern is that that's late, I can lift it to a pattern I suppose.
> I didn't use a pattern because similar changes in this area always just
> happened at codegen.

I was wondering how this plays with my idea of having us "lower"
or rather "code generate" to an intermediate SLP representation where
we split SLP groups on vector boundaries and are then free to
perform permute optimizations that need to know the vector type.

That said - match could as well combine VEC_UNPACK_* with a VEC_PERMUTE
with the catch that this duplicates patterns for the 
VEC_UNPACK_*/VEC_PERMUTE duality we have.
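
(Concretely, assuming little-endian and a V8HI -> V4SI zero extend,
where selector elements 8-15 pick lanes of the second (zero) operand:

  _1 = [vec_unpack_lo_expr] a_1;

and

  _2 = VEC_PERM_EXPR <a_1, { 0, ... }, { 0, 8, 1, 8, 2, 8, 3, 8 }>;
  _3 = VIEW_CONVERT_EXPR <vector(4) unsigned int> (_2);

compute the same thing, so each pattern written against one form would
need a twin for the other.)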

> > 
> > I do realize we have two ways of expressing zero-extending widenings
> > (also truncations btw) and that's always bad - so we could decide to
> > _always_ use VEC_PERMs as the canonical representation because those
> > combine more easily.  And either match VEC_PERMs back to vec_unpack
> > at RTL expansion time or require targets to expose those as constant
> > vec_perms as well.  There are targets like GCN where you can't do
> > unpacking with permutes of course, so we can't do away with them
> > (we could possibly force those targets to expose widening/truncation
> > solely with [us]ext and trunc patterns of course).
> 
> Ok, so your objection is that you don't want to have a different way of doing
> a single step zero extend vs a multi-step zero extend.

My objection is mainly that we do this based on a target decision and
without immediate effect on the vector loop and its costing - it's not
that we are then able to see we can combine the permutes with others,
say in SLP permute optimization.

> At the moment my patch doesn't care: if you return an unconditional true
> then for that target you get VEC_PERM for everything and the vectorizer
> won't ever spit out VEC_UNPACKU.
> 
> You're arguing that this should be the default, even if the target does not
> support it and then we have to somehow undo it during vec_lowering?

I argued that we possibly should do this by default and all targets
that can vec_unpack but not vec_perm_const with such a permute can
either implement the missing vec_perm_const or they are of the kind
that cannot use a permute for this (!modes_tieable_p).
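
(As a sketch of keeping that check local, reusing only the existing
queries - out_utype and indices are placeholder names, not code from
the patch:

  /* The permute runs in the input vector mode and the result is
     reinterpreted, so we need the exact constant permute plus tieable
     modes; otherwise fall back to VEC_UNPACK_*.  */
  if (can_vec_perm_const_p (TYPE_MODE (vectype_in), TYPE_MODE (vectype_in),
                            indices)
      && targetm.modes_tieable_p (TYPE_MODE (vectype_in),
                                  TYPE_MODE (out_utype)))
    /* ... emit the VEC_PERM_EXPR + VIEW_CONVERT_EXPR sequence ...  */

where out_utype is the unsigned variant of the output vector type and
indices the interleave-with-zero selector.)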

> Otherwise if the target doesn't support the permute it'll be scalarized..
> 
> I guess sure..  But then...
> 
> > There are targets like GCN where you can't do
> > unpacking with permutes of course, so we can't do away with them
> > (we could possibly force those targets to expose widening/truncation
> > solely with [us]ext and trunc patterns of course).
> 
> I guess if can_vec_perm_const_p fails we can undo it.. But it feels like
> we lose an element of preference here.  A target *could* do the permute,
> but not do it efficiently.

It can do it the same way it would do the vec_unpack?  Or what am I
missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?

> > 
> > > I think there's a lot of merit in open-encoding zero extends, but one
> > > reason this is beneficial on AArch64 for instance is that we can
> > > consume the zero register and rewrite the indices to a single-register
> > > TBL.  Two-register TBLs are slower on some implementations.
> > 
> > But this latter fact can be done by optimizing the RTL?
> 
> Sure, and we do so today.  That's why the example output in the cover
> letter has only one input register.  The point of this blurb was more to
> point out that whether the optimization is beneficial may depend on a
> specific uarch, and as such I believe a certain element of target buy-in
> is needed.
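
(For reference, that rewrite is what turns the two-register form

        tbl     v0.16b, {v23.16b, v31.16b}, v28.16b

into the single-register

        tbl     v0.16b, {v23.16b}, v28.16b

- indices that pointed into the zero register v31 are rewritten to
out-of-range ones, which TBL defines to yield 0.  Register numbers are
made up for illustration.)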

If it's dependent on uarch then even more so - why not simply
expand vec_unpack as tbl then?

> If you want me to do it unconditionally sure, I can do that...
> 
> If so, can I get a review on the other patches anyway?  They are mostly
> independent; they only have some dependencies on the output of the
> tests.

Sure, I'm behind stuff - sorry.

Richard.

> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > > > Using permutations has a big benefit for multi-step zero extensions
> > > > > because they both reduce the number of needed instructions and
> > > > > increase throughput, as the dependency chain is removed.
> > > > >
> > > > > Concretely on AArch64 this changes:
> > > > >
> > > > > void test4(unsigned char *x, long long *y, int n) {
> > > > >     for(int i = 0; i < n; i++) {
> > > > >         y[i] = x[i];
> > > > >     }
> > > > > }
> > > > >
> > > > > from generating:
> > > > >
> > > > > .L4:
> > > > >         ldr     q30, [x4], 16
> > > > >         add     x3, x3, 128
> > > > >         zip1    v1.16b, v30.16b, v31.16b
> > > > >         zip2    v30.16b, v30.16b, v31.16b
> > > > >         zip1    v2.8h, v1.8h, v31.8h
> > > > >         zip1    v0.8h, v30.8h, v31.8h
> > > > >         zip2    v1.8h, v1.8h, v31.8h
> > > > >         zip2    v30.8h, v30.8h, v31.8h
> > > > >         zip1    v26.4s, v2.4s, v31.4s
> > > > >         zip1    v29.4s, v0.4s, v31.4s
> > > > >         zip1    v28.4s, v1.4s, v31.4s
> > > > >         zip1    v27.4s, v30.4s, v31.4s
> > > > >         zip2    v2.4s, v2.4s, v31.4s
> > > > >         zip2    v0.4s, v0.4s, v31.4s
> > > > >         zip2    v1.4s, v1.4s, v31.4s
> > > > >         zip2    v30.4s, v30.4s, v31.4s
> > > > >         stp     q26, q2, [x3, -128]
> > > > >         stp     q28, q1, [x3, -96]
> > > > >         stp     q29, q0, [x3, -64]
> > > > >         stp     q27, q30, [x3, -32]
> > > > >         cmp     x4, x5
> > > > >         bne     .L4
> > > > >
> > > > > and instead we get:
> > > > >
> > > > > .L4:
> > > > >         add     x3, x3, 128
> > > > >         ldr     q23, [x4], 16
> > > > >         tbl     v5.16b, {v23.16b}, v31.16b
> > > > >         tbl     v4.16b, {v23.16b}, v30.16b
> > > > >         tbl     v3.16b, {v23.16b}, v29.16b
> > > > >         tbl     v2.16b, {v23.16b}, v28.16b
> > > > >         tbl     v1.16b, {v23.16b}, v27.16b
> > > > >         tbl     v0.16b, {v23.16b}, v26.16b
> > > > >         tbl     v22.16b, {v23.16b}, v25.16b
> > > > >         tbl     v23.16b, {v23.16b}, v24.16b
> > > > >         stp     q5, q4, [x3, -128]
> > > > >         stp     q3, q2, [x3, -96]
> > > > >         stp     q1, q0, [x3, -64]
> > > > >         stp     q22, q23, [x3, -32]
> > > > >         cmp     x4, x5
> > > > >         bne     .L4
> > > > >
> > > > > Tests are added in the AArch64 patch introducing the hook.  The
> > > > > testsuite also already had about 800 runtime tests that get affected
> > > > > by this.
> > > > >
> > > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu -m32, -m64 with no
> > > > > issues.
> > > > >
> > > > > Ok for master?
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >       * target.def (use_permute_for_promotion): New.
> > > > >       * doc/tm.texi.in: Document it.
> > > > >       * doc/tm.texi: Regenerate.
> > > > >       * targhooks.cc (default_use_permute_for_promotion): New.
> > > > >       * targhooks.h (default_use_permute_for_promotion): New.
> > > > >       * tree-vect-stmts.cc (vectorizable_conversion): Support direct
> > > > >       conversion with permute.
> > > > >       (vect_create_vectorized_promotion_stmts): Likewise.
> > > > >       (supportable_widening_operation): Likewise.
> > > > >       (vect_gen_perm_mask_any): Allow vector permutes where input
> > > > >       registers are half the width of the result per the GCC 14
> > > > >       relaxation of VEC_PERM_EXPR.
> > > > >
> > > > > ---
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> > > > >  all zeros.  GCC can then try to branch around the instruction instead.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > > > > +This hook returns true if the operation promoting @var{in_type} to
> > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > > > > +a signed type the operation will be done as the related unsigned type and
> > > > > +converted to @var{out_type}.  If the target supports the needed permute,
> > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> > > > > +beneficial to do so, the hook should return true; else return false.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> > > > >  This hook should initialize target-specific data structures in 
> > > > > preparation
> > > > >  for modeling the costs of vectorizing a loop or basic block.  The 
> > > > > default
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> > > > >
> > > > >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > > > >
> > > > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > > > > +
> > > > >  @hook TARGET_VECTORIZE_CREATE_COSTS
> > > > >
> > > > >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> > > > >   (unsigned ifn),
> > > > >   default_empty_mask_is_expensive)
> > > > >
> > > > > +/* Function to say whether a target supports and prefers to use permutes
> > > > > +   for zero extensions or truncates.  */
> > > > > +DEFHOOK
> > > > > +(use_permute_for_promotion,
> > > > > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > > > > +a signed type the operation will be done as the related unsigned type and\n\
> > > > > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\
> > > > > +beneficial to do so, the hook should return true; else return false.",
> > > > > + bool,
> > > > > + (const_tree in_type, const_tree out_type),
> > > > > + default_use_permute_for_promotion)
> > > > > +
> > > > >  /* Target builtin that implements vector gather operation.  */
> > > > >  DEFHOOK
> > > > >  (builtin_gather,
> > > > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > > > > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > > > > --- a/gcc/targhooks.h
> > > > > +++ b/gcc/targhooks.h
> > > > > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> > > > >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> > > > >  extern bool default_empty_mask_is_expensive (unsigned);
> > > > >  extern bool default_conditional_operation_is_expensive (unsigned);
> > > > > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > > > >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> > > > >
> > > > >  /* OpenACC hooks.  */
> > > > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > > > > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > > > > --- a/gcc/targhooks.cc
> > > > > +++ b/gcc/targhooks.cc
> > > > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> > > > >    return ifn == IFN_MASK_STORE;
> > > > >  }
> > > > >
> > > > > +/* By default no targets prefer permutes over multi-step extension.  */
> > > > > +
> > > > > +bool
> > > > > +default_use_permute_for_promotion (const_tree, const_tree)
> > > > > +{
> > > > > +  return false;
> > > > > +}
> > > > > +
> > > > >  /* By default consider masked stores to be expensive.  */
> > > > >
> > > > >  bool
> > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> > > > >    gimple *new_stmt1, *new_stmt2;
> > > > >    vec<tree> vec_tmp = vNULL;
> > > > >
> > > > > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
> > > > > +     one go.  */
> > > > > +  if (ch1 == VEC_PERM_EXPR
> > > > > +      && op_type == unary_op)
> > > > > +    {
> > > > > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > +      bool failed_p = false;
> > > > > +
> > > > > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > > > > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > +     {
> > > > > +       tree vectype_in = TREE_TYPE (vop0);
> > > > > +       tree vectype_out = TREE_TYPE (vec_dest);
> > > > > +       machine_mode mode_in = TYPE_MODE (vectype_in);
> > > > > +       machine_mode mode_out = TYPE_MODE (vectype_out);
> > > > > +       unsigned bitsize_in = element_precision (vectype_in);
> > > > > +       unsigned tot_in, tot_out;
> > > > > +       unsigned HOST_WIDE_INT count;
> > > > > +
> > > > > +       /* We can't really support VLA here as the indexes depend on the VL.
> > > > > +          VLA should really use widening instructions like widening
> > > > > +          loads.  */
> > > > > +       if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > > > > +           || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > > > > +           || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > > > > +           || !TYPE_UNSIGNED (vectype_in)
> > > > > +           || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > +                                                            vectype_out))
> > > > > +         {
> > > > > +           failed_p = true;
> > > > > +           break;
> > > > > +         }
> > > > > +
> > > > > +       unsigned steps = tot_out / bitsize_in;
> > > > > +       tree zero = build_zero_cst (vectype_in);
> > > > > +
> > > > > +       unsigned chunk_size
> > > > > +         = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > > > > +                      TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > > > > +       unsigned step_size = chunk_size * (tot_out / tot_in);
> > > > > +       unsigned nunits = tot_out / bitsize_in;
> > > > > +
> > > > > +       vec_perm_builder sel (steps, 1, 1);
> > > > > +       sel.quick_grow (steps);
> > > > > +
> > > > > +       /* Flood fill with the out of range value first.  */
> > > > > +       for (unsigned long i = 0; i < steps; ++i)
> > > > > +         sel[i] = count;
> > > > > +
> > > > > +       tree var;
> > > > > +       tree elem_in = TREE_TYPE (vectype_in);
> > > > > +       machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > > > > +       unsigned long idx = 0;
> > > > > +       tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > > > > +                                                         elem_in, nunits);
> > > > > +
> > > > > +       for (unsigned long j = 0; j < chunk_size; j++)
> > > > > +         {
> > > > > +           if (WORDS_BIG_ENDIAN)
> > > > > +             for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > > > > +               sel[i] = idx;
> > > > > +           else
> > > > > +             for (int i = 0; i < (int)steps; i += step_size, idx++)
> > > > > +               sel[i] = idx;
> > > > > +
> > > > > +           vec_perm_indices indices (sel, 2, steps);
> > > > > +
> > > > > +           tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > > > > +           auto vec_oprnd = make_ssa_name (vc_in);
> > > > > +           auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > > > > +                                                vop0, zero, perm_mask);
> > > > > +           vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +
> > > > > +           tree intvect_out = unsigned_type_for (vectype_out);
> > > > > +           var = make_ssa_name (intvect_out);
> > > > > +           new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > > > > +                                                        intvect_out,
> > > > > +                                                        vec_oprnd));
> > > > > +           vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +
> > > > > +           gcc_assert (ch2.is_tree_code ());
> > > > > +
> > > > > +           var = make_ssa_name (vectype_out);
> > > > > +           if (ch2 == VIEW_CONVERT_EXPR)
> > > > > +               new_stmt = gimple_build_assign (var,
> > > > > +                                               build1 (VIEW_CONVERT_EXPR,
> > > > > +                                                       vectype_out,
> > > > > +                                                       vec_oprnd));
> > > > > +           else
> > > > > +               new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > > > > +                                               vec_oprnd);
> > > > > +
> > > > > +           vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +           vec_tmp.safe_push (var);
> > > > > +         }
> > > > > +     }
> > > > > +
> > > > > +      if (!failed_p)
> > > > > +     {
> > > > > +       vec_oprnds0->release ();
> > > > > +       *vec_oprnds0 = vec_tmp;
> > > > > +       return;
> > > > > +     }
> > > > > +    }
> > > > > +
> > > > >    vec_tmp.create (vec_oprnds0->length () * 2);
> > > > >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > >      {
> > > > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > > > >         || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > > > >       goto unsupported;
> > > > >
> > > > > +      /* Check to see if the target can use a permute to perform the zero
> > > > > +      extension.  */
> > > > > +      intermediate_type = unsigned_type_for (vectype_out);
> > > > > +      if (TYPE_UNSIGNED (vectype_in)
> > > > > +       && VECTOR_TYPE_P (intermediate_type)
> > > > > +       && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > > > > +       && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > +                                                       intermediate_type))
> > > > > +     {
> > > > > +       code1 = VEC_PERM_EXPR;
> > > > > +       code2 = FLOAT_EXPR;
> > > > > +       break;
> > > > > +     }
> > > > > +
> > > > >        fltsz = GET_MODE_SIZE (lhs_mode);
> > > > >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > > > >       {
> > > > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> > > > >    tree mask_type;
> > > > >
> > > > >    poly_uint64 nunits = sel.length ();
> > > > > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > > > > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > > > > +           || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> > > > >
> > > > >    mask_type = build_vector_type (ssizetype, nunits);
> > > > >    return vec_perm_indices_to_tree (mask_type, sel);
> > > > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> > > > >        break;
> > > > >
> > > > >      CASE_CONVERT:
> > > > > -      c1 = VEC_UNPACK_LO_EXPR;
> > > > > -      c2 = VEC_UNPACK_HI_EXPR;
> > > > > +      {
> > > > > +     tree cvt_type = unsigned_type_for (vectype_out);
> > > > > +     if (TYPE_UNSIGNED (vectype_in)
> > > > > +       && VECTOR_TYPE_P (cvt_type)
> > > > > +       && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > > > > +       && targetm.vectorize.use_permute_for_promotion (vectype_in, cvt_type))
> > > > > +       {
> > > > > +         *code1 = VEC_PERM_EXPR;
> > > > > +         *code2 = VIEW_CONVERT_EXPR;
> > > > > +         return true;
> > > > > +       }
> > > > > +     c1 = VEC_UNPACK_LO_EXPR;
> > > > > +     c2 = VEC_UNPACK_HI_EXPR;
> > > > > +      }
> > > > >        break;
> > > > >
> > > > >      case FLOAT_EXPR:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
