On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, October 15, 2024 1:20 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: RE: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > VEC_PERM_EXPR
> >
> > On Tue, 15 Oct 2024, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Tuesday, October 15, 2024 12:13 PM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends
> > > > using VEC_PERM_EXPR
> > > >
> > > > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the look,
> > > > >
> > > > > The 10/15/2024 09:54, Richard Biener wrote:
> > > > > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > This patch series adds support for a target to do a direct conversion
> > > > > > > for zero extends using permutes.
> > > > > > >
> > > > > > > To do this it uses a target hook use_permute_for_promotion which must
> > > > > > > be implemented by targets.  This hook is used to indicate:
> > > > > > >
> > > > > > > 1. can a target do this for the given modes.
> > > > > >
> > > > > > can_vec_perm_const_p?
> > > > > >
> > > > > > > 3. can the target convert between various vector modes with a
> > > > > > >    VIEW_CONVERT.
> > > > > >
> > > > > > We have modes_tieable_p for this I think.
> > > > > >
> > > > >
> > > > > Yes, though the reason I didn't use either of them was because they are
> > > > > reporting a capability of the backend.  In which case the hook, which is
> > > > > already backend specific, should answer these two.
> > > > >
> > > > > I initially had these checks there, but they didn't seem to add value:
> > > > > for promotions the masks are only dependent on the input and output
> > > > > modes, so they really don't change.
> > > > >
> > > > > When you have, say, a loop that does lots of conversions from char to
> > > > > int, it seemed like a waste to retest the same permute constants over
> > > > > and over again.
> > > > >
> > > > > I can add them back in if you prefer...
> > > > >
> > > > > > > 2. is it profitable for the target to do it.
> > > > > >
> > > > > > So you say the target can do both ways but both zip and tbl are
> > > > > > permute instructions so I really fail to see the point and why
> > > > > > the target itself doesn't choose to use tbl for unpack.
> > > > > >
> > > > > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > > > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > > > > >
> > > > >
> > > > > Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> > > > > The reason for exposing them as VEC_PERM was to trigger further
> > > > > optimizations.
> > > > >
> > > > > If you remember the ticket about LOAD_LANES, with this optimization and
> > > > > an open encoding of LOAD_LANES we stop using it in cases where there's a
> > > > > zero extend after the LOAD_LANES, because then you're doing effectively
> > > > > two permutes and the LOAD_LANES is no longer beneficial.
> > > > > There are other examples, load and replicate etc.
> > > > >
> > > > > > That said, I'm not against supporting VEC_PERM code gen from
> > > > > > unsigned promotion but I don't see why we should do this when
> > > > > > the target advertises VEC_UNPACK_* support or direct conversion
> > > > > > support?
> > > > > >
> > > > > > Esp. with adding a "local" cost related hook which cannot take
> > > > > > into account context.
> > > > > >
> > > > >
> > > > > To summarize a long story:
> > > > >
> > > > > Yes, I open encode zero extends as permutes to allow further
> > > > > optimizations.  One could convert vec_unpacks to convert optabs and use
> > > > > that, but that is an opaque value that can't be further optimized.
> > > > >
> > > > > The hook isn't really a costing thing in the general sense.  It's
> > > > > literally just "do you want permutes, yes or no".  The reason it gets the
> > > > > modes is simply that I don't think a single-level extend is worth it, but
> > > > > I can just change it to never try to do this on more than one level.
> > > >
> > > > When you mention LOAD_LANES we do not expose "permutes" in them on GIMPLE
> > > > either, so why should we for VEC_UNPACK_*.
> > >
> > > I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I hope
> > > to correct in GCC-16.  Or at least the time we pick LOAD_LANES is too early.
> > > So I don't think pointing to this is a convincing argument.  It's only VLA
> > > that I think needs the IL, because you have to mask the group of operations
> > > and it may be hard to reconcile that later on.
> > >
> > > > At what level are the simplifications you see happening then?
> > >
> > > Well, they are currently happening outside of the vectorizer passes
> > > themselves, more specifically in this case because VN runs match
> > > simplifications.
> >
> > But match doesn't simplify permutes against .LOAD_LANES?  So it's about
> > "other" permutes (from loads) that get simplified?
>
> Yes, or other permutes after the zero extend.  I shouldn't have mentioned
> LOAD_LANES; I think that moved the discussion to the wrong place.
>
> > > If the concern is that that's late I can lift it to a pattern I suppose.
> > > I didn't use a pattern because similar changes in this area always just
> > > happened at codegen.
> >
> > I was wondering how this plays with my idea of having us "lower"
> > or rather "code generate" to an intermediate SLP representation where
> > we split SLP groups on vector boundaries and are then free to
> > perform permute optimizations that need to know the vector type.
> >
> > That said - match could as well combine VEC_UNPACK_* with a VEC_PERMUTE
> > with the catch that this duplicates patterns for the
> > VEC_UNPACK_*/VEC_PERMUTE duality we have.
> >
> > > > I do realize we have two ways of expressing zero-extending widenings
> > > > (also truncations btw) and that's always bad - so we could decide to
> > > > _always_ use VEC_PERMs as the canonical representation because those
> > > > combine more easily.  And either match VEC_PERMs back to vec_unpack
> > > > at RTL expansion time or require targets to expose those as constant
> > > > vec_perms as well.
> > > > There are targets like GCN where you can't do
> > > > unpacking with permutes of course, so we can't do away with them
> > > > (we could possibly force those targets to expose widening/truncation
> > > > solely with [us]ext and trunc patterns of course).
> > >
> > > Ok, so your objection is that you don't want to have a different way of
> > > doing a single-step zero extend vs. a multi-step zero extend.
> >
> > My objection is mainly that we do this based on a target decision and
> > without immediate effect on the vector loop and its costing - it's not
> > that we are then able to see we can combine the permutes with others,
> > say in SLP permute optimization.
>
> I can fix that by lifting the code up as a pattern so it does affect costing
> directly and also gets seen by the vectorizer's permute simplification.
>
> I agree that that would be a better place for it.  Does that address the
> issue?  Then at least the target decision directly affects vectorization
> like other patterns.
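>
> For concreteness, the open-coded form is a single permute against a zero
> vector followed by a view-convert.  For a little-endian V16QI -> V2DI zero
> extend the first result looks roughly like this (a hand-written sketch of
> the shape, not actual dump output):
>
>   _1 = VEC_PERM_EXPR <x_16qi, { 0, ... }, { 0, 16, 16, 16, 16, 16, 16, 16,
>                                             1, 16, 16, 16, 16, 16, 16, 16 }>;
>   _2 = VIEW_CONVERT_EXPR <vector(2) long long unsigned int> (_1);
>
> Every index >= 16 selects a zero byte from the second operand, so each
> eight-byte group spells out one zero-extended value directly.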

So - how can you teach the SLP permute optimization to treat converts as
permutes?  I think since you can't really do this as a pattern it doesn't
fit a VEC_PERM SLP node either?  Or maybe you can have
VEC_PERM <{a}, {0}, { [0:0], [1:0] }> followed by a node with a
VIEW_CONVERT_EXPR to a wider element type?  So it might be fully
implementable in SLP permute optimization?

> > > At the moment my patch doesn't care: if you return an unconditional true
> > > then for that target you get VEC_PERM for everything and the vectorizer
> > > won't ever spit out VEC_UNPACKU.
> > >
> > > You're arguing that this should be the default, even if the target does
> > > not support it, and then we have to somehow undo it during vec_lowering?
> >
> > I argued that we possibly should do this by default and all targets
> > that can vec_unpack but not vec_perm_const with such a permute can
> > either implement the missing vec_perm_const or they are of the kind
> > that cannot use a permute for this (!modes_tieable_p).
>
> Ok, and I assume this would catch targets like GCN?  I don't know much about
> what can be converted or not there.  I'll go check their modes_tieable_p.

GCN can't pun a V8HI to a V4SI vector, yes.

> > > Otherwise if the target doesn't support the permute it'll be scalarized...
> > >
> > > I guess sure...  But then...
> > >
> > > > There are targets like GCN where you can't do
> > > > unpacking with permutes of course, so we can't do away with them
> > > > (we could possibly force those targets to expose widening/truncation
> > > > solely with [us]ext and trunc patterns of course).
> > >
> > > I guess if can_vec_perm_const_p fails we can undo it...  But it feels like
> > > we lose an element of preference here.  A target *could* do the permute,
> > > but not do it efficiently.
> >
> > It can do it the same way it would do the vec_unpack?  Or what am I
> > missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?
>
> It replicates a series of them, yeah.  What I meant with the above is what
> should happen for targets that haven't implemented vec_perm_const, but I
> suppose the previous paragraph addresses this.
>
> > > > > I think there's a lot of merit in open-encoding zero extends, but one
> > > > > reason this is beneficial on AArch64 for instance is that we can
> > > > > consume the zero register and rewrite the indices to a single-register
> > > > > TBL.  Two-register TBLs are slower on some implementations.
> > > >
> > > > But this latter fact can be done by optimizing the RTL?
> > >
> > > Sure, and we do so today.  That's why the example output in the cover
> > > letter has only one input register.  The point of this blurb was more to
> > > point out that the optimization being beneficial may depend on a specific
> > > uarch and as such I believe that a certain element of target buy-in is
> > > needed.
> >
> > If it's dependent on uarch then even more so - why not simply
> > expand vec_unpack as tbl then?
>
> We expand them as ZIPs, because these don't require a lookup table index.
> However, again, these are only single-level unpacks.  It doesn't work for
> this case of multi-level unpacks.  For something like byte -> long, or worse
> byte -> double, the number of instructions to match in combine would exceed
> its limit.
>
> Additionally they require a lot of patterns.  So simply, we cannot recombine
> multi-level unpacks in RTL.
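>
> (As an aside on the single-register TBL: the architecture defines TBL to
> write zeroes for out-of-range index bytes, so once the second operand is
> known to be all-zero, the indices that pointed at it can be rewritten to
> out-of-range values and the zero source dropped.  For the byte -> long case
> the first index vector then becomes, roughly - the exact constants here are
> illustrative, not lifted from compiler output:
>
>   { 0, 255, 255, 255, 255, 255, 255, 255, 1, 255, 255, 255, 255, 255, 255, 255 }
>
> which needs only the one source register.)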
>
> The backend however will do something sensible given a VEC_PERM_EXPR.
> But I think this is just a detail we're getting into.
>
> It sounds like you're ok with doing it unconditionally for any target that
> supports the permutes, and lifting it pre-analysis (like in a pattern) so
> it's costed?

I _think_ that I'd be OK to do this as canonicalization, but as said it
requires buy-in and work in all targets.  We should be able to get rid of
VEC_UNPACK_HI/LO as a tree code then; GCN doesn't (cannot) implement any of
those but uses [sz]ext/trunc exclusively IIRC.

Richard.

> Did I understand that right?
>
> Thanks for the discussion so far.
>
> Tamar
>
> > > If you want me to do it unconditionally, sure, I can do that...
> > >
> > > If so, can I get a review on the other patches anyway?  They are mostly
> > > independent; they only have some dependencies on the output of the tests.
> >
> > Sure, I'm behind stuff - sorry.
> >
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > > > Richard.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > > > >
> > > > > > > Using permutations has a big benefit for multi-step zero extensions
> > > > > > > because they both reduce the number of needed instructions and
> > > > > > > increase throughput, as the dependency chain is removed.
> > > > > > >
> > > > > > > Concretely on AArch64 this changes:
> > > > > > >
> > > > > > > void test4(unsigned char *x, long long *y, int n) {
> > > > > > >     for(int i = 0; i < n; i++) {
> > > > > > >         y[i] = x[i];
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > from generating:
> > > > > > >
> > > > > > > .L4:
> > > > > > >         ldr     q30, [x4], 16
> > > > > > >         add     x3, x3, 128
> > > > > > >         zip1    v1.16b, v30.16b, v31.16b
> > > > > > >         zip2    v30.16b, v30.16b, v31.16b
> > > > > > >         zip1    v2.8h, v1.8h, v31.8h
> > > > > > >         zip1    v0.8h, v30.8h, v31.8h
> > > > > > >         zip2    v1.8h, v1.8h, v31.8h
> > > > > > >         zip2    v30.8h, v30.8h, v31.8h
> > > > > > >         zip1    v26.4s, v2.4s, v31.4s
> > > > > > >         zip1    v29.4s, v0.4s, v31.4s
> > > > > > >         zip1    v28.4s, v1.4s, v31.4s
> > > > > > >         zip1    v27.4s, v30.4s, v31.4s
> > > > > > >         zip2    v2.4s, v2.4s, v31.4s
> > > > > > >         zip2    v0.4s, v0.4s, v31.4s
> > > > > > >         zip2    v1.4s, v1.4s, v31.4s
> > > > > > >         zip2    v30.4s, v30.4s, v31.4s
> > > > > > >         stp     q26, q2, [x3, -128]
> > > > > > >         stp     q28, q1, [x3, -96]
> > > > > > >         stp     q29, q0, [x3, -64]
> > > > > > >         stp     q27, q30, [x3, -32]
> > > > > > >         cmp     x4, x5
> > > > > > >         bne     .L4
> > > > > > >
> > > > > > > and instead we get:
> > > > > > >
> > > > > > > .L4:
> > > > > > >         add     x3, x3, 128
> > > > > > >         ldr     q23, [x4], 16
> > > > > > >         tbl     v5.16b, {v23.16b}, v31.16b
> > > > > > >         tbl     v4.16b, {v23.16b}, v30.16b
> > > > > > >         tbl     v3.16b, {v23.16b}, v29.16b
> > > > > > >         tbl     v2.16b, {v23.16b}, v28.16b
> > > > > > >         tbl     v1.16b, {v23.16b}, v27.16b
> > > > > > >         tbl     v0.16b, {v23.16b}, v26.16b
> > > > > > >         tbl     v22.16b, {v23.16b}, v25.16b
> > > > > > >         tbl     v23.16b, {v23.16b}, v24.16b
> > > > > > >         stp     q5, q4, [x3, -128]
> > > > > > >         stp     q3, q2, [x3, -96]
> > > > > > >         stp     q1, q0, [x3, -64]
> > > > > > >         stp     q22, q23, [x3, -32]
> > > > > > >         cmp     x4, x5
> > > > > > >         bne     .L4
> > > > > > >
> > > > > > > Tests are added in the AArch64 patch introducing the hook.  The
> > > > > > > testsuite also already had about 800 runtime tests that get affected
> > > > > > > by this.
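> > > > > > >
> > > > > > > In essence the selector for one output vector is built as below (a
> > > > > > > simplified sketch for little-endian only; steps, ratio and nunits_in
> > > > > > > are illustrative names, and the real code below also handles
> > > > > > > big-endian and the remaining chunks):
> > > > > > >
> > > > > > >   /* steps = output-vector bits / input element bits,
> > > > > > >      ratio = output element bits / input element bits,
> > > > > > >      nunits_in = number of lanes in the input vector.  */
> > > > > > >   vec_perm_builder sel (steps, 1, 1);
> > > > > > >   sel.quick_grow (steps);
> > > > > > >   /* Flood fill: indices >= nunits_in pick lanes of the zero vector.  */
> > > > > > >   for (unsigned i = 0; i < steps; ++i)
> > > > > > >     sel[i] = nunits_in;
> > > > > > >   /* Scatter the value lanes into the low slot of each group.  */
> > > > > > >   for (unsigned i = 0, lane = 0; i < steps; i += ratio, ++lane)
> > > > > > >     sel[i] = lane;
> > > > > > >
> > > > > > > Each output element then reads one input lane in its low slot and
> > > > > > > zeros elsewhere, which is what makes the result safe to view-convert
> > > > > > > to the wider element type.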
> > > > > > >
> > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu -m32, -m64 and no
> > > > > > > issues.
> > > > > > >
> > > > > > > Ok for master?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Tamar
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         * target.def (use_permute_for_promotion): New.
> > > > > > >         * doc/tm.texi.in: Document it.
> > > > > > >         * doc/tm.texi: Regenerate.
> > > > > > >         * targhooks.cc (default_use_permute_for_promotion): New.
> > > > > > >         * targhooks.h (default_use_permute_for_promotion): New.
> > > > > > >         * tree-vect-stmts.cc (vectorizable_conversion): Support direct
> > > > > > >         conversion with permute.
> > > > > > >         (vect_create_vectorized_promotion_stmts): Likewise.
> > > > > > >         (supportable_widening_operation): Likewise.
> > > > > > >         (vect_gen_perm_mask_any): Allow vector permutes where input
> > > > > > >         registers are half the width of the result per the GCC 14
> > > > > > >         relaxation of VEC_PERM_EXPR.
> > > > > > >
> > > > > > > ---
> > > > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > > > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > > > > > > --- a/gcc/doc/tm.texi
> > > > > > > +++ b/gcc/doc/tm.texi
> > > > > > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> > > > > > >  all zeros.  GCC can then try to branch around the instruction instead.
> > > > > > >  @end deftypefn
> > > > > > >
> > > > > > > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > > > > > > +This hook returns true if the operation promoting @var{in_type} to
> > > > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > > > > > > +a signed type the operation will be done as the related unsigned type and
> > > > > > > +converted to @var{out_type}.  If the target supports the needed permute,
> > > > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is
> > > > > > > +beneficial to do so, the hook should return true, else false should be
> > > > > > > +returned.
> > > > > > > +@end deftypefn
> > > > > > > +
> > > > > > >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> > > > > > >  This hook should initialize target-specific data structures in preparation
> > > > > > >  for modeling the costs of vectorizing a loop or basic block.  The default
> > > > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > > > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > > > > > > --- a/gcc/doc/tm.texi.in
> > > > > > > +++ b/gcc/doc/tm.texi.in
> > > > > > > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> > > > > > >
> > > > > > >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > > > > > >
> > > > > > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > > > > > > +
> > > > > > >  @hook TARGET_VECTORIZE_CREATE_COSTS
> > > > > > >
> > > > > > >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > > > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > > > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > > > > > > --- a/gcc/target.def
> > > > > > > +++ b/gcc/target.def
> > > > > > > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> > > > > > >   (unsigned ifn),
> > > > > > >   default_empty_mask_is_expensive)
> > > > > > >
> > > > > > > +/* Function to say whether a target supports and prefers to use permutes
> > > > > > > +   for zero extensions or truncates.  */
> > > > > > > +DEFHOOK
> > > > > > > +(use_permute_for_promotion,
> > > > > > > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > > > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > > > > > > +a signed type the operation will be done as the related unsigned type and\n\
> > > > > > > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > > > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is\n\
> > > > > > > +beneficial to do so, the hook should return true, else false should be returned.",
> > > > > > > + bool,
> > > > > > > + (const_tree in_type, const_tree out_type),
> > > > > > > + default_use_permute_for_promotion)
> > > > > > > +
> > > > > > >  /* Target builtin that implements vector gather operation.  */
> > > > > > >  DEFHOOK
> > > > > > >  (builtin_gather,
> > > > > > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > > > > > > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > > > > > > --- a/gcc/targhooks.h
> > > > > > > +++ b/gcc/targhooks.h
> > > > > > > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> > > > > > >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> > > > > > >  extern bool default_empty_mask_is_expensive (unsigned);
> > > > > > >  extern bool default_conditional_operation_is_expensive (unsigned);
> > > > > > > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > > > > > >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> > > > > > >
> > > > > > >  /* OpenACC hooks.  */
> > > > > > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > > > > > > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > > > > > > --- a/gcc/targhooks.cc
> > > > > > > +++ b/gcc/targhooks.cc
> > > > > > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> > > > > > >    return ifn == IFN_MASK_STORE;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* By default no targets prefer permutes over multi-step extension.  */
> > > > > > > +
> > > > > > > +bool
> > > > > > > +default_use_permute_for_promotion (const_tree, const_tree)
> > > > > > > +{
> > > > > > > +  return false;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /* By default consider masked stores to be expensive.  */
> > > > > > >
> > > > > > >  bool
> > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> > > > > > >    gimple *new_stmt1, *new_stmt2;
> > > > > > >    vec<tree> vec_tmp = vNULL;
> > > > > > >
> > > > > > > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final
> > > > > > > +     type in one go.  */
> > > > > > > +  if (ch1 == VEC_PERM_EXPR
> > > > > > > +      && op_type == unary_op)
> > > > > > > +    {
> > > > > > > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > > > +      bool failed_p = false;
> > > > > > > +
> > > > > > > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > > > > > > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > > > +        {
> > > > > > > +          tree vectype_in = TREE_TYPE (vop0);
> > > > > > > +          tree vectype_out = TREE_TYPE (vec_dest);
> > > > > > > +          machine_mode mode_in = TYPE_MODE (vectype_in);
> > > > > > > +          machine_mode mode_out = TYPE_MODE (vectype_out);
> > > > > > > +          unsigned bitsize_in = element_precision (vectype_in);
> > > > > > > +          unsigned tot_in, tot_out;
> > > > > > > +          unsigned HOST_WIDE_INT count;
> > > > > > > +
> > > > > > > +          /* We can't really support VLA here as the indexes depend on
> > > > > > > +             the VL.  VLA should really use widening instructions like
> > > > > > > +             widening loads.  */
> > > > > > > +          if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > > > > > > +              || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > > > > > > +              || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > > > > > > +              || !TYPE_UNSIGNED (vectype_in)
> > > > > > > +              || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                              vectype_out))
> > > > > > > +            {
> > > > > > > +              failed_p = true;
> > > > > > > +              break;
> > > > > > > +            }
> > > > > > > +
> > > > > > > +          unsigned steps = tot_out / bitsize_in;
> > > > > > > +          tree zero = build_zero_cst (vectype_in);
> > > > > > > +
> > > > > > > +          unsigned chunk_size
> > > > > > > +            = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > > > > > > +                         TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > > > > > > +          unsigned step_size = chunk_size * (tot_out / tot_in);
> > > > > > > +          unsigned nunits = tot_out / bitsize_in;
> > > > > > > +
> > > > > > > +          vec_perm_builder sel (steps, 1, 1);
> > > > > > > +          sel.quick_grow (steps);
> > > > > > > +
> > > > > > > +          /* Flood fill with the out-of-range value first.  */
> > > > > > > +          for (unsigned long i = 0; i < steps; ++i)
> > > > > > > +            sel[i] = count;
> > > > > > > +
> > > > > > > +          tree var;
> > > > > > > +          tree elem_in = TREE_TYPE (vectype_in);
> > > > > > > +          machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > > > > > > +          unsigned long idx = 0;
> > > > > > > +          tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > > > > > > +                                                            elem_in, nunits);
> > > > > > > +
> > > > > > > +          for (unsigned long j = 0; j < chunk_size; j++)
> > > > > > > +            {
> > > > > > > +              if (WORDS_BIG_ENDIAN)
> > > > > > > +                for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > > > > > > +                  sel[i] = idx;
> > > > > > > +              else
> > > > > > > +                for (int i = 0; i < (int) steps; i += step_size, idx++)
> > > > > > > +                  sel[i] = idx;
> > > > > > > +
> > > > > > > +              vec_perm_indices indices (sel, 2, steps);
> > > > > > > +
> > > > > > > +              tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > > > > > > +              auto vec_oprnd = make_ssa_name (vc_in);
> > > > > > > +              auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > > > > > > +                                                   vop0, zero, perm_mask);
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +
> > > > > > > +              tree intvect_out = unsigned_type_for (vectype_out);
> > > > > > > +              var = make_ssa_name (intvect_out);
> > > > > > > +              new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > > > > > > +                                                           intvect_out,
> > > > > > > +                                                           vec_oprnd));
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +
> > > > > > > +              gcc_assert (ch2.is_tree_code ());
> > > > > > > +
> > > > > > > +              var = make_ssa_name (vectype_out);
> > > > > > > +              if (ch2 == VIEW_CONVERT_EXPR)
> > > > > > > +                new_stmt = gimple_build_assign (var,
> > > > > > > +                                                build1 (VIEW_CONVERT_EXPR,
> > > > > > > +                                                        vectype_out,
> > > > > > > +                                                        vec_oprnd));
> > > > > > > +              else
> > > > > > > +                new_stmt = gimple_build_assign (var, (tree_code) ch2,
> > > > > > > +                                                vec_oprnd);
> > > > > > > +
> > > > > > > +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > > > +              vec_tmp.safe_push (var);
> > > > > > > +            }
> > > > > > > +        }
> > > > > > > +
> > > > > > > +      if (!failed_p)
> > > > > > > +        {
> > > > > > > +          vec_oprnds0->release ();
> > > > > > > +          *vec_oprnds0 = vec_tmp;
> > > > > > > +          return;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > >    vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > > >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > > >      {
> > > > > > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > > > > > >           || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > > > > > >         goto unsupported;
> > > > > > >
> > > > > > > +      /* Check to see if the target can use a permute to perform the
> > > > > > > +         zero extension.  */
> > > > > > > +      intermediate_type = unsigned_type_for (vectype_out);
> > > > > > > +      if (TYPE_UNSIGNED (vectype_in)
> > > > > > > +          && VECTOR_TYPE_P (intermediate_type)
> > > > > > > +          && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > > > > > > +          && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                          intermediate_type))
> > > > > > > +        {
> > > > > > > +          code1 = VEC_PERM_EXPR;
> > > > > > > +          code2 = FLOAT_EXPR;
> > > > > > > +          break;
> > > > > > > +        }
> > > > > > > +
> > > > > > >        fltsz = GET_MODE_SIZE (lhs_mode);
> > > > > > >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > > > > > >          {
> > > > > > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> > > > > > >    tree mask_type;
> > > > > > >
> > > > > > >    poly_uint64 nunits = sel.length ();
> > > > > > > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > > > > > > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > > > > > > +              || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> > > > > > >
> > > > > > >    mask_type = build_vector_type (ssizetype, nunits);
> > > > > > >    return vec_perm_indices_to_tree (mask_type, sel);
> > > > > > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> > > > > > >        break;
> > > > > > >
> > > > > > >      CASE_CONVERT:
> > > > > > > -      c1 = VEC_UNPACK_LO_EXPR;
> > > > > > > -      c2 = VEC_UNPACK_HI_EXPR;
> > > > > > > +      {
> > > > > > > +        tree cvt_type = unsigned_type_for (vectype_out);
> > > > > > > +        if (TYPE_UNSIGNED (vectype_in)
> > > > > > > +            && VECTOR_TYPE_P (cvt_type)
> > > > > > > +            && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > > > > > > +            && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > > > +                                                            cvt_type))
> > > > > > > +          {
> > > > > > > +            *code1 = VEC_PERM_EXPR;
> > > > > > > +            *code2 = VIEW_CONVERT_EXPR;
> > > > > > > +            return true;
> > > > > > > +          }
> > > > > > > +        c1 = VEC_UNPACK_LO_EXPR;
> > > > > > > +        c2 = VEC_UNPACK_HI_EXPR;
> > > > > > > +      }
> > > > > > >        break;
> > > > > > >
> > > > > > >      case FLOAT_EXPR:
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Richard Biener <rguent...@suse.de>
> > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)