On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, October 15, 2024 1:42 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: Re: [PATCH 4/4]middle-end: create the longest possible zero extend 
> > chain
> > after overwidening
> > 
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > Consider loops such as:
> > >
> > > void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > >     for(int i = 0; i < n; i++) {
> > >         y[i] = k + x[i];
> > >     }
> > > }
> > >
> > > where today we generate:
> > >
> > > .L5:
> > >         ldr     q29, [x5], 16
> > >         add     x4, x4, 128
> > >         uaddl   v1.8h, v29.8b, v30.8b
> > >         uaddl2  v29.8h, v29.16b, v30.16b
> > >         zip1    v2.8h, v1.8h, v31.8h
> > >         zip1    v0.8h, v29.8h, v31.8h
> > >         zip2    v1.8h, v1.8h, v31.8h
> > >         zip2    v29.8h, v29.8h, v31.8h
> > >         sxtl    v25.2d, v2.2s
> > >         sxtl    v28.2d, v0.2s
> > >         sxtl    v27.2d, v1.2s
> > >         sxtl    v26.2d, v29.2s
> > >         sxtl2   v2.2d, v2.4s
> > >         sxtl2   v0.2d, v0.4s
> > >         sxtl2   v1.2d, v1.4s
> > >         sxtl2   v29.2d, v29.4s
> > >         stp     q25, q2, [x4, -128]
> > >         stp     q27, q1, [x4, -96]
> > >         stp     q28, q0, [x4, -64]
> > >         stp     q26, q29, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > Note how the zero extend from short to long is, halfway through the chain,
> > > transformed into a sign extend.  There are two problems with this:
> > >
> > >   1. sign extends are typically slower than zero extends on many uArches.
> > >   2. it prevents vectorizable_conversion from attempting to do a
> > >      single-step promotion.
> > >
> > > These sign extends happen due to the various range-reduction optimizations
> > > and patterns we have, such as multiplication widening, etc.
> > >
> > > My first attempt to fix this was to update the patterns so that, when the
> > > original source is a zero extend, they do not add the intermediate sign
> > > extend.
> > >
> > > However, this behavior happens in many other places, and as new patterns
> > > get added the problem can be re-introduced.
> > >
> > > Instead I have added a new pattern, vect_recog_zero_extend_chain_pattern,
> > > that attempts to simplify and extend an existing zero extend over multiple
> > > conversion statements.
> > >
> > > As an example, T3 a = (T3)(signed T2)(unsigned T1)x where bitsize
> > > T3 > T2 > T1 gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.
> > >
> > > The final cast to signed is kept so the types in the tree still match.  It
> > > will be correctly elided later on.
> > >
> > > This representation is optimal, as vectorizable_conversion is already able
> > > to decompose a long promotion into multiple steps if the target does not
> > > support it in a single step.  More importantly, it allows us to do proper
> > > costing and to support conversions such as (double)x, where bitsize(x) <
> > > int, in an efficient manner.
> > >
> > > To do this I have used Ranger's on-demand analysis to check whether an
> > > extension can be removed and extended to a zero extend.  The reason for
> > > this is that the vectorizer introduces several patterns that are not in
> > > the IL, but also lots of widening IFNs, for which handling in a switch
> > > wouldn't be very future-proof.
> > >
> > > I did try to do it without Ranger, but Ranger had three benefits:
> > >
> > > 1.  It simplified the handling of the IL changes the vectorizer
> > >     introduces, and makes it future-proof.
> > > 2.  Ranger has the advantage of doing the transformation in cases where it
> > >     knows that the top bits of the value are zero, which we wouldn't be
> > >     able to tell by looking purely at statements.
> > > 3.  Ranger simplified the handling of corner cases.  Without it the
> > >     handling was quite complex and I wasn't very confident in its
> > >     correctness.
> > >
> > > So I think Ranger is the right way to go here...  With these changes the
> > > above now generates:
> > >
> > > .L5:
> > >         add     x4, x4, 128
> > >         ldr     q26, [x5], 16
> > >         uaddl   v2.8h, v26.8b, v31.8b
> > >         uaddl2  v26.8h, v26.16b, v31.16b
> > >         tbl     v4.16b, {v2.16b}, v30.16b
> > >         tbl     v3.16b, {v2.16b}, v29.16b
> > >         tbl     v24.16b, {v2.16b}, v28.16b
> > >         tbl     v1.16b, {v26.16b}, v30.16b
> > >         tbl     v0.16b, {v26.16b}, v29.16b
> > >         tbl     v25.16b, {v26.16b}, v28.16b
> > >         tbl     v2.16b, {v2.16b}, v27.16b
> > >         tbl     v26.16b, {v26.16b}, v27.16b
> > >         stp     q4, q3, [x4, -128]
> > >         stp     q1, q0, [x4, -64]
> > >         stp     q24, q2, [x4, -96]
> > >         stp     q25, q26, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > I have also seen similar improvements in codegen on Arm and x86_64, 
> > > especially
> > > with AVX512.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, 
> > > arm-none-linux-gnueabihf,
> > > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> > >
> > > Hopefully Ok for master?
> > 
> > Hohumm.  So I looked at one of the examples and I don't see any
> > sign-extends in the IL we vectorize.  So your pattern is about
> > changing int -> double to unsigned int -> double but only so
> > a required intermediate int -> long conversion is done as
> > zero-extend?  IMO this doesn't belong to patterns but to
> > vectorizable_conversion, specifically the step determining the
> > intermediate types.
> 
> There is; it's not in the C source but created by the vectorizer.
> So the way I saw it, vectorizable_conversion should be given the choice of
> what it wants to do, whether it wants to do it in one operation or multiple.
> 
> If the goal was to just get rid of the final zero extend, yes I would agree.
> 
> But that's just part of the goal; the other part is to have the zero extend
> explicitly exposed to vectorizable_conversion.  That way the patch that uses
> TBL actually sees the long multi-step conversion.
> 
> My worry was that if done in vectorizable_conversion, while I can walk the IL,
> we'd cost the intermediate casts.  On AArch64 the cost model takes into
> account the throughput of sequences, not just latencies, and the TBLs have
> better throughput.  So for costing you really do want to see the full thing
> and not cost the intermediate conversions.
> 
> This is why I used a pattern, since the IL is actually changed from the
> input.  But see below...
> 
> > 
> > I don't quite understand what scalar pattern IL you feed to
> > the vectorizer in the end, few comments - also to this effect,
> > below.
> 
> Sure, I'll answer this and the question below in one go:
> 
> Lets pick a simple example:
> 
> void test8(unsigned char *x, double *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
> 
> In GIMPLE this generates:
> 
>   _4 = *_3;
>   _5 = (int) _4;
>   _6 = _5 + _29;
>   _9 = (double) _6;
> 
> i.e. the unsigned char is widened to int, added to k as int and then
> converted to a double.
> 
> When we start vectorizing, overwidening detection runs:
> 
> note:   vect_recog_over_widening_pattern: detected: _6 = _5 + _29;
> note:   demoting int to unsigned short
> note:   Splitting statement: _5 = (int) _4;
> note:   into pattern statements: patt_32 = (unsigned short) _4;
> note:   and: patt_31 = (int) patt_32;
> note:   created pattern stmt: patt_28 = patt_32 + patt_30;
> note:   over_widening pattern recognized: patt_27 = (int) patt_28;
> note:   extra pattern stmt: patt_28 = patt_32 + patt_30;
> note:   vect_is_simple_use: operand (unsigned short) _4, type of def: internal
> 
> and it correctly demotes the operation from int to unsigned short: it performs
> the addition as unsigned, and then sign extends the result to int.
> 
> The final IL the vectorizer builds is this:
> 
> note:   node 0x473d790 (max_nunits=16, refcnt=2) vector(2) double
> note:   op template: *_8 = _9;
> note:         stmt 0 *_8 = _9;
> note:         children 0x473d828
> note:   node 0x473d828 (max_nunits=16, refcnt=2) vector(2) double
> note:   op template: _9 = (double) _6;
> note:         stmt 0 _9 = (double) _6;
> note:         children 0x473d8c0
> note:   node 0x473d8c0 (max_nunits=16, refcnt=2) vector(4) int
> note:   op template: patt_27 = (int) patt_28;
> note:         stmt 0 patt_27 = (int) patt_28;
> note:         children 0x473d958
> note:   node 0x473d958 (max_nunits=16, refcnt=2) vector(8) unsigned short
> note:   op template: patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note:         stmt 0 patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note:         children 0x473d9f0 0x473da88
> note:   node 0x473d9f0 (max_nunits=16, refcnt=2) vector(16) unsigned char
> note:   op template: _4 = *_3;
> note:         stmt 0 _4 = *_3;
> note:         load permutation { 0 }
> note:   node (external) 0x473da88 (max_nunits=1, refcnt=1)
> note:         { k_14(D) }
> 
> We already relax the cast to int into a zero extend today.  However the final
> promotion is the FLOAT_EXPR.  There it sees a widening from int to long.  It
> assumes that this has to be a sign extend because of the signed input; it
> doesn't know that the signed input was created by a zero-extending operation.

OK, so it's like I thought then.  The fix should be to
vectorizable_conversion to consider using a zero-extend for the
conversion to the intermediate long type.  I'm not sure how far
we can use stmt-vinfo min_output_precision for such analysis,
maybe Richard can answer this.  But there's the bad (because it's
wrongly implemented for SLP) example of using range info
in supportable_indirect_convert_operation for this already.

> What my pattern does is make this explicit in the tree.
> 
> My first attempt was to update the code that does:
> 
> /app/example.c:2:22: note:   demoting int to unsigned short
> /app/example.c:2:22: note:   Splitting statement: _5 = (int) _4;
> /app/example.c:2:22: note:   into pattern statements: patt_32 = (unsigned 
> short) _4;
> /app/example.c:2:22: note:   and: patt_31 = (int) patt_32;
> 
> in vect_split_statement.  But even doing so, the problem is that it splits
> the range of the conversions, and so while this fixed some cases, code that
> uses the result never knows that the top bits are zero.
> 
> So it worked for some cases but missed plenty of others.

I think the pattern would have to be for the (double) _5 conversion
and we do not want to expose (double) (long) (unsigned) _5 at that
point because some targets (x86!) _can_ do SImode to DFmode float
conversions and don't require an intermediate widening at all
(vectorizable_conversion knows this).

Richard.

> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
> > 
> 
> No, (double)(int)x is basically (double)(long)(int)x.
> 
> And the pattern creates
> 
> (double)(long)(unsigned long)x.  Basically it just makes the conversions
> explicit and extends the zero extend as wide as possible.
>
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> 
> It does.  This is a pure integer sequence with the same problem:
> 
> void test8(unsigned char *x, long *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
> 
> And there is other code that introduces this; for instance, there is a cleanup
> specifically after multiplication widening that also tries to split types, and
> it will do this as well.
> 
> > 
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns.  Can you elaborate and point
> > out where this happens?  I don't think it covers the int -> double
> > case, does it?
> 
> Not this case on its own.  You have to do *some* operation in between.
> 
> Thanks,
> Tamar
> 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >   * tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> > >   * gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-10.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-12.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-14.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-16.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-6.c: Likewise.
> > >   * gcc.dg/vect/vect-over-widen-8.c: Likewise.
> > >   * gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> > >   * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> > >   * lib/target-supports.exp
> > >   (check_effective_target_vect_widen_mult_hi_to_si_pattern,
> > >   check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> > >   AArch64.
> > >   * gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > index
> > 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda9
> > 7bf2a0708ac091cb 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > @@ -52,4 +52,4 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 
> > > "slp2" {
> > target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { 
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 
> > > "slp2" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > index
> > b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df
> > 0a77a0a099ad6ef 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > @@ -52,5 +52,5 @@ int main (void)
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { 
> > > target
> > vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 
> > > "vect" {
> > target vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { 
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { 
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > index
> > f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d48820
> > 6cfef9e39fac2bfea 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-9.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > index
> > ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4
> > ed13206278c72925 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-11.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > index
> > dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef9778
> > 9726a1dc6ec8b1 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-13.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* = \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > index
> > 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a06
> > 3a1b0f47bdd543d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-15.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern:
> > detected} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > index
> > bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607
> > ef2d2f14a14eb32f 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > @@ -9,7 +9,7 @@
> > >
> > >  #include "vect-over-widen-5.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > index
> > 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1
> > a0b65173c4c83b19 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > @@ -12,7 +12,7 @@
> > >
> > >  #include "vect-over-widen-7.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } 
> > > */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > index
> > 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2
> > dc2b7ab2ac0deb674b 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { 
> > > target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { 
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 
> > > "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > index
> > 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9
> > c0eb31c987ffe9 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { 
> > > target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { 
> > > target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 
> > > "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f
> > 86c77406555b7726482
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > > +
> > > +void test6(unsigned char *x, double *y, int n) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] += x[i];
> > > +    }
> > > +}
> > > +
> > > +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] += k * x[i];
> > > +    }
> > > +}
> > > +
> > > +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] = k + x[i];
> > > +    }
> > > +}
> > > +
> > > +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] = k + x[i];
> > > +    }
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> > > +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip1} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip2} } } */
> > > +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> > > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> > > diff --git a/gcc/testsuite/lib/target-supports.exp 
> > > b/gcc/testsuite/lib/target-
> > supports.exp
> > > index
> > d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3
> > 034c0a828724 100644
> > > --- a/gcc/testsuite/lib/target-supports.exp
> > > +++ b/gcc/testsuite/lib/target-supports.exp
> > > @@ -8240,6 +8240,7 @@ proc
> > check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
> > >      return [check_cached_effective_target_indexed
> > vect_widen_mult_hi_to_si_pattern {
> > >        expr { [istarget powerpc*-*-*]
> > >        || [istarget ia64-*-*]
> > > +      || [istarget aarch64*-*-*]
> > >        || [istarget loongarch*-*-*]
> > >        || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >        || ([is-effective-target arm_neon]
> > > @@ -8259,6 +8260,7 @@ proc
> > check_effective_target_vect_widen_mult_si_to_di_pattern { } {
> > >        expr { [istarget ia64-*-*]
> > >        || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >        || [istarget loongarch*-*-*]
> > > +      || [istarget aarch64*-*-*]
> > >        || ([istarget s390*-*-*]
> > >            && [check_effective_target_s390_vx]) }}]
> > >  }
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index
> > 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef055
> > 6d25afda851f7f0 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info
> > *vinfo,
> > >    return pattern_stmt;
> > >  }
> > >
> > > +/* Function vect_recog_zero_extend_chain_pattern
> > > +
> > > +   Try to find the following pattern:
> > > +
> > > +     type x_t;
> > > +     TYPE a_T, b_T, c_T;
> > 
> > But a_T, b_T and c_T are different types - what types?
> > 
> > > +   loop:
> > > +     S1  a_T = (b_T)(c_T)x_t;
> > > +
> > > +   where type 'TYPE' is an integral type which has different size
> > > +   from 'type' and c_T is a zero extend or a sign extend on a value 
> > > whose top
> > > +   bit is known to be zero. a_T can be signed or unsigned.
> > > +
> > > +   Input:
> > > +
> > > +   * STMT_VINFO: The stmt from which the pattern search begins.
> > > +
> > > +   Output:
> > > +
> > > +   * TYPE_OUT: The type of the output of this pattern.
> > > +
> > > +   * Return value: A new stmt that will be used to replace the pattern.
> > > + This replaces multiple chained extensions with the longest possible
> > > + chain of zero extends and a final convert to the required sign.
> > > +
> > > + S1  a_T = (a_T)(unsigned a_T)x_t;  */
> > 
> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
> > 
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> > 
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns.  Can you elaborate and point
> > out where this happens?  I don't think it covers the int -> double
> > case, does it?
> > 
> > Thanks,
> > Richard.
> > 
> > > +
> > > +static gimple *
> > > +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> > > +                               stmt_vec_info stmt_vinfo, tree *type_out)
> > > +{
> > > +  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize
> > (stmt_vinfo));
> > > +
> > > +  if (!is_gimple_assign (last_stmt))
> > > +    return NULL;
> > > +
> > > +  tree_code code = gimple_assign_rhs_code (last_stmt);
> > > +  tree lhs = gimple_assign_lhs (last_stmt);
> > > +  tree rhs = gimple_assign_rhs1 (last_stmt);
> > > +  tree lhs_type = TREE_TYPE (lhs);
> > > +  tree rhs_type = TREE_TYPE (rhs);
> > > +
> > > +  if ((code != FLOAT_EXPR && code != NOP_EXPR)
> > > +      || TYPE_UNSIGNED (lhs_type)
> > > +      || TREE_CODE (rhs_type) != INTEGER_TYPE
> > > +      || TREE_CODE (rhs) != SSA_NAME
> > > +      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> > > +    return NULL;
> > > +
> > > +  /* Check to see if it's safe to extend the zero extend to the new type.
> > > +     In general this is safe if the rhs1 type is unsigned or if we know
> > > +     that the top bits are zero; this can happen due to all the widening
> > > +     operations we have.  For instance a widening addition will have top
> > > +     bits zero.  */
> > > +  if (!TYPE_UNSIGNED (rhs_type))
> > > +    {
> > > +      wide_int wcst = get_nonzero_bits (rhs);
> > > +      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> > > + return NULL;
> > > +    }
> > > +
> > > +  tree cvt_type = unsigned_type_for (lhs_type);
> > > +
> > > +  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> > > +  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> > > +    return NULL;
> > > +
> > > +  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> > > +  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> > > +    return NULL;
> > > +
> > > +  stmt_vec_info irhs;
> > > +
> > > +  gimple_ranger ranger;
> > > +
> > > +  /* Dig through any existing conversions to see if we can extend the 
> > > zero
> > > +     extend chain across multiple converts.  */
> > > +  while ((irhs = vect_get_internal_def (vinfo, rhs)))
> > > +    {
> > > +      gimple *g_irhs = STMT_VINFO_STMT (irhs);
> > > +      if (!is_gimple_assign (g_irhs)
> > > +   || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> > > + break;
> > > +
> > > +      /* See if we can consume the next conversion as well.  To do this 
> > > it's
> > > +  best to use Ranger as it can see through the intermediate IL that the
> > > +  vectorizer creates throughout pattern matching.  */
> > > +      int_range_max r;
> > > +      ranger.range_of_stmt (r, g_irhs);
> > > +      wide_int nz = r.get_nonzero_bits ();
> > > +      if (wi::neg_p (nz) || wi::clz (nz) == 0)
> > > + break;
> > > +
> > > +      rhs = gimple_assign_rhs1 (g_irhs);
> > > +    }
> > > +
> > > +  /* If the result is a no-op, or we've jumped over a truncate of sorts,
> > > +     or if nothing would change materially, just leave it alone.  */
> > > +  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> > > +      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> > > +    return NULL;
> > > +
> > > +  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", 
> > > last_stmt);
> > > +
> > > +  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> > > +  gimple *pattern_stmt = NULL;
> > > +  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> > > +  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> > > +
> > > +  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> > > +  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> > > +
> > > +  *type_out = out_vectype;
> > > +
> > > +  return pattern_stmt;
> > > +}
> > > +
> > >
> > >  /* Helper function of vect_recog_bool_pattern.  Called recursively, 
> > > return
> > >     true if bool VAR can and should be optimized that way.  Assume it 
> > > shouldn't
> > > @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] =
> > {
> > >    { vect_recog_widen_minus_pattern, "widen_minus" },
> > >    { vect_recog_widen_abd_pattern, "widen_abd" },
> > >    /* These must come after the double widening ones.  */
> > > +  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
> > >  };
> > >
> > >  /* Mark statements that are involved in a pattern.  */
> > >
> > >
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
