On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, October 15, 2024 1:42 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > Subject: Re: [PATCH 4/4]middle-end: create the longest possible zero
> > extend chain after overwidening
> >
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > Consider loops such as:
> > >
> > > void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > >     for(int i = 0; i < n; i++) {
> > >         y[i] = k + x[i];
> > >     }
> > > }
> > >
> > > where today we generate:
> > >
> > > .L5:
> > >         ldr     q29, [x5], 16
> > >         add     x4, x4, 128
> > >         uaddl   v1.8h, v29.8b, v30.8b
> > >         uaddl2  v29.8h, v29.16b, v30.16b
> > >         zip1    v2.8h, v1.8h, v31.8h
> > >         zip1    v0.8h, v29.8h, v31.8h
> > >         zip2    v1.8h, v1.8h, v31.8h
> > >         zip2    v29.8h, v29.8h, v31.8h
> > >         sxtl    v25.2d, v2.2s
> > >         sxtl    v28.2d, v0.2s
> > >         sxtl    v27.2d, v1.2s
> > >         sxtl    v26.2d, v29.2s
> > >         sxtl2   v2.2d, v2.4s
> > >         sxtl2   v0.2d, v0.4s
> > >         sxtl2   v1.2d, v1.4s
> > >         sxtl2   v29.2d, v29.4s
> > >         stp     q25, q2, [x4, -128]
> > >         stp     q27, q1, [x4, -96]
> > >         stp     q28, q0, [x4, -64]
> > >         stp     q26, q29, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > Note how the zero extend from short to long is, halfway through the
> > > chain, transformed into a sign extend.  There are two problems with
> > > this:
> > >
> > > 1. sign extends are typically slower than zero extends on many
> > >    uArches.
> > > 2. it prevents vectorizable_conversion from attempting to do a
> > >    single-step promotion.
> > >
> > > These sign extends happen due to the various range reduction
> > > optimizations and patterns we have, such as multiplication widening,
> > > etc.
> > >
> > > My first attempt to fix this was just updating the patterns so that,
> > > when the original source is a zero extend, we do not add the
> > > intermediate sign extend.
> > >
> > > However this behavior happens in many other places, and as new
> > > patterns get added the problem can be re-introduced.
> > >
> > > Instead I have added a new pattern,
> > > vect_recog_zero_extend_chain_pattern, that attempts to simplify and
> > > extend an existing zero extend over multiple conversion statements.
> > >
> > > As an example, T3 a = (T3)(signed T2)(unsigned T1)x, where
> > > bitsize T3 > T2 > T1, gets transformed into
> > > T3 a = (T3)(signed T2)(unsigned T2)x.
> > >
> > > The final cast to signed is kept so the types in the tree still
> > > match.  It will be correctly elided later on.
> > >
> > > This representation is optimal as vectorizable_conversion is already
> > > able to decompose a long promotion into multiple steps if the target
> > > does not support it in a single step.  More importantly it allows us
> > > to do proper costing and to support conversions like (double)x,
> > > where bitsize(x) < int, in an efficient manner.
> > >
> > > To do this I have used Ranger's on-demand analysis to check whether
> > > an extension can be removed and turned into a zero extend.  The
> > > reason for this is that the vectorizer introduces several patterns
> > > that are not in the IL, but also lots of widening IFNs, for which
> > > handling in a switch wouldn't be very future proof.
> > >
> > > I did try to do it without Ranger, but Ranger had three benefits:
> > >
> > > 1. It simplified the handling of the IL changes the vectorizer
> > >    introduces, and makes it future proof.
> > > 2. Ranger has the advantage of doing the transformation in cases
> > >    where it knows that the top bits of the value are zero, which we
> > >    wouldn't be able to tell by looking purely at statements.
> > > 3. Ranger simplified the handling of corner cases.  Without it the
> > >    handling was quite complex and I wasn't very confident in its
> > >    correctness.
> > >
> > > So I think Ranger is the right way to go here...  With these changes
> > > the above now generates:
> > >
> > > .L5:
> > >         add     x4, x4, 128
> > >         ldr     q26, [x5], 16
> > >         uaddl   v2.8h, v26.8b, v31.8b
> > >         uaddl2  v26.8h, v26.16b, v31.16b
> > >         tbl     v4.16b, {v2.16b}, v30.16b
> > >         tbl     v3.16b, {v2.16b}, v29.16b
> > >         tbl     v24.16b, {v2.16b}, v28.16b
> > >         tbl     v1.16b, {v26.16b}, v30.16b
> > >         tbl     v0.16b, {v26.16b}, v29.16b
> > >         tbl     v25.16b, {v26.16b}, v28.16b
> > >         tbl     v2.16b, {v2.16b}, v27.16b
> > >         tbl     v26.16b, {v26.16b}, v27.16b
> > >         stp     q4, q3, [x4, -128]
> > >         stp     q1, q0, [x4, -64]
> > >         stp     q24, q2, [x4, -96]
> > >         stp     q25, q26, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > I have also seen similar improvements in codegen on Arm and x86_64,
> > > especially with AVX512.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu -m32, -m64 with no
> > > issues.
> > >
> > > Hopefully OK for master?
> >
> > Hohumm.  So I looked at one of the examples and I don't see any
> > sign-extends in the IL we vectorize.  So your pattern is about
> > changing int -> double to unsigned int -> double, but only so a
> > required intermediate int -> long conversion is done as a
> > zero-extend?  IMO this doesn't belong in patterns but in
> > vectorizable_conversion, specifically the step determining the
> > intermediate types.
>
> There is; it's not in the C but created by the vectorizer.
> The way I saw it was that vectorizable_conversion should be given the
> choice of what it wants to do, whether it wants to do it in one
> operation or multiple.
>
> If the goal was to just get rid of the final zero extend, yes, I would
> agree.
>
> But that's just part of the goal; the other is to have the zero extend
> explicitly exposed to vectorizable_conversion.  That way the patch that
> uses TBL actually sees the long multi-step conversion.
>
> My worry was that if this is done in vectorizable_conversion, while I
> can walk the IL, we'd cost the intermediate casts.  On AArch64 the cost
> model takes into account the throughput of sequences, not just
> latencies, and the TBLs have better throughput.  So for costing you
> really do want to see the full thing and not cost the intermediate
> conversions.
>
> This is why I used a pattern, since the IL is actually changed from the
> input.  But see below...
>
> > I don't quite understand what scalar pattern IL you feed to
> > the vectorizer in the end, a few comments - also to this effect,
> > below.
>
> Sure, I'll answer this and the question below in one go.
>
> Let's pick a simple example:
>
> void test8(unsigned char *x, double *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
>
> In GIMPLE this generates:
>
>   _4 = *_3;
>   _5 = (int) _4;
>   _6 = _5 + _29;
>   _9 = (double) _6;
>
> i.e. the unsigned char is widened to int, added to k as an int, and
> then converted to a double.
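>
> (As an aside: a minimal, self-contained scalar sketch - mine, not part
> of the patch - of why rewriting the extend chain is safe here.  Both
> operands are unsigned chars, so the int intermediate is provably in
> [0, 510]; its sign bit is zero and sign- and zero-extension agree:)
>
>   #include <assert.h>
>
>   int main (void)
>   {
>     for (unsigned x = 0; x < 256; x++)
>       for (unsigned k = 0; k < 256; k++)
>         {
>           /* The scalar computation from test8, done in int.  */
>           int v = (int) (unsigned char) k + (int) (unsigned char) x;
>           /* v is in [0, 510], so sign extending and zero extending it
>              to long (and then converting to double) give the same
>              result.  */
>           long sext = (long) v;
>           long zext = (long) (unsigned long) (unsigned int) v;
>           assert (sext == zext);
>           assert ((double) sext == (double) zext);
>         }
>     return 0;
>   }
>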
> When we start vectorizing, overwidening detection runs:
>
>   note: vect_recog_over_widening_pattern: detected: _6 = _5 + _29;
>   note: demoting int to unsigned short
>   note: Splitting statement: _5 = (int) _4;
>   note: into pattern statements: patt_32 = (unsigned short) _4;
>   note: and: patt_31 = (int) patt_32;
>   note: created pattern stmt: patt_28 = patt_32 + patt_30;
>   note: over_widening pattern recognized: patt_27 = (int) patt_28;
>   note: extra pattern stmt: patt_28 = patt_32 + patt_30;
>   note: vect_is_simple_use: operand (unsigned short) _4, type of def: internal
>
> and it correctly demotes the operation from int to unsigned short,
> performs it as unsigned, and then sign extends the result to int.
>
> The final IL the vectorizer builds is this:
>
>   note: node 0x473d790 (max_nunits=16, refcnt=2) vector(2) double
>   note: op template: *_8 = _9;
>   note:   stmt 0 *_8 = _9;
>   note:   children 0x473d828
>   note: node 0x473d828 (max_nunits=16, refcnt=2) vector(2) double
>   note: op template: _9 = (double) _6;
>   note:   stmt 0 _9 = (double) _6;
>   note:   children 0x473d8c0
>   note: node 0x473d8c0 (max_nunits=16, refcnt=2) vector(4) int
>   note: op template: patt_27 = (int) patt_28;
>   note:   stmt 0 patt_27 = (int) patt_28;
>   note:   children 0x473d958
>   note: node 0x473d958 (max_nunits=16, refcnt=2) vector(8) unsigned short
>   note: op template: patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
>   note:   stmt 0 patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
>   note:   children 0x473d9f0 0x473da88
>   note: node 0x473d9f0 (max_nunits=16, refcnt=2) vector(16) unsigned char
>   note: op template: _4 = *_3;
>   note:   stmt 0 _4 = *_3;
>   note:   load permutation { 0 }
>   note: node (external) 0x473da88 (max_nunits=1, refcnt=1)
>   note:   { k_14(D) }
>
> Today we already relax the cast to int into a zero extend.  However the
> final promotion is the FLOAT_EXPR.  There it sees a widening from int
> to long and assumes that this has to be a sign extend because of the
> signed input; it doesn't know that the signed input was created by a
> zero-extending operation.
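>
> (For illustration: a small scalar sketch of the safety test the pattern
> relies on - the function name and types are mine, not from the patch;
> it mirrors the wi::neg_p/wi::clz check on Ranger's nonzero-bits mask:)
>
>   #include <stdint.h>
>
>   /* NONZERO is a mask of the bits that may possibly be set in a value
>      of precision PREC; a bit clear in NONZERO is provably zero.  A
>      sign extend can be turned into a zero extend iff the sign bit is
>      provably zero, i.e. the mask's top bit is clear.  */
>   static int
>   sext_is_zext (uint64_t nonzero, int prec)
>   {
>     return (nonzero >> (prec - 1)) == 0;
>   }
>
> For the example above, patt_28 = .VEC_WIDEN_PLUS of two unsigned chars
> has nonzero bits 0x1ff in a 16-bit type, so sext_is_zext (0x1ff, 16)
> holds and both the cast to int and the later int -> long widening may
> be done as zero extends.
>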
OK, so it's like I thought then.  The fix should be in
vectorizable_conversion, to consider using a zero-extend for the
conversion to the intermediate long type.  I'm not sure how far we can
use stmt-vinfo min_output_precision for such analysis, maybe Richard can
answer this.  But there's the bad (because it's wrongly implemented for
SLP) example of using range info in
supportable_indirect_convert_operation for this already.

> What my pattern does is make this explicit in the tree.
>
> My first attempt was to update the code that does:
>
>   /app/example.c:2:22: note: demoting int to unsigned short
>   /app/example.c:2:22: note: Splitting statement: _5 = (int) _4;
>   /app/example.c:2:22: note: into pattern statements: patt_32 = (unsigned short) _4;
>   /app/example.c:2:22: note: and: patt_31 = (int) patt_32;
>
> in vect_split_statement.  But even doing so, the problem is that it
> splits the range of the conversions.  And so while this fixed some
> cases, code that uses the result never knows that the top bits are
> zero.
>
> So it worked for some cases but missed plenty of others.

I think the pattern would have to be for the (double) _5 conversion and
we do not want to expose (double) (long) (unsigned) _5 at that point
because some targets (x86!) _can_ do SImode to DFmode float conversions
and don't require an intermediate widening at all
(vectorizable_conversion knows this).

Richard.

> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
>
> No, (double)(int)x is basically (double)(long)(int)x, and the pattern
> creates (double)(long)(unsigned long)x.  Basically it just makes the
> conversions explicit and extends the zero extend as wide as possible.
>
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
>
> It does.  This is a pure integer sequence with the same problem:
>
> void test8(unsigned char *x, long *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
>
> And there is other code that introduces this; for instance there's a
> cleanup after multiplication widening specifically that also tries to
> split types, and it will do this as well.
>
> > You say you found the "issue" to be exposed in several (or found in
> > one, suspected in several) existing patterns.  Can you elaborate and
> > point out where this happens?  I don't think it covers the
> > int -> double case, does it?
>
> Not this case on its own.  You have to do *some* operation in between.
>
> Thanks,
> Tamar
>
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >     * tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     * gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> > >     * gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-10.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-12.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-14.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-16.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-6.c: Likewise.
> > >     * gcc.dg/vect/vect-over-widen-8.c: Likewise.
> > >     * gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> > >     * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> > >     * lib/target-supports.exp
> > >     (check_effective_target_vect_widen_mult_hi_to_si_pattern,
> > >     check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> > >     AArch64.
> > >     * gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > index 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda97bf2a0708ac091cb 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > @@ -52,4 +52,4 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > index b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df0a77a0a099ad6ef 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > @@ -52,5 +52,5 @@ int main (void)
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > index f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d488206cfef9e39fac2bfea 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-9.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > index ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4ed13206278c72925 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-11.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > index dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef97789726a1dc6ec8b1 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-13.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > index 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a063a1b0f47bdd543d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-15.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern: detected} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > index bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607ef2d2f14a14eb32f 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > @@ -9,7 +9,7 @@
> > >
> > >  #include "vect-over-widen-5.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > index 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1a0b65173c4c83b19 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > @@ -12,7 +12,7 @@
> > >
> > >  #include "vect-over-widen-7.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > index 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2dc2b7ab2ac0deb674b 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > index 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9c0eb31c987ffe9 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f86c77406555b7726482
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > > +
> > > +void test6(unsigned char *x, double *y, int n) {
> > > +  for(int i = 0; i < (n & -8); i++) {
> > > +    y[i] += x[i];
> > > +  }
> > > +}
> > > +
> > > +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> > > +  for(int i = 0; i < (n & -8); i++) {
> > > +    y[i] += k * x[i];
> > > +  }
> > > +}
> > > +
> > > +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> > > +  for(int i = 0; i < (n & -8); i++) {
> > > +    y[i] = k + x[i];
> > > +  }
> > > +}
> > > +
> > > +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > > +  for(int i = 0; i < (n & -8); i++) {
> > > +    y[i] = k + x[i];
> > > +  }
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> > > +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip1} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip2} } } */
> > > +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> > > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> > > diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> > > index d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3034c0a828724 100644
> > > --- a/gcc/testsuite/lib/target-supports.exp
> > > +++ b/gcc/testsuite/lib/target-supports.exp
> > > @@ -8240,6 +8240,7 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
> > >      return [check_cached_effective_target_indexed vect_widen_mult_hi_to_si_pattern {
> > >      expr { [istarget powerpc*-*-*]
> > >             || [istarget ia64-*-*]
> > > +           || [istarget aarch64*-*-*]
> > >             || [istarget loongarch*-*-*]
> > >             || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >             || ([is-effective-target arm_neon]
> > > @@ -8259,6 +8260,7 @@ proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
> > >      expr { [istarget ia64-*-*]
> > >             || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >             || [istarget loongarch*-*-*]
> > > +           || [istarget aarch64*-*-*]
> > >             || ([istarget s390*-*-*]
> > >                 && [check_effective_target_s390_vx]) }}]
> > >  }
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef0556d25afda851f7f0 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info *vinfo,
> > >    return pattern_stmt;
> > >  }
> > >
> > > +/* Function vect_recog_zero_extend_chain_pattern
> > > +
> > > +   Try to find the following pattern:
> > > +
> > > +   type x_t;
> > > +   TYPE a_T, b_T, c_T;
> >
> > But a_T, b_T and c_T are different types - what types?
> >
> > > +   loop:
> > > +     S1  a_T = (b_T)(c_T)x_t;
> > > +
> > > +   where type 'TYPE' is an integral type which has different size
> > > +   from 'type' and c_T is a zero extend or a sign extend on a value
> > > +   whose top bit is known to be zero.  a_T can be signed or unsigned.
> > > +
> > > +   Input:
> > > +
> > > +   * STMT_VINFO: The stmt from which the pattern search begins.
> > > +
> > > +   Output:
> > > +
> > > +   * TYPE_OUT: The type of the output of this pattern.
> > > +
> > > +   * Return value: A new stmt that will be used to replace the pattern.
> > > +     This replaces multiple chained extensions with the longest possible
> > > +     chain or zero extends and a final convert to the required sign.
> > > +
> > > +     S1  a_T = (a_T)(unsigned a_T)x_t;  */
> >
> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
> >
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> >
> > You say you found the "issue" to be exposed in several (or found in
> > one, suspected in several) existing patterns.  Can you elaborate and
> > point out where this happens?  I don't think it covers the
> > int -> double case, does it?
> >
> > Thanks,
> > Richard.
> >
> > > +
> > > +static gimple *
> > > +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> > > +                                      stmt_vec_info stmt_vinfo, tree *type_out)
> > > +{
> > > +  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize (stmt_vinfo));
> > > +
> > > +  if (!is_gimple_assign (last_stmt))
> > > +    return NULL;
> > > +
> > > +  tree_code code = gimple_assign_rhs_code (last_stmt);
> > > +  tree lhs = gimple_assign_lhs (last_stmt);
> > > +  tree rhs = gimple_assign_rhs1 (last_stmt);
> > > +  tree lhs_type = TREE_TYPE (lhs);
> > > +  tree rhs_type = TREE_TYPE (rhs);
> > > +
> > > +  if ((code != FLOAT_EXPR && code != NOP_EXPR)
> > > +      || TYPE_UNSIGNED (lhs_type)
> > > +      || TREE_CODE (rhs_type) != INTEGER_TYPE
> > > +      || TREE_CODE (rhs) != SSA_NAME
> > > +      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> > > +    return NULL;
> > > +
> > > +  /* Check to see if it's safe to extend the zero extend to the new type.
> > > +     In general this is safe if the rhs1 type is unsigned or if we know that
> > > +     the top bits are zero, this can happen due to all the widening operations
> > > +     we have.  For instance a widening addition will have top bits zero.  */
> > > +  if (!TYPE_UNSIGNED (rhs_type))
> > > +    {
> > > +      wide_int wcst = get_nonzero_bits (rhs);
> > > +      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> > > +        return NULL;
> > > +    }
> > > +
> > > +  tree cvt_type = unsigned_type_for (lhs_type);
> > > +
> > > +  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> > > +  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> > > +    return NULL;
> > > +
> > > +  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> > > +  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> > > +    return NULL;
> > > +
> > > +  stmt_vec_info irhs;
> > > +
> > > +  gimple_ranger ranger;
> > > +
> > > +  /* Dig through any existing conversions to see if we can extend the zero
> > > +     extend chain across multiple converts.  */
> > > +  while ((irhs = vect_get_internal_def (vinfo, rhs)))
> > > +    {
> > > +      gimple *g_irhs = STMT_VINFO_STMT (irhs);
> > > +      if (!is_gimple_assign (g_irhs)
> > > +          || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> > > +        break;
> > > +
> > > +      /* See if we can consume the next conversion as well.  To do this it's
> > > +         best to use Ranger as it can see through the intermediate IL that the
> > > +         vectorizer creates throughout pattern matching.  */
> > > +      int_range_max r;
> > > +      ranger.range_of_stmt (r, g_irhs);
> > > +      wide_int nz = r.get_nonzero_bits ();
> > > +      if (wi::neg_p (nz) || wi::clz (nz) == 0)
> > > +        break;
> > > +
> > > +      rhs = gimple_assign_rhs1 (g_irhs);
> > > +    }
> > > +
> > > +  /* If the result is a no-op, or we've jumped over a truncate of sort, or if
> > > +     nothing would change materially just leave it alone.  */
> > > +  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> > > +      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> > > +    return NULL;
> > > +
> > > +  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", last_stmt);
> > > +
> > > +  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> > > +  gimple *pattern_stmt = NULL;
> > > +  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> > > +  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> > > +
> > > +  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> > > +  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> > > +
> > > +  *type_out = out_vectype;
> > > +
> > > +  return pattern_stmt;
> > > +}
> > > +
> > >
> > >  /* Helper function of vect_recog_bool_pattern.  Called recursively, return
> > >     true if bool VAR can and should be optimized that way.  Assume it shouldn't
> > > @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
> > >    { vect_recog_widen_minus_pattern, "widen_minus" },
> > >    { vect_recog_widen_abd_pattern, "widen_abd" },
> > >    /* These must come after the double widening ones.  */
> > > +  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
> > >  };
> > >
> > >  /* Mark statements that are involved in a pattern.  */
> > >
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)