On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:

> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guent...@gmail.com> 
> wrote:
> >
> > On Mon, Aug 7, 2023 at 2:05?AM Prathamesh Kulkarni via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > > >
> > > > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > > > >
> > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > > > > >
> > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > > > > The following delays sinking of loads within the same 
> > > > > > > > > > innermost
> > > > > > > > > > loop when it was unconditional before.  That's a not 
> > > > > > > > > > uncommon
> > > > > > > > > > issue preventing vectorization when masked loads are not 
> > > > > > > > > > available.
> > > > > > > > > >
> > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > > > > >
> > > > > > > > > > I have a followup patch improving sinking that without this 
> > > > > > > > > > would
> > > > > > > > > > cause more of the problematic sinking - now that we have a 
> > > > > > > > > > second
> > > > > > > > > > sink pass after loop opts this looks like a reasonable 
> > > > > > > > > > approach?
> > > > > > > > > >
> > > > > > > > > > OK?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Richard.
> > > > > > > > > >
> > > > > > > > > >  PR tree-optimization/92335
> > > > > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > > > > >  in innermost loops to conditional executed places.
> > > > > > > > > >
> > > > > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > > > > I think it's reasonable -- there's probably going to be cases 
> > > > > > > > > where it's not
> > > > > > > > > great, but more often than not I think it's going to be a 
> > > > > > > > > reasonable
> > > > > > > > > heuristic.
> > > > > > > > >
> > > > > > > > > If there is undesirable fallout, better to find it over the 
> > > > > > > > > coming months than
> > > > > > > > > next spring.  So I'd suggest we go forward now to give more 
> > > > > > > > > time to find any
> > > > > > > > > pathological cases (if they exist).
> > > > > > > >
> > > > > > > > Agreed, I've pushed this now.
> > > > > > > Hi Richard,
> > > > > > > After this patch (committed in 
> > > > > > > 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > > > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP 
> > > > > > > VECTORIZED"
> > > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > > > >
> > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { 
> > > > > > > target {
> > > > > > > ! vect_fold_extract_last } } } } */
> > > > > > >
> > > > > > > With your commit, condition_reduction in pr65947-7.c gets 
> > > > > > > vectorized
> > > > > > > regardless of vect_fold_extract_last,
> > > > > > > which gates the above test (which is an improvement, because the
> > > > > > > function didn't get vectorized before the commit).
> > > > > > >
> > > > > > > The attached patch thus removes the gating on 
> > > > > > > vect_fold_extract_last,
> > > > > > > and the test passes again.
> > > > > > > OK to commit ?
> > > > > >
> > > > > > OK.
> > > > >
> > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > > > > critical target condition is missing.  Can you figure out which?
> > > >
> > > > I see
> > > >
> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > > note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
> > > > type of def: reduction
> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > > note:   vect_is_simple_use: vectype vector(4) int
> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > > missed:   multiple types in double reduction or condition reduction or
> > > > fold-left reduction.
> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> > > > missed:   not vectorized: relevant phi not supported: last_19 = PHI
> > > > <last_8(7), 108(15)>
> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > > > missed:  bad operation or unsupported loop bound.
> > > Hi Richard,
> > > Looking at the aarch64 vect dump, it seems the loop in
> > > condition_reduction gets vectorized with V4HI mode
> > > while fails for other modes in vectorizable_condition:
> > >
> > >   if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
> > >       && ncopies > 1)
> > >     {
> > >       if (dump_enabled_p ())
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >                          "multiple types in double reduction or condition 
> > > "
> > >                          "reduction or fold-left reduction.\n");
> > >       return false;
> > >     }
> > >
> > > From the dump:
> > > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > > def: internal
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > > <last_8(7), 108(15)>, type of def: reduction
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > >
> > > For V8HI, VF = 8, and vectype_in = vector(4) int.
> > > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
> > > and thus fails:
> > > foo.c:9:21: missed:   multiple types in double reduction or condition
> > > reduction or fold-left reduction.
> > > foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> > > last_19 = PHI <last_8(7), 108(15)>
> > > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
> > >
> > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> > > If I "force" the vectorizer to use V4HI mode, we get the following dump:
> > > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > > def: internal
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > > <last_8(7), 108(15)>, type of def: reduction
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > > foo.c:9:21: missed:   multiple types in double reduction or condition
> > > reduction or fold-left reduction.
> > >
> > > Not sure tho if this is the only reason for the test to fail to
> > > vectorize on the target.
> > > Will investigate in more details next week.
> >
> > The odd thing is that you say
> >
> >   for (int i = 0; i < N; i++)
> >     {
> >       aval = a[i];
> >       if (b[i] < min_v)
> >         last = aval;
> >     }
> >
> > fails to vectorize but
> >
> >   for (int i = 0; i < N; i++)
> >     {
> >       if (b[i] < min_v)
> >         last = a[i];
> >     }
> >
> > succeeds?  The IL difference should be irrelevant for the reduction
> Hi Richard,
> Sorry for late response.
> No this case containing a conditional load doesn't vectorize on aarch64 
> either:
> foo2.c:9:21: note:  === analyze_loop_nest ===
> foo2.c:9:21: note:   === vect_analyze_loop_form ===
> foo2.c:9:21: missed:   not vectorized: control flow in loop.
> foo2.c:9:21: missed:  bad loop form.
> 
> For this test:
> for (int i = 0; i < N; i++)
>     {
>       aval = a[i];
>       if (b[i] < min_v)
>         last = aval;
>     }
> 
> IIUC sink pass made the load conditional preventing vectorization
> (similar to above),
> but your PR92335 fix delays the sinking of load before loop opts, and
> thus gets vectorized.
> Till vect pass, the dumps are similar for x86 and aarch64.
> > vectorization:
> >
> >   <bb 3> [local count: 1049367889]:
> >   # last_19 = PHI <last_8(7), 108(15)>
> >   # i_21 = PHI <i_17(7), 0(15)>
> >   # ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
> >   _1 = (long unsigned int) i_21;
> >   _2 = _1 * 2;
> >   _3 = a_12(D) + _2;
> >   aval_13 = *_3;
> >   _5 = _1 * 4;
> >   _6 = b_14(D) + _5;
> >   _7 = *_6;
> >   last_16 = (int) aval_13;
> >   _9 = _7 < min_v_15(D);
> >   last_8 = _9 ? last_16 : last_19;
> >   i_17 = i_21 + 1;
> >   ivtmp_10 = ivtmp_18 - 1;
> >   if (ivtmp_10 != 0)
> >     goto <bb 7>; [97.68%]
> >
> > vs
> >
> >   <bb 3> [local count: 1049367889]:
> >   # last_19 = PHI <last_9(7), 108(15)>
> >   # i_21 = PHI <i_17(7), 0(15)>
> >   # ivtmp_11 = PHI <ivtmp_10(7), 43(15)>
> >   _1 = (long unsigned int) i_21;
> >   _2 = _1 * 4;
> >   _3 = b_13(D) + _2;
> >   _4 = *_3;
> >   _5 = _4 < min_v_14(D);
> >   _6 = _1 * 2;
> >   _38 = _37 + _6;
> >   _7 = (short int *) _38;
> >   _8 = .MASK_LOAD (_7, 16B, _5);
> >   last_16 = (int) _8;
> >   last_9 = _5 ? last_16 : last_19;
> >   i_17 = i_21 + 1;
> >   ivtmp_10 = ivtmp_11 - 1;
> >   if (ivtmp_10 != 0)
> >     goto <bb 7>; [97.68%]
> >
> > maybe since the "mask" is used twice with the .MASK_LOAD
> > we are not actually looking at the def (the comparison) and it's
> > the comparison which would introduce the "multiple types"?
> >
> > That is, I wonder why not sinking the load, avoiding a conditional
> > load, makes a difference to vectorizing the condition/extract last 
> > reduction.
> IIUC, the issue is that the vector type used for reduction seems to be
> different than
> the vector type used for determining VF, and it passes the above check
> in vectoriable_reduction,
> only if VF matches the length of the vector type used for reduction.
> 
> For V4HI mode on aarch64, it sets VF = 4 which matches with the length of 
> vector
> type used for reduction (vector (4) int):
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int

so it doesn't use a fold_extract_last reduction but a regular condition
reduction?  How does it end up with a V4HI + V4SI combo here?
Ah, so the "key" is that we end up using

  vect_last_16.23_74 = (vector(4) int) vect_aval_13.19_70;
  vect_last_8.25_77 = VEC_COND_EXPR <mask__9.24_76, vect_last_16.23_74, 
vect_last_19.16_67>;

so we can promote V4HI to V4SI via direct conversion instead of
via unpacking.  The x86 backend has no such feat, while it can
do zero_extend via punpcklwd doing a sign_extend requires a
compare, unpacking the result with itself and then doing the
punpcklwd ontop of that.  Of course that's what we're doing
when using vec_unpack_lo/hi_expr already.

> >
> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > probably sticking the correct target on the dump-check, it seems
> > that vect_fold_extract_last is no longer correct here.
> Um sorry, I did go thru various checks in target-supports.exp, but not
> sure which one will be appropriate for this case,
> and am stuck here :/ Could you please suggest how to proceed ?

Maybe Richard S. knows the magic thing to test, he originally
implemented the direct conversion support.  I suggest to implement
such dg-checks if they are not present (I can't find them),
possibly quite specific to the modes involved (like we have
other checks with _qi_to_hi suffixes, for float modes maybe
just _float).

Richard.

Reply via email to