On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guent...@gmail.com> > wrote: > > > > On Mon, Aug 7, 2023 at 2:05?AM Prathamesh Kulkarni via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguent...@suse.de> wrote: > > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches > > > > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote: > > > > > > > > > > The following delays sinking of loads within the same > > > > > > > > > > innermost > > > > > > > > > > loop when it was unconditional before. That's a not > > > > > > > > > > uncommon > > > > > > > > > > issue preventing vectorization when masked loads are not > > > > > > > > > > available. > > > > > > > > > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > > > > > > > > > I have a followup patch improving sinking that without this > > > > > > > > > > would > > > > > > > > > > cause more of the problematic sinking - now that we have a > > > > > > > > > > second > > > > > > > > > > sink pass after loop opts this looks like a reasonable > > > > > > > > > > approach? > > > > > > > > > > > > > > > > > > > > OK? > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > PR tree-optimization/92335 > > > > > > > > > > * tree-ssa-sink.cc (select_best_block): Before loop > > > > > > > > > > optimizations avoid sinking unconditional loads/stores > > > > > > > > > > in innermost loops to conditional executed places. > > > > > > > > > > > > > > > > > > > > * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing. > > > > > > > > > > * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c, > > > > > > > > > > expect predictive commoning to happen instead of sinking. > > > > > > > > > > * gcc.dg/vect/pr65947-3.c: Adjust. > > > > > > > > > I think it's reasonable -- there's probably going to be cases > > > > > > > > > where it's not > > > > > > > > > great, but more often than not I think it's going to be a > > > > > > > > > reasonable > > > > > > > > > heuristic. > > > > > > > > > > > > > > > > > > If there is undesirable fallout, better to find it over the > > > > > > > > > coming months than > > > > > > > > > next spring. So I'd suggest we go forward now to give more > > > > > > > > > time to find any > > > > > > > > > pathological cases (if they exist). > > > > > > > > > > > > > > > > Agreed, I've pushed this now. > > > > > > > Hi Richard, > > > > > > > After this patch (committed in > > > > > > > 399c8dd44ff44f4b496223c7cc980651c4d6f6a0), > > > > > > > pr65947-7.c "failed" for aarch64-linux-gnu: > > > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP > > > > > > > VECTORIZED" > > > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects > > > > > > > scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > > > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { > > > > > > > target { > > > > > > > ! vect_fold_extract_last } } } } */ > > > > > > > > > > > > > > With your commit, condition_reduction in pr65947-7.c gets > > > > > > > vectorized > > > > > > > regardless of vect_fold_extract_last, > > > > > > > which gates the above test (which is an improvement, because the > > > > > > > function didn't get vectorized before the commit). > > > > > > > > > > > > > > The attached patch thus removes the gating on > > > > > > > vect_fold_extract_last, > > > > > > > and the test passes again. > > > > > > > OK to commit ? > > > > > > > > > > > > OK. > > > > > > > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one > > > > > critical target condition is missing. Can you figure out which? > > > > > > > > I see > > > > > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, > > > > type of def: reduction > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > note: vect_is_simple_use: vectype vector(4) int > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > missed: multiple types in double reduction or condition reduction or > > > > fold-left reduction. > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: > > > > missed: not vectorized: relevant phi not supported: last_19 = PHI > > > > <last_8(7), 108(15)> > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > missed: bad operation or unsupported loop bound. > > > Hi Richard, > > > Looking at the aarch64 vect dump, it seems the loop in > > > condition_reduction gets vectorized with V4HI mode > > > while fails for other modes in vectorizable_condition: > > > > > > if ((double_reduc || reduction_type != TREE_CODE_REDUCTION) > > > && ncopies > 1) > > > { > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > "multiple types in double reduction or condition > > > " > > > "reduction or fold-left reduction.\n"); > > > return false; > > > } > > > > > > From the dump: > > > foo.c:9:21: note: === vect_analyze_loop_operations === > > > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > > def: internal > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > > > <last_8(7), 108(15)>, type of def: reduction > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > > > > For V8HI, VF = 8, and vectype_in = vector(4) int. > > > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1, > > > and thus fails: > > > foo.c:9:21: missed: multiple types in double reduction or condition > > > reduction or fold-left reduction. > > > foo.c:4:1: missed: not vectorized: relevant phi not supported: > > > last_19 = PHI <last_8(7), 108(15)> > > > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds. > > > > > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode. > > > If I "force" the vectorizer to use V4HI mode, we get the following dump: > > > foo.c:9:21: note: === vect_analyze_loop_operations === > > > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > > def: internal > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > > > <last_8(7), 108(15)>, type of def: reduction > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > > foo.c:9:21: missed: multiple types in double reduction or condition > > > reduction or fold-left reduction. > > > > > > Not sure tho if this is the only reason for the test to fail to > > > vectorize on the target. > > > Will investigate in more details next week. > > > > The odd thing is that you say > > > > for (int i = 0; i < N; i++) > > { > > aval = a[i]; > > if (b[i] < min_v) > > last = aval; > > } > > > > fails to vectorize but > > > > for (int i = 0; i < N; i++) > > { > > if (b[i] < min_v) > > last = a[i]; > > } > > > > succeeds? The IL difference should be irrelevant for the reduction > Hi Richard, > Sorry for late response. > No this case containing a conditional load doesn't vectorize on aarch64 > either: > foo2.c:9:21: note: === analyze_loop_nest === > foo2.c:9:21: note: === vect_analyze_loop_form === > foo2.c:9:21: missed: not vectorized: control flow in loop. > foo2.c:9:21: missed: bad loop form. > > For this test: > for (int i = 0; i < N; i++) > { > aval = a[i]; > if (b[i] < min_v) > last = aval; > } > > IIUC sink pass made the load conditional preventing vectorization > (similar to above), > but your PR92335 fix delays the sinking of load before loop opts, and > thus gets vectorized. > Till vect pass, the dumps are similar for x86 and aarch64. > > vectorization: > > > > <bb 3> [local count: 1049367889]: > > # last_19 = PHI <last_8(7), 108(15)> > > # i_21 = PHI <i_17(7), 0(15)> > > # ivtmp_18 = PHI <ivtmp_10(7), 43(15)> > > _1 = (long unsigned int) i_21; > > _2 = _1 * 2; > > _3 = a_12(D) + _2; > > aval_13 = *_3; > > _5 = _1 * 4; > > _6 = b_14(D) + _5; > > _7 = *_6; > > last_16 = (int) aval_13; > > _9 = _7 < min_v_15(D); > > last_8 = _9 ? last_16 : last_19; > > i_17 = i_21 + 1; > > ivtmp_10 = ivtmp_18 - 1; > > if (ivtmp_10 != 0) > > goto <bb 7>; [97.68%] > > > > vs > > > > <bb 3> [local count: 1049367889]: > > # last_19 = PHI <last_9(7), 108(15)> > > # i_21 = PHI <i_17(7), 0(15)> > > # ivtmp_11 = PHI <ivtmp_10(7), 43(15)> > > _1 = (long unsigned int) i_21; > > _2 = _1 * 4; > > _3 = b_13(D) + _2; > > _4 = *_3; > > _5 = _4 < min_v_14(D); > > _6 = _1 * 2; > > _38 = _37 + _6; > > _7 = (short int *) _38; > > _8 = .MASK_LOAD (_7, 16B, _5); > > last_16 = (int) _8; > > last_9 = _5 ? last_16 : last_19; > > i_17 = i_21 + 1; > > ivtmp_10 = ivtmp_11 - 1; > > if (ivtmp_10 != 0) > > goto <bb 7>; [97.68%] > > > > maybe since the "mask" is used twice with the .MASK_LOAD > > we are not actually looking at the def (the comparison) and it's > > the comparison which would introduce the "multiple types"? > > > > That is, I wonder why not sinking the load, avoiding a conditional > > load, makes a difference to vectorizing the condition/extract last > > reduction. > IIUC, the issue is that the vector type used for reduction seems to be > different than > the vector type used for determining VF, and it passes the above check > in vectoriable_reduction, > only if VF matches the length of the vector type used for reduction. > > For V4HI mode on aarch64, it sets VF = 4 which matches with the length of > vector > type used for reduction (vector (4) int): > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > def: internal > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > <last_8(7), 108(15)>, type of def: reduction > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
so it doesn't use a fold_extract_last reduction but a regular condition reduction? How does it end up with a V4HI + V4SI combo here? Ah, so the "key" is that we end up using vect_last_16.23_74 = (vector(4) int) vect_aval_13.19_70; vect_last_8.25_77 = VEC_COND_EXPR <mask__9.24_76, vect_last_16.23_74, vect_last_19.16_67>; so we can promote V4HI to V4SI via direct conversion instead of via unpacking. The x86 backend has no such feat, while it can do zero_extend via punpcklwd doing a sign_extend requires a compare, unpacking the result with itself and then doing the punpcklwd ontop of that. Of course that's what we're doing when using vec_unpack_lo/hi_expr already. > > > > It doesn't seem to make a difference for x86. That said, the "fix" is > > probably sticking the correct target on the dump-check, it seems > > that vect_fold_extract_last is no longer correct here. > Um sorry, I did go thru various checks in target-supports.exp, but not > sure which one will be appropriate for this case, > and am stuck here :/ Could you please suggest how to proceed ? Maybe Richard S. knows the magic thing to test, he originally implemented the direct conversion support. I suggest to implement such dg-checks if they are not present (I can't find them), possibly quite specific to the modes involved (like we have other checks with _qi_to_hi suffixes, for float modes maybe just _float). Richard.