On Wed, 12 Jun 2024, Richard Biener wrote:

> On Tue, 11 Jun 2024, Richard Sandiford wrote:
> 
> > Don't think it makes any difference, but:
> > 
> > Richard Biener <rguent...@suse.de> writes:
> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >              access excess elements.
> > >              ???  Enhancements include peeling multiple iterations
> > >              or using masked loads with a static mask.  */
> > > -      || (group_size * cvf) % cnunits + group_size - gap < cnunits))
> > > +      || ((group_size * cvf) % cnunits + group_size - gap < cnunits
> > > +          /* But peeling a single scalar iteration is enough if
> > > +             we can use the next power-of-two sized partial
> > > +             access.  */
> > > +          && ((cremain = (group_size * cvf - gap) % cnunits), true
> > 
> > ...this might be less surprising as:
> > 
> >           && ((cremain = (group_size * cvf - gap) % cnunits, true)
> > 
> > in terms of how the &&s line up.
> 
> Yeah - I'll fix before pushing.

The aarch64 CI shows that a few testcases no longer use SVE
(gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling for gaps
is deemed insufficient.  Formerly we had

  if (loop_vinfo
      && *memory_access_type == VMAT_CONTIGUOUS
      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
      && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                      nunits))
    {
      unsigned HOST_WIDE_INT cnunits, cvf;
      if (!can_overrun_p
          || !nunits.is_constant (&cnunits)
          || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
          /* Peeling for gaps assumes that a single scalar iteration
             is enough to make sure the last vector iteration doesn't
             access excess elements.
             ???  Enhancements include peeling multiple iterations
             or using masked loads with a static mask.  */
          || (group_size * cvf) % cnunits + group_size - gap < cnunits)
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "peeling for gaps insufficient for "
                             "access\n");

and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
is true, so we never actually checked whether peeling one iteration is
sufficient.  After the refactoring the outer check merely indicates
there's overrun (which there is already, because gap != 0).  That is,
we never verified, for the "regular" gap case, whether peeling a single
iteration is sufficient - but now of course we run into the inner check,
which always triggers if the earlier checks didn't manage to set
overrun_p to false.

For slp_perm_8.c we have a group_size of two, nunits is {16, 16},
VF is {8, 8} and gap is one.  Since we know multiple_p holds,
(group_size * cvf) % cnunits is zero, so what remains is
group_size - gap < nunits, and 1 is probably always less than {16, 16}.

The new logic I added in the later patch - that peeling a single
iteration is OK when we can use a smaller access, rounded up to a
power-of-two size - is

          || ((group_size * cvf) % cnunits + group_size - gap < cnunits
              /* But peeling a single scalar iteration is enough if
                 we can use the next power-of-two sized partial
                 access.  */
              && (cremain = (group_size * cvf - gap) % cnunits, true)
              && (cpart_size = (1 << ceil_log2 (cremain))) != cnunits
              && vector_vector_composition_type
                   (vectype, cnunits / cpart_size,
                    &half_vtype) == NULL_TREE)))

Again, knowing the multiple, we know cremain is nunits - gap, and with
gap == 1 rounding this size up yields nunits, so the existing peeling
is deemed OK.

Something is inconsistent here, and the pre-existing

  (group_size * cvf) % cnunits + group_size - gap < cnunits

check looks suspicious as a general check.  group_size * cvf - gap
should be the number of elements we can access without touching excess
elements.  Peeling a single iteration makes sure
group_size * cvf + group_size - gap elements are accessed (that's
group_size * (cvf + 1) - gap).  The excess elements touched in the
vector loop are

  cnunits - (group_size * cvf - gap) % cnunits

and I think that number needs to be less than or equal to group_size,
so the correct "peeling is insufficient" check should be

  (group_size * cvf - gap) % cnunits + group_size < cnunits

For the SVE case that's (nunits - 1) + 2 < nunits, which should
simplify to false.
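A standalone sanity check of the arithmetic above (a sketch, not
vectorizer code; plain unsigned ints stand in for the poly-int values,
using the fixed-length instantiation of the slp_perm_8.c case with
nunits == 16, VF == 8, group_size == 2, gap == 1):

  /* Compare the pre-existing and the corrected "peeling for gaps is
     insufficient" conditions on the slp_perm_8.c numbers.  */
  #include <stdio.h>

  int
  main ()
  {
    unsigned cnunits = 16, cvf = 8, group_size = 2, gap = 1;

    /* Pre-existing check: true means peeling is deemed insufficient.  */
    int old_check
      = (group_size * cvf) % cnunits + group_size - gap < cnunits;

    /* Number of elements accessible without touching excess elements.  */
    unsigned accessed = group_size * cvf - gap;
    /* Excess elements the last vector iteration would touch.  */
    unsigned excess = cnunits - accessed % cnunits;
    /* Peeling one scalar iteration covers group_size more elements, so
       it is insufficient only when excess > group_size, i.e.  */
    int new_check = accessed % cnunits + group_size < cnunits;

    printf ("old: %d  excess: %u  new: %d\n", old_check, excess, new_check);
    /* Prints old: 1  excess: 1  new: 0 - the old check wrongly rejects
       peeling although only one excess element is touched and a peeled
       iteration re-covers group_size == 2 elements.  */
    return 0;
  }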
Now the question is how to formulate this with poly-ints in a way that
works out.  For the case in question, doing the overrun check as

  if (overrun_p
      && can_div_trunc_p (group_size
                          * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
                          nunits, &tem, &remain)
      && maybe_lt (remain + group_size, nunits))

seems to do the trick.
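As a cross-check, a hypothetical standalone model of the poly-int
variant (again not GCC code; can_div_trunc_p and maybe_lt are
approximated by evaluating the SVE-style poly values
nunits = 16 + 16*x, VF = 8 + 8*x for a few values of the runtime
parameter x, while GCC of course decides this symbolically, once):

  #include <stdio.h>

  int
  main ()
  {
    unsigned group_size = 2, gap = 1;
    for (unsigned x = 0; x < 4; x++)
      {
        unsigned nunits = 16 + 16 * x;
        unsigned vf = 8 + 8 * x;
        /* remain as can_div_trunc_p would compute it.  */
        unsigned remain = (group_size * vf - gap) % nunits;
        /* maybe_lt (remain + group_size, nunits): true would mean a
           single peeled iteration may be insufficient.  */
        printf ("x=%u: remain=%u, remain+group_size < nunits? %d\n",
                x, remain, remain + group_size < nunits);
      }
    return 0;
  }

remain works out to nunits - 1 for every x, so remain + group_size is
nunits + 1 and the condition is false throughout: peeling a single
scalar iteration is accepted, matching the fixed-length reasoning above.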
I'm going to test this and will post an updated series for the CI.

Richard.

> Thanks,
> Richard.
> 
> > Thanks,
> > Richard
> > 
> > > +          && ((cpart_size = (1 << ceil_log2 (cremain)))
> > > +              != cnunits)
> > > +          && vector_vector_composition_type
> > > +               (vectype, cnunits / cpart_size,
> > > +                &half_vtype) == NULL_TREE))))
> > >      {
> > >        if (dump_enabled_p ())
> > >          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
> > >                gcc_assert (new_vtype
> > >                            || LOOP_VINFO_PEELING_FOR_GAPS
> > >                                 (loop_vinfo));
> > > +              /* But still reduce the access size to the next
> > > +                 required power-of-two so peeling a single
> > > +                 scalar iteration is sufficient.  */
> > > +              unsigned HOST_WIDE_INT cremain;
> > > +              if (remain.is_constant (&cremain))
> > > +                {
> > > +                  unsigned HOST_WIDE_INT cpart_size
> > > +                    = 1 << ceil_log2 (cremain);
> > > +                  if (known_gt (nunits, cpart_size)
> > > +                      && constant_multiple_p (nunits, cpart_size,
> > > +                                              &num))
> > > +                    {
> > > +                      tree ptype;
> > > +                      new_vtype
> > > +                        = vector_vector_composition_type (vectype,
> > > +                                                          num,
> > > +                                                          &ptype);
> > > +                      if (new_vtype)
> > > +                        ltype = ptype;
> > > +                    }
> > > +                }
> > >            }
> > >        }
> > >      tree offset

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)