Don't think it makes any difference, but:

Richard Biener <rguent...@suse.de> writes:
> @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>             access excess elements.
>             ??? Enhancements include peeling multiple iterations
>             or using masked loads with a static mask.  */
> -          || (group_size * cvf) % cnunits + group_size - gap < cnunits))
> +          || ((group_size * cvf) % cnunits + group_size - gap < cnunits
> +              /* But peeling a single scalar iteration is enough if
> +                 we can use the next power-of-two sized partial
> +                 access.  */
> +              && ((cremain = (group_size * cvf - gap) % cnunits), true
...this might be less surprising as:

              && ((cremain = (group_size * cvf - gap) % cnunits, true)

in terms of how the &&s line up.

Thanks,
Richard

> +                  && ((cpart_size = (1 << ceil_log2 (cremain)))
> +                      != cnunits)
> +                  && vector_vector_composition_type
> +                       (vectype, cnunits / cpart_size,
> +                        &half_vtype) == NULL_TREE))))
>         {
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
>                         gcc_assert (new_vtype
>                                     || LOOP_VINFO_PEELING_FOR_GAPS
>                                          (loop_vinfo));
> +                       /* But still reduce the access size to the next
> +                          required power-of-two so peeling a single
> +                          scalar iteration is sufficient.  */
> +                       unsigned HOST_WIDE_INT cremain;
> +                       if (remain.is_constant (&cremain))
> +                         {
> +                           unsigned HOST_WIDE_INT cpart_size
> +                             = 1 << ceil_log2 (cremain);
> +                           if (known_gt (nunits, cpart_size)
> +                               && constant_multiple_p (nunits, cpart_size,
> +                                                       &num))
> +                             {
> +                               tree ptype;
> +                               new_vtype
> +                                 = vector_vector_composition_type (vectype,
> +                                                                   num,
> +                                                                   &ptype);
> +                               if (new_vtype)
> +                                 ltype = ptype;
> +                             }
> +                         }
>                     }
>                 }
>               tree offset
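
For concreteness, here is a toy, standalone sketch of the arithmetic behind
the new check (the example numbers and the local ceil_log2 stand-in are mine,
not from the patch, and the target-dependent check via
vector_vector_composition_type is left out): compute how many elements the
final vector access still needs, round that up to the next power of two, and
see how many parts of that size make up a full vector.

#include <stdio.h>

/* Local stand-in for GCC's ceil_log2: smallest n with (1 << n) >= x.  */
static unsigned
sketch_ceil_log2 (unsigned x)
{
  unsigned n = 0;
  while ((1u << n) < x)
    n++;
  return n;
}

int
main (void)
{
  /* Illustrative values only, not taken from the patch.  */
  unsigned group_size = 3;   /* elements accessed per scalar iteration  */
  unsigned cvf = 4;          /* constant vectorization factor  */
  unsigned gap = 1;          /* trailing elements of the group not used  */
  unsigned cnunits = 8;      /* elements in a full vector access  */

  /* Elements the last vector access still needs.  */
  unsigned cremain = (group_size * cvf - gap) % cnunits;

  /* Round up to the next power of two; if this is smaller than a full
     vector, a partial access of cpart_size elements suffices.  */
  unsigned cpart_size = 1u << sketch_ceil_log2 (cremain);

  printf ("remain = %u, part size = %u, parts per full vector = %u\n",
          cremain, cpart_size, cnunits / cpart_size);
  return 0;
}

With these numbers the remainder is 3, the next power of two is 4, and a full
8-element vector can be composed from two 4-element parts, which (as far as I
read the hunks) is the situation where a single peeled scalar iteration plus
the smaller partial access is enough.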