https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474
Bug ID: 110474
Summary: Vect: the epilog vect loop should have small VF if the
loop is unrolled during vectorization
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hliu at amperecomputing dot com
Target Milestone: ---
Hi, I'm trying to tune loop unrolling during vectorization (see
suggested_unroll_factor in tree-vect-loop.cc). I find that the unrolling may
hurt performance, as it also increases the VF (vectorization factor) of the
epilog vect loop.
For example:
int foo(short *A, char *B, int N) {
  int sum = 0;
  for (int i = 0; i < N; ++i) {
    sum += A[i] * B[i];
  }
  return sum;
}
Compile it with "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 --param
aarch64-vect-unroll-limit=2" (I'm using -mcpu n1 as I want to try a target
without SVE). The GCC vectorization pass unrolls the loop by 2 and generates
code as follows:
if N >= 32:
    main vect loop ...
if N >= 16:  # This may hurt performance if N is small (e.g. 8)
    epilog vect loop ...
epilog scalar code ...
If the loop is not unrolled (i.e. with "--param aarch64-vect-unroll-limit=1"),
GCC generates code as follows:
if N >= 16:
    main vect loop ...
if N >= 8:
    epilog vect loop ...
epilog scalar code ...
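To make the difference concrete, here is a minimal sketch of how N scalar iterations would be split between the three loops. It is a simplified model, not GCC code: the struct and function names are hypothetical, it assumes the thresholds above equal the VFs of the main and epilog vect loops, and it ignores cost-model thresholds, peeling and versioning.

```c
#include <assert.h>

/* Hypothetical model, not GCC code: how N scalar iterations are split
   between the main vect loop, the epilog vect loop and the scalar code.  */
struct iter_split {
  int main_vect;    /* iterations handled by the main vect loop */
  int epilog_vect;  /* iterations handled by the epilog vect loop */
  int scalar;       /* iterations left for the epilog scalar code */
};

static struct iter_split
split_iterations (int n, int main_vf, int epilog_vf)
{
  assert (n >= 0 && main_vf > 0 && epilog_vf > 0);
  struct iter_split s;
  /* The main vect loop consumes whole multiples of its VF.  */
  s.main_vect = (n / main_vf) * main_vf;
  int rem = n - s.main_vect;
  /* The epilog vect loop consumes whole multiples of its VF from the rest.  */
  s.epilog_vect = (rem / epilog_vf) * epilog_vf;
  /* Whatever is left runs as scalar code.  */
  s.scalar = rem - s.epilog_vect;
  return s;
}
```

Under this model, N = 8 with VFs (32, 16) leaves all 8 iterations to the scalar code, while VFs (16, 8) run all 8 in the epilog vect loop.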
The runtime check is based on the VF of epilog vectorization. There is code in
tree-vect-loop.cc (line 2990) to choose epilog vect VF:
  /* If we're vectorizing an epilogue loop, the vectorized loop either needs
     to be able to handle fewer than VF scalars, or needs to have a lower VF
     than the main loop.  */
  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
      && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
    return opt_result::failure_at (vect_location,
                                   "Vectorization factor too high for"
                                   " epilogue loop.\n");
But it doesn't take suggested_unroll_factor into account. So I'm thinking
about adding the following code to unscale orig_loop_vinfo's VF by the
unroll factor:
  unscaled_orig_vf = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
                                orig_loop_vinfo->suggested_unroll_factor);
Is this reasonable?
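
As a rough illustration of the arithmetic only, the sketch below uses plain unsigned integers in place of GCC's poly_uint64, and the helper names are hypothetical stand-ins rather than GCC functions. It assumes the example above, where the unrolled main loop has VF 32 with an unroll factor of 2.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-integer stand-in for the proposed exact_div: the main loop's VF
   divided by the unroll factor that was applied to it.  */
static unsigned
unscaled_orig_vf (unsigned orig_vf, unsigned unroll_factor)
{
  /* An exact division requires the VF to be a multiple of the factor.  */
  assert (unroll_factor > 0 && orig_vf % unroll_factor == 0);
  return orig_vf / unroll_factor;
}

/* Same shape as the existing check when partial vectors are not used:
   an epilog VF is rejected if it is not below the limit.  */
static bool
epilog_vf_too_high (unsigned epilog_vf, unsigned limit_vf)
{
  return epilog_vf >= limit_vf;
}
```

In this model an epilog VF of 16 passes the check against the unrolled VF of 32, but is rejected against the unscaled VF of 16, so only an epilog VF of 8 or lower would be accepted.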