https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449
--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
That looks better than the currently generated code (it saves one "MOV" instruction). Yes, it has the loop-carried dependency advantage. But it still uses one more register for "8*step" (this may cause a register pressure problem in complicated code, though not in this simple case).

There is still a floating-point precision problem. PR84201 discusses the same problem for x86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201

A larger step makes the floating-point result diverge further from the original scalar result. E.g. the SPEC2017 fp benchmark 549.fotonik may fail with a VE (Validation Error) after unrolling a loop over doubles:

    do ifreq = 1, tmppower%nofreq      <------ HERE
      frequency(ifreq,ipower) = freq
      freq = freq + freqstep
    end do

It uses 4*step in the unrolled vectorized version rather than the 2*step of the non-unrolled vectorized version. The SPEC fp validation checks the relative tolerance of the fp results, and the error here exceeds the current limit (i.e. the compare command-line option "--reltol 1e-10").
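
To make the precision argument concrete, here is a minimal scalar model in C of the two accumulation orders. It is only a sketch, not what GCC actually emits: the function names (fill_scalar, fill_strided, max_rel_diff) are hypothetical, and it assumes each lane starts at start + j*step and then advances by lanes*step per iteration. Setting lanes to 2 or 4 mimics the 2*step and 4*step cases above.

```c
#include <stddef.h>

/* Scalar reference: freq advances by `step` once per element,
   as in the original (non-vectorized) loop. */
void fill_scalar(double *out, size_t n, double start, double step)
{
    double freq = start;
    for (size_t i = 0; i < n; i++) {
        out[i] = freq;
        freq = freq + step;
    }
}

/* Hypothetical model of a vectorized induction: lane j starts at
   start + j*step (an assumption about the initial vector) and then
   advances by lanes*step on every vector iteration. */
void fill_strided(double *out, size_t n, double start, double step,
                  size_t lanes)
{
    double big_step = (double)lanes * step;
    for (size_t j = 0; j < lanes && j < n; j++) {
        double freq = start + (double)j * step;
        for (size_t i = j; i < n; i += lanes) {
            out[i] = freq;
            freq = freq + big_step;
        }
    }
}

/* Largest relative difference |a[i] - b[i]| / |a[i]|, skipping
   elements where a[i] is zero. */
double max_rel_diff(const double *a, const double *b, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = a[i] - b[i];
        double scale = a[i];
        if (diff < 0.0) diff = -diff;
        if (scale < 0.0) scale = -scale;
        if (scale > 0.0 && diff / scale > worst)
            worst = diff / scale;
    }
    return worst;
}
```

With a step that is exactly representable (e.g. 0.25) the two orders agree bit for bit. With a step such as 0.1 the rounding errors accumulate differently, and max_rel_diff() can be compared against a tolerance such as 1e-10 to see whether a given start, step and trip count would be flagged.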