https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
That looks better than the currently generated code (it saves one "MOV"
instruction). Yes, it has the loop-carried dependency advantage. But it still
uses one more register for "8*step" (this could become a register pressure
problem in more complicated code, though not in this simple case).

This is still a floating-point precision problem. PR84201 discusses the same
problem for x86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201. The larger step makes the
floating-point result deviate further from the original scalar result. E.g.
the SPEC2017 fp benchmark 549.fotonik may result in a VE (Validation Error)
after unrolling a loop over doubles:
   319    do ifreq = 1, tmppower%nofreq <------ HERE
   320      frequency(ifreq,ipower) = freq
   321      freq = freq + freqstep
   322    end do

It uses 4*step for the unrolled vectorized version rather than the 2*step used
for the non-unrolled vectorized version. SPEC checks the "relative tolerance"
of the fp results, and the resulting error exceeds the current limit (i.e. the
compare command line option "--reltol 1e-10").
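
To illustrate why the step size matters, here is a minimal sketch in Python
(whose floats are IEEE doubles). It is a model of the arithmetic only, not of
GCC's actual code generation: the helper names are hypothetical, and the lane
initialisation "start + k*step" is an assumption. It compares the original
one-step-at-a-time induction against versions that keep 2 or 4 running values
and advance each by 2*step or 4*step.

```python
def scalar_induction(start, step, n):
    """Reference: add `step` once per element, as in the original loop."""
    out = []
    value = start
    for _ in range(n):
        out.append(value)
        value = value + step
    return out


def strided_induction(start, step, n, lanes):
    """Model of a vectorized induction (assumed form, not GCC's exact code):
    `lanes` running values, each advanced by lanes*step per iteration."""
    current = [start + k * step for k in range(lanes)]
    stride = lanes * step
    out = []
    while len(out) < n:
        out.extend(current)
        current = [v + stride for v in current]
    return out[:n]


def max_rel_diff(reference, candidate):
    """Largest relative difference between the two sequences."""
    return max(
        (abs(r - c) / abs(r) for r, c in zip(reference, candidate) if r != 0.0),
        default=0.0,
    )


if __name__ == "__main__":
    ref = scalar_induction(1.0, 0.1, 1000)
    for lanes in (2, 4):
        diff = max_rel_diff(ref, strided_induction(1.0, 0.1, 1000, lanes))
        print(f"{lanes}*step: max relative difference = {diff}")
```

With a step that is exactly representable (e.g. 0.5) all three versions agree
bit for bit. With a step such as 0.1 the rounding errors accumulate differently
in each version, so the results can diverge from the scalar sequence. Whether
that divergence crosses a given tolerance depends on the trip count and the
values involved.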
