https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57952
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The reason why #c1 (as well as #c0) is only vectorized using a vector length of 4
rather than 8 is that the loop iterator is cast to float and is therefore needed
inside the loop in vector registers:
pr57952.C:21:20: note: op not supported by target.
pr57952.C:21:20: note: not vectorized: relevant stmt not supported: i_16 = i_41 + 1;
pr57952.C:21:20: note: bad operation or unsupported loop bound.
and AVX doesn't support V8SImode addition.
Now, perhaps we could have an optimization that in that case if all the
iterators can be provably exactly represented in the floating point value we
could try to do what the programmer should have done, i.e. add a float iterator
that is set to 1.0f and incremented in each iteration and used instead of
float(i). But it won't work in this case, because you need 24 bits for the
iterator and float only has 23 bit mantissa.
for (int k=0; k!=100; ++k) {
  float c = 1.f/10000000.f;
  float fi = 1.f;
  for (int i=1; i<10000001; ++i) { s += polyHorner((fi+float(k))*c); fi += 1.f; }
}
is vectorized with -Ofast -mavx just fine, with a vectorization factor of 8.
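
For anyone who wants to compare the two loop shapes side by side, here is a
minimal self-contained sketch. The body of polyHorner is not shown in this
comment, so the polynomial below is a made-up stand-in, and the function names
sumWithCast, sumWithFloatIterator and floatIteratorMatchesCast are hypothetical.
The trip count used in the check is kept small so that every iterator value is
trivially exact in float.

```cpp
#include <cassert>

// Hypothetical stand-in for polyHorner from the testcase; the coefficients
// are invented for illustration only.
inline float polyHorner(float y) {
    return 1.0f + y * (0.5f + y * (0.25f + y * 0.125f));
}

// Shape from #c0/#c1: the int iterator is converted to float in every
// iteration, so the vectorizer has to keep the iterator in integer vectors.
float sumWithCast(int n, int k, float c) {
    float s = 0.f;
    for (int i = 1; i < n; ++i)
        s += polyHorner((float(i) + float(k)) * c);
    return s;
}

// Rewritten shape: a separate float induction variable replaces float(i),
// so no integer vector addition is needed inside the loop.
float sumWithFloatIterator(int n, int k, float c) {
    float s = 0.f;
    float fi = 1.f;
    for (int i = 1; i < n; ++i) {
        s += polyHorner((fi + float(k)) * c);
        fi += 1.f;
    }
    return s;
}

// Checks that the float induction variable tracks float(i) exactly for
// every i in [1, n); this is the property the rewrite relies on.
bool floatIteratorMatchesCast(int n) {
    float fi = 1.f;
    for (int i = 1; i < n; ++i) {
        if (fi != float(i))
            return false;
        fi += 1.f;
    }
    return true;
}
```

Building a file like this with -Ofast -mavx -fopt-info-vec is one way to see
what the vectorizer reports for each of the two loops.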
As for #c2/#c3, GCC 4.9 is not supported anymore, and the dumps are too large to
find out what exactly you mean by efficient and not efficient. Both the ICC and
GCC generated assemblies use both %ymm and %xmm registers, depending on what
exactly they need.