https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Hmm, the sequence does not use + at all, but I think I know what is going on.
While the field is called addss, it is used as a kitchen sink for all other
simple operations.
  /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
     require extra 4 mul, 4 add, 4 cmp and 2 shift. */
  if (!TARGET_SSE4_1 && !uns_p)
    extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
                 + cost->sse_op * 2;
....
    case FLOAT_EXTEND:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
        *total = 0;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
....
    case FLOAT_TRUNCATE:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
        *total = cost->fadd;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
...
      case scalar_stmt:
        return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
....
      case vector_stmt:
        return ix86_vec_cost (mode,
                              fp ? ix86_cost->addss : ix86_cost->sse_op);
Only addss was sped up (and apparently only in common contexts); other simple
SSE operations are still 3 cycles.
We have
const int sse_op; /* cost of cheap SSE instruction. */
const int addss; /* cost of ADDSS/SD SUBSS/SD instructions. */
SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so
perhaps we also want to have sse_fp_op /* cost of cheap SSE fp instruction. */
in addition to addss.
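A minimal standalone sketch of that split, not GCC code: the struct is a reduced, hypothetical stand-in for the real cost table, sse_fp_op is the proposed (not yet existing) field, and the cycle counts are illustrative only. It shows how a float conversion that is charged addss today would keep the higher cost once it reads sse_fp_op instead.

```c
/* Mirrors GCC's COSTS_N_INSNS scaling (one instruction = 4 cost units).  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Reduced, hypothetical cost table; only the fields discussed here.  */
struct sse_costs
{
  int sse_op;     /* cost of cheap SSE instruction.  */
  int addss;      /* cost of ADDSS/SD SUBSS/SD instructions.  */
  int sse_fp_op;  /* proposed: cost of cheap SSE fp instruction.  */
};

/* Illustrative numbers only, not taken from any real tuning table.  */
static const struct sse_costs example_costs = {
  COSTS_N_INSNS (1),  /* sse_op */
  COSTS_N_INSNS (2),  /* addss: the sped-up case (assumed value) */
  COSTS_N_INSNS (3),  /* sse_fp_op: other simple fp operations */
};

/* Cost of a float conversion as it is charged today (via addss).  */
static int
float_convert_cost_current (const struct sse_costs *cost)
{
  return cost->addss;
}

/* Cost of the same conversion when it reads the proposed field.  */
static int
float_convert_cost_proposed (const struct sse_costs *cost)
{
  return cost->sse_fp_op;
}
```

With these assumed numbers, lowering addss no longer makes conversions look cheaper than they are.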
But to be precise, builtin_vectorizer cost would need to know whether
scalar/vector_stmt is an addition or something else, which AFAIK it doesn't.
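
To illustrate what that extra information would buy, here is a hedged sketch of a scalar statement cost query that is told which operation it is costing. Everything in it is hypothetical: the enum stmt_op, the struct fp_stmt_costs and the function name are invented for the example, and the real hook does not receive this information as far as I know.

```c
/* Mirrors GCC's COSTS_N_INSNS scaling (one instruction = 4 cost units).  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical operation kinds the cost query would need to see.  */
enum stmt_op
{
  OP_FP_PLUS,
  OP_FP_MINUS,
  OP_FP_OTHER,   /* any other simple fp operation */
  OP_INTEGER
};

/* Reduced, hypothetical cost table.  */
struct fp_stmt_costs
{
  int addss;      /* cost of ADDSS/SD SUBSS/SD instructions.  */
  int sse_fp_op;  /* proposed: cost of cheap SSE fp instruction.  */
};

/* Scalar statement cost that separates additions from other fp work.
   Integer statements keep the COSTS_N_INSNS (1) used today.  */
static int
scalar_stmt_cost (const struct fp_stmt_costs *cost, enum stmt_op op)
{
  switch (op)
    {
    case OP_FP_PLUS:
    case OP_FP_MINUS:
      return cost->addss;
    case OP_FP_OTHER:
      return cost->sse_fp_op;
    default:
      return COSTS_N_INSNS (1);
    }
}
```

Without the operation kind, the query can only fall back to a single fp cost for every statement, which is the situation described above.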