A previous patch of mine correcting the vectorizer target cost model to properly cost scalar FP ops vs. scalar INT ops regressed 416.gamess by ~10% on all modern x86 archs.

The following mitigates this in the cost modeling by noticing that the
vectorized loop in question has all loads and stores performed strided
(built up from scalar loads/stores) and building upon the pessimization
of strided loads added last year.  The first half is treating strided
stores the same as strided loads, which may make sense (but the latency
and dependence arguments do not count here).  Unfortunately that alone
doesn't make 416.gamess vectorization fail because we end up with
TYPE_VECTOR_SUBPARTS == 2 (AVX256 vectorization is rejected due to
cost reasons already).  Now comes the second half, which is to push it
over the edge by adjusting the previous pessimization to multiply with
TYPE_VECTOR_SUBPARTS + 1 instead of just TYPE_VECTOR_SUBPARTS, which
makes the biggest difference for smaller vectors.

I've benchmarked this on a Haswell machine with SPEC 2006, confirming
the regression is fixed, and re-benchmarked apparent regressions with
3 runs, confirming that was noise and we end up with maybe even a
progression there (see the Bugzilla audit trail for details).

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for trunk?

Note I'm going to apply this as two revisions to allow bisection
between the two changes, first pushing the pessimization of strided
stores and then adjusting the factor.

Thanks,
Richard.

2019-03-15  Richard Biener  <rguent...@suse.de>

	PR target/87561
	* config/i386/i386.c (ix86_add_stmt_cost): Apply strided
	load pessimization to stores as well.

	* config/i386/i386.c (ix86_add_stmt_cost): Pessimize strided
	loads and stores a bit more.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 269683)
+++ gcc/config/i386/i386.c	(working copy)
@@ -50534,14 +50534,15 @@ ix86_add_stmt_cost (void *data, int coun
      latency and execution resources for the many scalar loads
      (AGU and load ports).  Try to account for this by scaling the
      construction cost by the number of elements involved.  */
-  if (kind == vec_construct
+  if ((kind == vec_construct || kind == vec_to_scalar)
       && stmt_info
-      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);