On Fri, 15 Mar 2019, Jan Hubicka wrote:

> > 
> > A previous patch of mine correcting the vectorizer target cost model
> > to properly cost scalar FP ops vs. scalar INT ops regressed
> > 416.gamess by ~10% on all modern x86 archs.
> > 
> > The following mitigates this in the cost modeling by noticing
> > the vectorized loop in question has all loads and stores performed
> > strided (built up from scalar loads/stores) and building upon
> > the pessimization of strided loads added last year.
> > 
> > The first half is treating strided stores the same as strided
> > loads which may make sense (but the latency and dependence
> > arguments do not count here).  Unfortunately that alone
> > doesn't make 416.gamess vectorization fail because we end up
> > with TYPE_VECTOR_SUBPARTS == 2 (AVX256 vectorization is rejected
> > due to cost reasons already).  Now comes the second half
> > which is to push it over the edge, adjusting the previous
> > pessimization by multiplying with TYPE_VECTOR_SUBPARTS + 1
> > instead of just TYPE_VECTOR_SUBPARTS which makes the biggest
> > difference for smaller vectors.
> > 
> > I've benchmarked this on a Haswell machine with SPEC 2006
> > confirming the regression is fixed and re-benchmarked
> > apparent regressions with 3 runs confirming that was noise
> > and we end up with maybe even a progression there
> > (see the bugzilla audit-trail for details).
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > 
> > OK for trunk?
> > 
> > Note I'm going to apply as two revisions to allow bisection
> > between the two changes, first pushing pessimizing strided
> > stores and then adjusting the factor.
> > 
> > Thanks,
> > Richard.
> > 
> > 2019-03-15  Richard Biener  <rguent...@suse.de>
> > 
> >         PR target/87561
> >         * config/i386/i386.c (ix86_add_stmt_cost): Apply strided
> >         load pessimization to stores as well.
> >         * config/i386/i386.c (ix86_add_stmt_cost): Pessimize strided
> >         loads and stores a bit more.
> 
> Looks good to me.  Store costs are even more iffy than other costs
> because they are not part of dependency chain, so I guess whatever seems
> to work best in practice is good.

Applied as r269753 and r269754.  Please report any issue with this.

Richard.
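
For anyone reading the thread later, the effect of the two revisions on the scaling factor can be sketched roughly as below.  This is only an illustration of the arithmetic described in the patch mail, not the code in ix86_add_stmt_cost: the function names, the bool flags and the plain int base cost are made up for the example, and nunits stands in for TYPE_VECTOR_SUBPARTS.

```c
#include <stdbool.h>

/* Hypothetical model of the cost scaling before the two revisions:
   only strided loads were pessimized, by a factor of nunits.  */
static int
cost_before (int stmt_cost, int nunits, bool strided, bool is_store)
{
  if (strided && !is_store)
    return stmt_cost * nunits;
  return stmt_cost;
}

/* Hypothetical model after both revisions: strided loads and strided
   stores are treated alike and the factor is nunits + 1.  */
static int
cost_after (int stmt_cost, int nunits, bool strided)
{
  if (strided)
    return stmt_cost * (nunits + 1);
  return stmt_cost;
}
```

With nunits == 2 the factor for a strided load goes from 2 to 3 (+50%), while with nunits == 8 it goes from 8 to 9 (+12.5%), which is the sense in which the adjustment makes the biggest difference for smaller vectors.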