On Thu, Jan 26, 2017 at 10:18 AM, Robin Dapp <rd...@linux.vnet.ibm.com> wrote:
> Hi,
>
> while analyzing a test case with a lot of nested loops (>7) and double
> floating point operations I noticed a performance regression of GCC 6/7
> vs GCC 5 on s390x. It seems due to GCC 6 vectorizing something GCC 5
> couldn't.
> Basically, each loop iterates over three dimensions, we fully unroll
> some of the inner loops until we have straight-line code of roughly 2000
> insns that are being executed three times in GCC 5. GCC 6 vectorizes two
> iterations and adds a scalar epilogue for the third iteration. The
> epilogue code is so bad that it slows down the execution by at least
> 50%, using only two hard registers and lots of spill slots.
> Although my analysis is not completed, I believe this is because
> register pressure is high in the epilogue and the live ranges span the
> vectorized code as well as the epilogue.
>
> Even reduced, the test case is huge, therefore I didn't include it. Some
> high-level questions instead:
>
> - Has anybody else observed similar problems and got around them?

Yes, I think so. We also have cases where GCC vectorizes with a larger
vect_factor, which causes regressions too.

>
> - Is there some way around the register pressure/long live ranges?

I am doing some experiments on calculating coarse-grained register
pressure for GIMPLE loops, but the motivation does not come from the
vectorizer; it comes from predcom/pre, as in PR77498.

> Perhaps something we could/should fix in the s390 backend? (Probably
> hard to tell without source)
>
> - Would it make sense to allow a backend to specify the minimal number
> of loop iterations considered for vectorization? Is this
> perhaps already possible somehow? I added a check to disable
> vectorization for loops with <= 3 iterations that shows no regressions
> and improves two SPEC benchmarks noticeably. I'm even considering <=5,
> since a vectorization factor of 4 should exhibit the same problematic
> pattern.

Is the niter number known at compilation time? If yes, I am surprised by
GCC's behavior here on loops with so few iterations. Is it the cost model?

Thanks,
bin

>
> Regards
> Robin
>
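
For reference, a minimal sketch of the shape being discussed: a loop with
a compile-time trip count of 3 and a vectorization factor of 2. This is
not Robin's test case (which was not posted); the function names and the
loop body are hypothetical and only illustrate how one vector step plus a
one-iteration scalar epilogue arise. The second function is a hand-written
picture of that split in plain C, not what GCC actually emits.

```c
#define N 3 /* trip count known at compile time, as in the reported case */

/* Scalar form: the loop body runs three times. */
void scale_add(double *restrict out, const double *restrict a,
               const double *restrict b, double k)
{
    for (int i = 0; i < N; i++)
        out[i] = a[i] * k + b[i];
}

/* Hand-written illustration of the same loop split for VF = 2:
   the main loop covers elements 0 and 1 in a single step, and the
   scalar epilogue handles the remaining N % 2 == 1 element. */
void scale_add_vf2(double *restrict out, const double *restrict a,
                   const double *restrict b, double k)
{
    int i = 0;

    /* "Vector" part: two elements per step (runs once for N == 3). */
    for (; i + 2 <= N; i += 2) {
        out[i]     = a[i]     * k + b[i];
        out[i + 1] = a[i + 1] * k + b[i + 1];
    }

    /* Scalar epilogue: the leftover third iteration. */
    for (; i < N; i++)
        out[i] = a[i] * k + b[i];
}
```

When the body is very large, values live across both parts have to be kept
around for the epilogue as well. To compare against the non-vectorized
code, the existing -fno-tree-vectorize option can be used; the
-fvect-cost-model= option may also be worth trying.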