Hi,

while analyzing a test case with deeply nested loops (>7 levels) and double-precision floating-point operations I noticed a performance regression of GCC 6/7 vs. GCC 5 on s390x. It seems to be due to GCC 6 vectorizing something GCC 5 couldn't: each loop iterates over three dimensions, and some of the inner loops are fully unrolled until we are left with straight-line code of roughly 2000 insns that GCC 5 simply executes three times. GCC 6 instead vectorizes two iterations and adds a scalar epilogue for the third. The epilogue code is so bad that it slows down execution by at least 50%; it uses only two hard registers and lots of spill slots. My analysis is not complete yet, but I believe the reason is that register pressure is high in the epilogue and the live ranges span both the vectorized code and the epilogue.
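To give a rough idea of the shape, here is a made-up sketch (names and dimensions are invented, the real code is far larger, and the exact unrolling/vectorization decisions of course depend on the actual code). The idea is that the inner j/k loops get fully unrolled, so the three-iteration i loop is what the vectorizer ends up looking at:

  #define N 3   /* trip count of the loop that ends up being vectorized */
  #define M 8   /* inner dimensions, small enough to be fully unrolled */

  void
  foo (double a[N][M][M], const double b[N][M][M], const double c[N][M][M])
  {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < M; j++)
        for (int k = 0; k < M; k++)
          a[i][j][k] += b[i][j][k] * c[i][j][k];
  }

With 128-bit vector registers the vectorization factor for double is 2, so the three iterations become one vectorized iteration plus a scalar epilogue iteration, and the epilogue is where the spilling shows up.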
Even reduced, the test case is huge, so I didn't include it. Some high-level questions instead:

- Has anybody else observed similar problems and found a way around them?

- Is there some way around the register pressure/long live ranges? Perhaps something we could/should fix in the s390 backend? (Probably hard to tell without the source.)

- Would it make sense to allow a backend to specify the minimum number of loop iterations considered for vectorization? Is this perhaps already possible somehow? I added a check that disables vectorization for loops with <= 3 iterations; it shows no regressions and improves two SPEC benchmarks noticeably. I'm even considering <= 5, since a vectorization factor of 4 should exhibit the same problematic pattern.

Regards
 Robin