Hi,

while analyzing a test case with many nested loops (>7) and
double-precision floating-point operations, I noticed a performance
regression of GCC 6/7 vs. GCC 5 on s390x. It seems to be due to GCC 6
vectorizing something GCC 5 couldn't.
Basically, each loop iterates over three dimensions, and we fully unroll
some of the inner loops until we have straight-line code of roughly 2000
insns that is executed three times in GCC 5. GCC 6 vectorizes two
iterations and adds a scalar epilogue for the third iteration. The
epilogue code is so bad that it slows down execution by at least
50%: it uses only two hard registers and lots of spill slots.
Although my analysis is not yet complete, I believe this is because
register pressure is high in the epilogue and the live ranges span the
vectorized code as well as the epilogue.
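
To illustrate the shape I mean, here is a hand-written C model. It is
not taken from the test case: the function names, the loop body and the
vectorization factor of 2 (two doubles per vector) are assumptions made
only for illustration, and the "vector" part is written as two scalar
statements rather than real vector code.

```c
/* Illustrative constants: a trip count of 3 and an assumed
   vectorization factor of 2 (two doubles per vector register). */
enum { TRIP_COUNT = 3, VECT_FACTOR = 2 };

/* GCC 5 shape: the (unrolled) body is simply executed three times. */
void scale_scalar(double *dst, const double *src, double k)
{
    for (int i = 0; i < TRIP_COUNT; i++)
        dst[i] = src[i] * k + 1.0;
}

/* GCC 6 shape, modeled by hand: one "vector" iteration covers
   elements 0 and 1, then a scalar epilogue covers element 2.
   Values live across both parts have long live ranges. */
void scale_vectorized_shape(double *dst, const double *src, double k)
{
    int i = 0;

    /* Stand-in for the vectorized loop: VECT_FACTOR elements per step. */
    for (; i + VECT_FACTOR <= TRIP_COUNT; i += VECT_FACTOR) {
        dst[i]     = src[i]     * k + 1.0;
        dst[i + 1] = src[i + 1] * k + 1.0;
    }

    /* Scalar epilogue: the remaining TRIP_COUNT % VECT_FACTOR elements. */
    for (; i < TRIP_COUNT; i++)
        dst[i] = src[i] * k + 1.0;
}
```

With a trip count of 3, the first loop runs exactly once and the
epilogue runs exactly once, so the epilogue accounts for a third of the
work rather than being a rare tail case.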

Even reduced, the test case is huge, therefore I didn't include it. Some
high-level questions instead:

- Has anybody else observed similar problems and found a way around
them?

- Is there some way around the register pressure/long live ranges?
Perhaps something we could/should fix in the s390 backend? (Probably
hard to tell without source)

- Would it make sense to allow a backend to specify the minimum number
of loop iterations considered for vectorization? Is this
perhaps already possible somehow? I added a check to disable
vectorization for loops with <= 3 iterations that shows no regressions
and improves two SPEC benchmarks noticeably. I'm even considering <=5,
since a vectorization factor of 4 should exhibit the same problematic
pattern.
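
The arithmetic behind the <= 3 and <= 5 thresholds can be sketched as
follows. This is a standalone model, not GCC code: `split_iterations`
and `allow_vectorization` are hypothetical helpers, not existing
vectorizer or backend interfaces.

```c
#include <stdbool.h>

/* How a loop with a known trip count splits for a given
   vectorization factor. */
struct vect_split {
    unsigned vector_iters;   /* iterations of the vectorized loop */
    unsigned epilogue_iters; /* leftover scalar iterations */
};

/* Hypothetical helper; assumes vf is nonzero. */
struct vect_split split_iterations(unsigned niters, unsigned vf)
{
    struct vect_split s = { niters / vf, niters % vf };
    return s;
}

/* Hypothetical form of the check: refuse to vectorize loops whose
   trip count is at or below a threshold (3 or 5 in the text). */
bool allow_vectorization(unsigned niters, unsigned min_iters)
{
    return niters > min_iters;
}
```

In this model, 3 iterations at a factor of 2 and 5 iterations at a
factor of 4 both split into one vector iteration plus one epilogue
iteration, which is why I would expect the same pattern in both cases.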

Regards
 Robin
