Teresa Johnson <tejohn...@google.com> writes:

Interesting optimization. I would be a little concerned about compile time; does it make a measurable difference?
> The attached patch detects loops containing instructions that tend to
> incur high LCP (length-changing prefix) stalls on Core i7, and limits
> their unroll factor to try to keep the unrolled loop body small enough
> to fit in the Core i7's loop stream detector, which can hide LCP stalls
> in loops.

One more optimization would be to optimize padding for this case: the LSD only works if the loop is not spread over too many 32-byte chunks. So if you detect that the loop is LSD-worthy, always pad it to a 32-byte boundary at the beginning.

> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
> hook, which was previously only defined for s390. I added one
> additional call to this target hook, when unrolling for constant trip
> count loops. Previously it was only called for runtime computed trip
> counts. Andreas, can you comment on the effect for s390 of this
> additional call of the target hook, since I can't measure that?

On Sandy Bridge there is also the decoded icache, which is much larger but also has some restrictions. It would be nice if this optimization were general enough to handle that case too.

In general I notice that the tree loop unroller has become too aggressive recently: a lot of loops that probably shouldn't be unrolled (e.g. loops containing function calls) are unrolled at -O3. So a better cost model for unrolling would probably make sense anyway.

> +  /* Don't reduce unroll factor in loops with floating point
> +     computation, which tend to benefit more heavily from
> +     larger unroll factors and are less likely to bottleneck
> +     at the decoder. */
> +  has_FP = loop_has_FP_comp (loop);

You could cache the loop body and pass it in here.

Patch looks reasonable to me, but I cannot approve.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only