On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <a...@firstfloor.org> wrote: > Teresa Johnson <tejohn...@google.com> writes: > > Interesting optimization. I would be concerned a little bit > about compile time, does it make a measurable difference? > >> The attached patch detects loops containing instructions that tend to >> incur high LCP (loop changing prefix) stalls on Core i7, and limits >> their unroll factor to try to keep the unrolled loop body small enough >> to fit in the Corei7's loop stream detector which can hide LCP stalls >> in loops. > > One more optimization would be to optimize padding for this case, > the LSD only works if the loop is not spread over too many 32 byte > chunks. So if you detect the loop is LSD worthy always pad to 32 bytes > at the beginning. > >> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target >> hook, which was previously only defined for s390. I added one >> additional call to this target hook, when unrolling for constant trip >> count loops. Previously it was only called for runtime computed trip >> counts. Andreas, can you comment on the effect for s390 of this >> additional call of the target hook, since I can't measure that? > > On Sandy-Bridge there's also the decoded icache which is much larger, > but also has some restrictions. It would be nice if this optimization > was general enough to handle this case too. > > In general I notice that the tree loop unroller is too aggressive recently: > a lot of loops that probably shouldn't be unrolled (like containing > function calls etc.) are unrolled at -O3. So probably a better cost > model for unrolling would make sense anyways. >
Yes, I believe there are many target specific unroll tunings that can be done -- the current unroll target independent cost/benefit analysis is too weak. David >> + /* Don't reduce unroll factor in loops with floating point >> + computation, which tend to benefit more heavily from >> + larger unroll factors and are less likely to bottleneck >> + at the decoder. */ >> + has_FP = loop_has_FP_comp(loop); > > You could cache the loop body and pass it in here. > > Patch looks reasonable to me, but I cannot approve. > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only