On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <a...@firstfloor.org> wrote:
> Teresa Johnson <tejohn...@google.com> writes:
>
> Interesting optimization. I would be concerned a little bit
> about compile time, does it make a measurable difference?
I haven't measured compile time explicitly, but I don't think it should
make a measurable difference, especially after I address your efficiency
suggestion (see below), since it will just make one pass over the
instructions in innermost loops.

>> The attached patch detects loops containing instructions that tend to
>> incur high LCP (length-changing prefix) stalls on Core i7, and limits
>> their unroll factor to try to keep the unrolled loop body small enough
>> to fit in the Core i7's loop stream detector, which can hide LCP
>> stalls in loops.
>
> One more optimization would be to optimize padding for this case:
> the LSD only works if the loop is not spread over too many 32-byte
> chunks. So if you detect that the loop is LSD-worthy, always pad to
> 32 bytes at the beginning.

Thanks for the suggestion, I will look at doing that in follow-on tuning.

>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling loops with a
>> constant trip count. Previously it was only called for runtime
>> computed trip counts. Andreas, can you comment on the effect for s390
>> of this additional call of the target hook, since I can't measure
>> that?
>
> On Sandy Bridge there's also the decoded icache, which is much larger
> but also has some restrictions. It would be nice if this optimization
> were general enough to handle this case too.
>
> In general I notice that the tree loop unroller has been too aggressive
> recently: a lot of loops that probably shouldn't be unrolled (like
> those containing function calls, etc.) are unrolled at -O3. So a better
> cost model for unrolling would probably make sense anyway.

These are both good suggestions, and I will look into Sandy Bridge
heuristics in follow-on work, since we will need to tune for that soon.

>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>
> You could cache the loop body and pass it in here.

That is a great idea, and in fact, I think I will do away with this
separate function completely for this patch. I can more efficiently look
for the FP computation while I am looking for the half-word stores. I'll
do that and send a follow-up with the new patch.

> Patch looks reasonable to me, but I cannot approve.

Thanks!
Teresa

> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only

--
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
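
P.S. To make the merged scan concrete, here is a rough sketch of the kind
of thing I have in mind: one pass over the loop body that records both
properties at once, plus a hypothetical x86 unroll-adjust hook that
consumes the flags. This is illustrative only, not the code from the
patch; analyze_loop_insns and LSD_INSN_LIMIT are names I made up for this
mail (the Core i7 LSD holds on the order of 28 uops, so the limit would
be tuned around that), and the mode checks are simplified relative to
what the real patch will do:

  /* Scan the body of LOOP once, recording whether it contains any
     half-word (HImode) stores, which take the 0x66 operand-size prefix
     and hence risk LCP stalls, and whether it contains floating point
     computation.  */
  static void
  analyze_loop_insns (struct loop *loop, bool *has_hword_store,
                      bool *has_fp)
  {
    basic_block *body = get_loop_body (loop);
    unsigned i;
    rtx insn, set;

    *has_hword_store = false;
    *has_fp = false;

    for (i = 0; i < loop->num_nodes; i++)
      FOR_BB_INSNS (body[i], insn)
        {
          if (!INSN_P (insn))
            continue;
          set = single_set (insn);
          if (!set)
            continue;

          /* A store of a half-word value needs the 0x66 prefix.  */
          if (MEM_P (SET_DEST (set))
              && GET_MODE (SET_DEST (set)) == HImode)
            *has_hword_store = true;

          /* Treat any set computing a floating-point-mode value as FP
             computation.  */
          if (FLOAT_MODE_P (GET_MODE (SET_DEST (set))))
            *has_fp = true;
        }

    free (body);
  }

  /* Hypothetical x86 TARGET_LOOP_UNROLL_ADJUST: when the loop has
     LCP-prone stores and no FP work, cap NUNROLL so the unrolled body
     still fits in the loop stream detector (approximating uops by
     insn count).  */
  static unsigned
  ix86_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
  {
    bool has_hword_store, has_fp;
    unsigned limit;

    analyze_loop_insns (loop, &has_hword_store, &has_fp);
    if (has_hword_store && !has_fp)
      {
        limit = LSD_INSN_LIMIT / MAX (num_loop_insns (loop), 1);
        nunroll = MIN (nunroll, MAX (limit, 1));
      }
    return nunroll;
  }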