On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <a...@firstfloor.org> wrote:
> Teresa Johnson <tejohn...@google.com> writes:
>
> Interesting optimization. I would be concerned a little bit
> about compile time, does it make a measurable difference?
>
>> The attached patch detects loops containing instructions that tend to
>> incur high LCP (loop changing prefix) stalls on Core i7, and limits
>> their unroll factor to try to keep the unrolled loop body small enough
>> to fit in the Corei7's loop stream detector which can hide LCP stalls
>> in loops.
>
> One more optimization would be to optimize padding for this case,
> the LSD only works if the loop is not spread over too many 32 byte
> chunks. So if you detect the loop is LSD worthy always pad to 32 bytes
> at the beginning.
>
>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling for constant trip
>> count loops. Previously it was only called for runtime computed trip
>> counts. Andreas, can you comment on the effect for s390 of this
>> additional call of the target hook, since I can't measure that?
>
> On Sandy-Bridge there's also the decoded icache which is much larger,
> but also has some restrictions. It would be nice if this optimization
> was general enough to handle this case too.
>
> In general I notice that the tree loop unroller is too aggressive recently:
> a lot of loops that probably shouldn't be unrolled (like containing
> function calls etc.) are unrolled at -O3. So probably a better cost
> model for unrolling would make sense anyways.
>

Yes, I believe there are many target specific unroll tunings that can
be done -- the current unroll target independent cost/benefit analysis
is too weak.

David



>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>
> You could cache the loop body and pass it in here.
>
> Patch looks reasonable to me, but I cannot approve.
>
> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only

Reply via email to