epilogue for modern CPUs

Xinliang David Li Wed, 12 Dec 2012 22:09:31 -0800

On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > libcall is not faster up to 8KB to rep sequence that is better for 
>> > regalloc/code
>> > cache than fully blowin function call.
>>
>> Be careful with this. My recollection is that REP sequence is good for
>> any size -- for smaller size, the REP initial set up cost is too high
>> (10s of cycles), while for large size copy, it is less efficient
>> compared with library version.
>
> Well this is based on the data from the memtest script.
> Core has good REP implementation - it is a win from rather small blocks (16
> bytes if I recall) and it does not need alignment.
> Library version starts to be interesting with caching hints, but I think till 
> 80KB
> it is still not a win for my setup (glibc-2.15)


A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on core2 and corei7) except for size
smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest ?

thanks,

David

>> >> >
>> >> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix 
>> >> > stall
>> >> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >> >    m_K6,
>> >> >
>> >> >    /* X86_TUNE_USE_CLTD */
>> >> > -  ~(m_PENT | m_ATOM | m_K6),
>> >> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>> My change was to enable CLTD for generic. Is your change intended to
>> revert that?
>
> No, it is merge conflict, sorry.  I will update it in my tree.
>> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>> >> >
>>
>>
>> K8 and K10 partitions the flags into groups. References to flags to
>> the same group can still cause the stall -- not sure how that can be
>> handled.
>
> I  belive the stalls happends only in quite special cases where compare 
> instruction
> combines flags from multiple instructions.  GCC don't generate this type of 
> code, so
> we should be safe.
>
> Honza

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Reply via email to