On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > libcall is not faster than the rep sequence up to 8KB, and the rep
>> > sequence is better for regalloc/code cache than a full-blown
>> > function call.
>>
>> Be careful with this. My recollection is that REP sequence is good for
>> any size -- for smaller size, the REP initial set up cost is too high
>> (10s of cycles), while for large size copy, it is less efficient
>> compared with library version.
>
> Well, this is based on the data from the memtest script.
> Core has a good REP implementation - it is a win from rather small blocks
> (16 bytes if I recall) and it does not need alignment.
> The library version starts to be interesting with caching hints, but I
> think up to 80KB it is still not a win for my setup (glibc-2.15).

A simple test shows that -mstringop-strategy=libcall always beats
-mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
Can you share your memtest?

thanks,

David

>> >> >
>> >> > /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix
>> >> > stall
>> >> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
>> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
>> >> > m_K6,
>> >> >
>> >> > /* X86_TUNE_USE_CLTD */
>> >> > - ~(m_PENT | m_ATOM | m_K6),
>> >> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>>
>> My change was to enable CLTD for generic. Is your change intended to
>> revert that?
>
> No, it is a merge conflict, sorry. I will update it in my tree.
>> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
>> >> >
>>
>>
>> K8 and K10 partition the flags into groups. References to flags to
>> the same group can still cause the stall -- not sure how that can be
>> handled.
>
> I believe the stalls happen only in quite special cases where a compare
> instruction combines flags from multiple instructions. GCC doesn't
> generate this type of code, so we should be safe.
>
> Honza
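
For reference, a comparison of the kind discussed above (libcall versus
rep_8byte across block sizes) could be set up roughly as follows. This is
only a minimal sketch, not the memtest script and not the simple test
mentioned in this message. The names copy_block, time_copy and
MEMBENCH_MAIN, the block sizes and the byte budget are all assumptions made
for illustration, and the inline asm barrier assumes GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: one copy site with a runtime size. noinline keeps
   the call from being folded into the timing loop. */
__attribute__((noinline))
static void copy_block(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
    /* Compiler barrier (GCC syntax) so the copy is not optimized away. */
    __asm__ volatile("" : : "r"(dst) : "memory");
}

/* Hypothetical helper: average CPU nanoseconds per copy of `size` bytes
   over `iters` runs (iters must be > 0); -1.0 if allocation fails. */
static double time_copy(size_t size, long iters)
{
    char *src = malloc(size ? size : 1);
    char *dst = malloc(size ? size : 1);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, size);
    memset(dst, 0, size);

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        copy_block(dst, src, size);
    clock_t end = clock();

    /* Sanity check: the destination must match the source. */
    assert(memcmp(dst, src, size) == 0);
    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

#ifdef MEMBENCH_MAIN
int main(void)
{
    /* Assumed sizes, chosen to bracket the 8, 16, 8KB and 80KB figures
       mentioned in the thread. */
    static const size_t sizes[] = {4, 8, 16, 64, 256, 1024, 8192, 81920};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        /* Copy roughly 256MB in total per size to keep run time bounded. */
        long iters = (long)(((size_t)1 << 28) / sizes[i]);
        printf("%6zu bytes: %.1f ns/copy\n", sizes[i],
               time_copy(sizes[i], iters));
    }
    return 0;
}
#endif
```

Assuming the file is saved as membench.c (a made-up name), it would be built
twice on an x86 target, once with
`gcc -O2 -DMEMBENCH_MAIN -mstringop-strategy=libcall membench.c` and once
with `-mstringop-strategy=rep_8byte`, and the per-size numbers compared.
clock() measures CPU time at coarse resolution, so results for the smallest
sizes include call and loop overhead and should be read with care.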