> > libcall is not faster up to 8KB to rep sequence that is better for
> > regalloc/code
> > cache than fully blowin function call.
>
> Be careful with this. My recollection is that REP sequence is good for
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.

Well, this is based on the data from the memtest script.
Core has a good REP implementation - it is a win from rather small blocks
(16 bytes if I recall) and it does not need alignment.
The library version starts to be interesting with caching hints, but I think
that up to 80KB it is still not a win for my setup (glibc-2.15).

> >> >
> >> > /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >> > m_K6,
> >> >
> >> > /* X86_TUNE_USE_CLTD */
> >> > - ~(m_PENT | m_ATOM | m_K6),
> >> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?

No, it is a merge conflict, sorry. I will update it in my tree.

> > Skipping inc/dec is to avoid partial flag stall happening on P4 only.
> >> >
> >
> K8 and K10 partitions the flags into groups. References to flags to
> the same group can still cause the stall -- not sure how that can be
> handled.

I believe the stalls happen only in quite special cases where a compare
instruction combines flags from multiple instructions. GCC doesn't generate
this type of code, so we should be safe.

Honza
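
For anyone who wants to try the REP-versus-libcall comparison discussed
above, here is a minimal sketch. It is not the memtest script referred to in
this thread. The names rep_copy, lib_copy, time_copy and compare_copies are
hypothetical, the 256-byte size and the 64MB per-size byte budget are
arbitrary choices, and the inline asm assumes a GCC-compatible compiler and
an ABI that keeps the direction flag clear. On non-x86 targets rep_copy falls
back to a byte loop, so the numbers mean nothing there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Copy n bytes with an inline REP MOVSB on x86 (assumes DF is clear);
   plain byte loop on other targets. */
void rep_copy(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}

/* Wrapper so the library memcpy has the same signature as rep_copy. */
void lib_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Run fn iters times (iters >= 1) and return elapsed CPU time in seconds. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fn(dst, src, n);
    clock_t end = clock();
    assert(memcmp(dst, src, n) == 0);   /* sanity check: the copy happened */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Print timings for a few block sizes; 16, 8192 and 81920 bytes echo the
   sizes mentioned in the thread, 256 is an arbitrary extra point. */
void compare_copies(void)
{
    static const size_t sizes[] = { 16, 256, 8192, 81920 };
    const size_t budget = (size_t)64 << 20;   /* bytes copied per size */

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        size_t n = sizes[i];
        long iters = (long)(budget / n);
        unsigned char *src = malloc(n);
        unsigned char *dst = malloc(n);
        if (!src || !dst) {
            free(src);
            free(dst);
            return;
        }
        memset(src, 0xA5, n);
        double rep = time_copy(rep_copy, dst, src, n, iters);
        memset(dst, 0, n);
        double lib = time_copy(lib_copy, dst, src, n, iters);
        printf("%6zu bytes: rep %.3fs  libcall %.3fs\n", n, rep, lib);
        free(src);
        free(dst);
    }
}
```

Call compare_copies() from main to get the table. Building with
-fno-builtin-memcpy is probably needed so that GCC emits a real library call
in lib_copy instead of expanding the copy inline. The results will depend
heavily on the CPU and the glibc version, so treat them as a rough indication
only.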