prodyut hazarika writes:

> glibc memxxx for powerpc are horribly inefficient. For optimal performance,
> we should use the dcbt instruction to establish the source address in cache,
> and dcbz to establish the destination address in cache. We should do
> dcbt and dcbz such that the touches happen a line ahead of the actual copy.
>
> The problem which I see is that dcbt and dcbz instructions don't work on
> non-cacheable memory (obviously!). But memxxx functions are used for both
> cached and non-cached memory. Thus this optimized memcpy should be smart
> enough to figure out that both source and destination addresses fall in
> cacheable space, and only then use the optimized dcbt/dcbz instructions.
I would be careful about adding overhead to memcpy. I found that in the
kernel, almost all calls to memcpy are for less than 128 bytes (1 cache
line on most 64-bit machines). So adding a lot of code to detect
cacheability and do prefetching is just going to slow down the common
case, which is short copies. I don't have statistics for glibc, but I
wouldn't be surprised if most copies were short there also.

The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower on cache-hot copies
than code tuned for the cache-hot case, because the cache management
instructions consume cycles and don't help when the data is already in
cache. In other words, I don't think we should be tuning the glibc
memcpy based on tests of how fast it copies multiple megabytes.

Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.

Paul.
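
To make the trade-off concrete, here is a rough sketch in C of the shape being discussed: short copies skip all cache hinting, and only larger copies prefetch one line ahead of the read. It is not the glibc or kernel implementation. The names sketch_memcpy, copy_with_prefetch and sketch_memcpy_selfcheck, and the LINE_SIZE and SHORT_COPY_MAX values, are assumptions for illustration; a real threshold would have to be measured per core. The GCC/Clang builtin __builtin_prefetch stands in for dcbt, and dcbz is left out because it is PowerPC-specific and, as the quoted message notes, does not work on non-cacheable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed values for illustration only; real ones need measurement. */
#define LINE_SIZE      128u  /* cache line size, as on most 64-bit machines */
#define SHORT_COPY_MAX 128u  /* at or below this, use no cache hints at all */

/* Hypothetical helper: copy line by line, prefetching the next source
 * line (read access, high locality) before copying the current one. */
static void copy_with_prefetch(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    while (n >= LINE_SIZE) {
        if (n >= 2 * LINE_SIZE)            /* only if a full next line exists */
            __builtin_prefetch(src + LINE_SIZE, 0, 3);
        memcpy(dst, src, LINE_SIZE);
        dst += LINE_SIZE;
        src += LINE_SIZE;
        n -= LINE_SIZE;
    }
    memcpy(dst, src, n);                   /* tail shorter than one line */
}

/* Hypothetical entry point: the common short case pays one compare. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    if (n <= SHORT_COPY_MAX)
        return memcpy(dst, src, n);
    copy_with_prefetch(dst, src, n);
    return dst;
}

/* Small self-check covering both paths; returns 1 on success. */
int sketch_memcpy_selfcheck(void)
{
    unsigned char src[1000], dst[1000];
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7u);
    memset(dst, 0, sizeof dst);
    sketch_memcpy(dst, src, 64);           /* short path */
    assert(memcmp(dst, src, 64) == 0);
    sketch_memcpy(dst, src, sizeof src);   /* prefetching path */
    return memcmp(dst, src, sizeof src) == 0;
}
```

Whether even that single compare is worth it depends on the copy-size distribution and on the core, which is why the threshold above is only a placeholder.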
