> I would be careful about adding overhead to memcpy. I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines). So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies. I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.
You are right. For small copies, it is not advisable. What I did was put a small check at the beginning of memcpy: if the copy is less than 5 cache lines, I don't do dcbt/dcbz. Thus we see a big jump for copies of more than 5 cache lines. The overhead is only 2 assembly instructions (compare the number of bytes, followed by a jump).

One question: how can we quickly determine whether both the source and destination address ranges fall in cacheable memory? The user can mmap a region of memory as non-cacheable and then call memcpy with that address. The optimized version must quickly determine that dcbt/dcbz must not be used in this case. I don't know what would be a good way to achieve this.

Regards,
Prodyut Hazarika

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
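For readers following along, the size-threshold dispatch described above can be sketched in C. This is only an illustration of the idea, not the actual PowerPC assembly: the names `memcpy_dispatch`, `copy_simple`, and `copy_with_prefetch` are hypothetical, the 128-byte line size is taken from the quoted message, and the prefetching path here just falls back to `memcpy` where the real code would issue dcbt/dcbz per cache line.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE_SIZE 128                      /* assumption: 128-byte lines, per the quoted text */
#define PREFETCH_THRESHOLD (5 * CACHE_LINE_SIZE) /* the "5 cache lines" cutoff from the message */

/* Short-copy path: no cache hints, minimal overhead. */
static void *copy_simple(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Long-copy path. In the real PowerPC version, the copy loop would issue
 * dcbt (touch) on each source line and dcbz (zero) on each destination
 * line before moving the data; here we just delegate to memcpy. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* The entire added overhead for short copies is this compare-and-branch,
 * i.e. the "2 assembly instructions" mentioned above. */
void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < PREFETCH_THRESHOLD)
        return copy_simple(dst, src, n);
    return copy_with_prefetch(dst, src, n);
}
```

Note that this sketch does nothing about the cacheability question: it dispatches purely on size, so a non-cacheable mmap'd destination would still reach the dcbt/dcbz path in a real implementation.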