> You forgot to look at PowerPC:
>
> http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s
>
> Is that nice and small?
I had to clear/check the whole 256 Mbytes of SDRAM on a PPC system, and the fastest way I found (excluding DMA access) was to play with the level 1 (L1) data cache of the processor.

It seems that what takes the most time in a memcpy/memset is *reading* the data you are about to overwrite: after the first store, which sets the first word of a cache line, the whole cache line appears to be populated from memory (at least on the processor I was using). The bigger the word you write, the less has to be read from memory to fill the rest of the cache line, but that read is still wasted work when the whole line is going to be overwritten anyway.

The PPC has a very fast dcbz (data cache block zero) instruction to clear memory, and also dcbi (data cache block invalidate), which as far as I understand lets a cache line cover an address without first reading the memory (for when you plan to write the whole line). The code on opensolaris.org doesn't seem to use either.

It is probably difficult for the processor itself to detect whether the repetition count (%ecx for the i386 rep prefix) is big enough to decide whether or not to load the destination cache line from memory, but I wonder whether that should not be its job. Obviously, interrupts/exceptions arriving before the rep finishes are a problem.

I am not a specialist in processor design, and my results may be due to my own bugs, or may hold only on the processor I was using; I just wanted to add that to the subject.

Etienne.
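
P.S. To make the dcbz idea concrete, here is a minimal sketch in C. It is an illustration only, not the code that was measured on that board. Several things in it are assumptions: the 32-byte cache line (CACHE_LINE) fits many 32-bit PowerPC cores but not all of them, the names clear_mem and zero_line are hypothetical, the inline assembly uses GCC syntax, and the target memory must be cacheable (dcbz on cache-inhibited memory normally raises an alignment exception). On anything that is not a PowerPC the sketch falls back to a plain memset so that it still builds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: 32-byte data cache line. Adjust for the actual part;
   with a larger real line size, dcbz would zero more than 32 bytes. */
#define CACHE_LINE 32u

/* Zero one cache line starting at the line-aligned address p.
   (Hypothetical helper name.) */
static inline void zero_line(void *p)
{
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
    /* dcbz establishes the line in the data cache as all zeros
       without first fetching its old contents from memory. */
    __asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
#else
    /* Portable fallback so the sketch builds on other targets. */
    memset(p, 0, CACHE_LINE);
#endif
}

/* Clear n bytes at dst. (Hypothetical helper name.) */
void clear_mem(void *dst, size_t n)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;

    /* Head: byte stores up to the first cache-line boundary. */
    while (p < end && ((uintptr_t)p & (CACHE_LINE - 1)) != 0)
        *p++ = 0;

    /* Body: whole cache lines, one dcbz each on PowerPC. */
    while ((size_t)(end - p) >= CACHE_LINE) {
        zero_line(p);
        p += CACHE_LINE;
    }

    /* Tail: remaining bytes that do not fill a whole line. */
    while (p < end)
        *p++ = 0;
}
```

The head and tail loops are there because dcbz always acts on a whole cache line, so only the fully covered lines in the middle can use it. A memcpy could apply the same trick to each destination line it is about to overwrite completely, before storing the copied data into it.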