I have another idea for SSE, and this one is far safer:

only use SSE prefetch, and leave the string operations for the actual
copy. The prefetch instructions only prefetch; they don't touch the
SSE registers, so there are neither reentrancy nor interrupt problems.
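
The same idea as a user-space sketch (assuming a compiler that
provides _mm_prefetch via <xmmintrin.h>; the function name is just
for illustration):

#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <string.h>

/* Issue non-temporal prefetch hints for the source, then copy
 * with plain string operations. No XMM register is written, so
 * no FPU/SSE state has to be saved or restored around this. */
static void copy_with_prefetch(void *dst, const void *src, size_t len)
{
        size_t i;

        for (i = 0; i < len; i += 64) {
                _mm_prefetch((const char *)src + i, _MM_HINT_NTA);
                _mm_prefetch((const char *)src + i + 32, _MM_HINT_NTA);
        }
        memcpy(dst, src, len);  /* the actual copy: ordinary string ops */
}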

I tried the attached hack^H^H^H^Hpatch, and read(fd, buf, 4000000) from
user space got 7% faster (from 264768842 cycles to 246303748 cycles,
single CPU, noacpi, 'linux -b', fastest time from several thousand
runs).
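
A sketch of that kind of measurement (not the exact harness: rdtsc
around the call, keeping the fastest run to filter out interrupts
and other noise):

#include <stdint.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

/* Fastest-of-N cycle count for one read() call. */
static uint64_t time_read(int fd, void *buf, size_t len, int runs)
{
        uint64_t best = ~(uint64_t)0;
        int i;

        for (i = 0; i < runs; i++) {
                uint64_t t0, t1;
                lseek(fd, 0, SEEK_SET); /* re-read the same data */
                t0 = rdtsc();
                read(fd, buf, len);
                t1 = rdtsc();
                if (t1 - t0 < best)
                        best = t1 - t0;
        }
        return best;
}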

The reason this works is simple:

The Intel Pentium III and Pentium 4 have hardwired "fast string copy"
operations that invalidate whole cache lines during the write
(documented in the most obvious place: multiprocessor management,
memory ordering).

The result is a very fast write, but the read is still slow; the
prefetch hides exactly that read latency.
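
The copy core of __copy_to_user() is essentially a "rep; movsl"; a
simplified sketch (the real routine also has exception fixups and
tail-byte handling):

/* "rep movsl" triggers the CPU's fast string path: destination
 * cache lines are allocated wholesale, so the stores are cheap.
 * The loads from the source still miss one line at a time, and
 * that is the latency the prefetchnta pass hides. */
static void fast_string_copy(void *to, const void *from, unsigned long bytes)
{
        unsigned long dwords = bytes / 4;       /* assume a multiple of 4 */

        __asm__ __volatile__(
                "rep; movsl"
                : "+S" (from), "+D" (to), "+c" (dwords)
                : : "memory");
}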

--
        Manfred
--- 2.4/mm/filemap.c    Wed Feb 14 10:51:42 2001
+++ build-2.4/mm/filemap.c      Wed Feb 14 22:11:44 2001
@@ -1248,6 +1248,24 @@
                size = count;
 
        kaddr = kmap(page);
+       if (size > 128) {
+               int i;
+               /* touch the source once so its TLB entry is present:
+                * a prefetch that misses the TLB is silently dropped */
+               __asm__ __volatile__(
+                       "movl (%1), %0\n\t"
+                       : "=r" (i)
+                       : "r" (kaddr+offset));
+               /* pull the source into the cache, two 32-byte lines
+                * per iteration, without polluting the caches (NTA) */
+               for (i = 0; i < size; i += 64) {
+                       __asm__ __volatile__(
+                               "prefetchnta (%1, %0)\n\t"
+                               "prefetchnta 32(%1, %0)\n\t"
+                               : /* no output */
+                               : "r" (i), "r" (kaddr+offset));
+               }
+       }
        left = __copy_to_user(desc->buf, kaddr + offset, size);
        kunmap(page);
        
