The experiment showed that the following prefetching could reduce
csum_partial_copy_generic() execution time by 20-30%. This suggests that
the CPU incurs heavy cache misses during checksum calculation, which is
also supported by PMC (performance monitoring counter) measurements.
So I think the following statement is not fully acceptable.

Linus Torvalds writes:
> Proof: the data to be sent out is in RAM. In fact, often it is cached
> in the CPU these days. In order to start sending out the packet, the
> smart card has to move all of the data from RAM/cache over the bus to
> the card. It can only start actually sending after that. Cost: bus
> speed to copy it over.

I posted the patch a few times but received no response. I think now is
a good time to repost the code. Prefetching is implemented with simple
load instructions, so no dedicated instruction is needed and it is
applicable to the Pentium Pro or later. I think the code can save some
part of the overhead in the current copy-full implementation.

diff -rc linux-2.3.99-pre8/arch/i386/lib/checksum.S linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S
*** linux-2.3.99-pre8/arch/i386/lib/checksum.S  Thu Mar 23 07:23:54 2000
--- linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S       Mon May 15 22:45:41 2000
***************
*** 394,419 ****
        movl ARGBASE+8(%esp),%edi       #dst
        movl ARGBASE+12(%esp),%ecx      #len
        movl ARGBASE+16(%esp),%eax      #sum
!       movl %ecx, %edx
        movl %ecx, %ebx
        shrl $6, %ecx
        andl $0x3c, %ebx
        negl %ebx
        subl %ebx, %esi
        subl %ebx, %edi
        lea 3f(%ebx,%ebx), %ebx
        testl %esi, %esi
        jmp *%ebx
  1:    addl $64,%esi
        addl $64,%edi
        ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
        ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
        ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
        ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
  3:    adcl $0,%eax
        dec %ecx
        jge 1b
! 4:    andl $3, %edx
        jz 7f
        cmpl $2, %edx
        jb 5f
--- 394,425 ----
        movl ARGBASE+8(%esp),%edi       #dst
        movl ARGBASE+12(%esp),%ecx      #len
        movl ARGBASE+16(%esp),%eax      #sum
! #     movl %ecx, %edx
        movl %ecx, %ebx
+       movl %esi, %edx
        shrl $6, %ecx
        andl $0x3c, %ebx
        negl %ebx
        subl %ebx, %esi
        subl %ebx, %edi
+       lea -1(%esi),%edx
+       andl $-32,%edx
        lea 3f(%ebx,%ebx), %ebx
        testl %esi, %esi
        jmp *%ebx
  1:    addl $64,%esi
        addl $64,%edi
+       SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
        ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
        ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
        ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
        ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
  3:    adcl $0,%eax
+       addl $64, %edx
        dec %ecx
        jge 1b
! 4:    movl ARGBASE+12(%esp),%edx      #len
!       andl $3, %edx
        jz 7f
        cmpl $2, %edx
        jb 5f

--
Computer Systems Laboratory, Fujitsu Labs.
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
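
For readers who do not follow i386 assembly, the idea of prefetching with plain loads can be sketched in C. This is only an illustration under stated assumptions, not a translation of the patch: the helper names touch() and csum_copy_prefetch() are hypothetical, the 32-byte cache line size is assumed from the "andl $-32" mask in the patch, and the bounds checks stand in for the SRC() exception handling that protects the assembly loads. Each pass over a 64-byte chunk first reads one byte from each 32-byte half of the next chunk, so that the cache misses can overlap with the checksum and copy work on the current chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte cache lines, matching the patch's "andl $-32". */
#define CACHE_LINE 32

/* Force a real one-byte load; the loaded value is discarded. */
static inline void touch(const uint8_t *p)
{
    (void)*(const volatile uint8_t *)p;
}

/*
 * Hypothetical helper (not the kernel routine): copy len bytes from src
 * to dst and return a 32-bit partial sum of the data added to sum,
 * with carries folded back in.
 */
static uint32_t csum_copy_prefetch(const uint8_t *src, uint8_t *dst,
                                   size_t len, uint32_t sum)
{
    uint64_t acc = sum;
    size_t i = 0;

    assert(len == 0 || (src != NULL && dst != NULL));

    /* Main loop: 64 bytes per pass, like the unrolled ROUND() block. */
    while (len - i >= 64) {
        /* Touch each 32-byte half of the next chunk, if it exists. */
        if (i + 64 < len)
            touch(src + i + 64);
        if (i + 64 + CACHE_LINE < len)
            touch(src + i + 64 + CACHE_LINE);

        for (size_t j = 0; j < 64; j += 4) {
            uint32_t w;
            memcpy(&w, src + i + j, sizeof w);
            memcpy(dst + i + j, &w, sizeof w);
            acc += w;
        }
        i += 64;
    }

    /* Remaining whole 32-bit words. */
    for (; len - i >= 4; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
        acc += w;
    }

    /* Trailing 1-3 bytes, zero-padded to a full word. */
    if (i < len) {
        uint32_t w = 0;
        memcpy(&w, src + i, len - i);
        memcpy(dst + i, src + i, len - i);
        acc += w;
    }

    /* Fold the carries back into the low 32 bits (end-around carry). */
    acc = (acc & 0xffffffffu) + (acc >> 32);
    acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}
```

The volatile read keeps the compiler from discarding the otherwise unused load, which plays the role of the "movb ...,%bl" instructions in the patch. Whether this helps on a given CPU would have to be measured; the 20-30% figure above applies to the assembly patch, not to this sketch.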
