On 07.01.2016 03:36, Tom Herbert wrote:
> On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
>> Hi Tom,
>> On 05.01.2016 19:41, Tom Herbert wrote:
>>> --- /dev/null
>>> +++ b/arch/x86/lib/csum-partial_64.S
>>> @@ -0,0 +1,147 @@
>>> +/* Copyright 2016 Tom Herbert
>>> + *
>>> + * Checksum partial calculation
>>> + *
>>> +
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
> Hi Tom,
>
> On 05.01.2016 19:41, Tom Herbert wrote:
>>
>> --- /dev/null
>> +++ b/arch/x86/lib/csum-partial_64.S
>> @@ -0,0 +1,147 @@
>> +/* Copyright 2016 Tom Herbert
>> + *
>> + * Checksum partial calculation
>> + *
>> + * __wsum csum
Hi Tom,
On 05.01.2016 19:41, Tom Herbert wrote:
> --- /dev/null
> +++ b/arch/x86/lib/csum-partial_64.S
> @@ -0,0 +1,147 @@
> +/* Copyright 2016 Tom Herbert
> + *
> + * Checksum partial calculation
> + *
> + * __wsum csum_partial(const void *buff, int len, __wsum sum)
> + *
> + * Computes the checksum of a memory b
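For readers following the thread, the semantics of `csum_partial` can be sketched in portable C. This is only an illustrative equivalent of what the function computes (the one's-complement sum of the buffer folded into 32 bits), not the patch's assembly; the function names here are mine, and the sketch assumes a little-endian machine as on x86-64.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fold a 64-bit one's-complement accumulator down to 32 bits. */
static uint32_t csum_fold64(uint64_t acc)
{
    acc = (acc & 0xffffffffULL) + (acc >> 32);
    acc = (acc & 0xffffffffULL) + (acc >> 32);
    return (uint32_t)acc;
}

/* Illustrative C analogue of csum_partial(): add the buffer's contents
 * into 'sum' with end-around carries.  The patch does the inner loop
 * with adcq chains in assembly. */
static uint32_t csum_partial_sketch(const void *buff, size_t len, uint32_t sum)
{
    const uint8_t *p = buff;
    uint64_t acc = sum;

    while (len >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);          /* unaligned-safe 64-bit load */
        acc += v;
        acc += (acc < v);          /* end-around carry, like adc */
        p += 8;
        len -= 8;
    }
    if (len >= 4) {
        uint32_t v;
        memcpy(&v, p, 4);
        acc += v;
        acc += (acc < v);
        p += 4;
        len -= 4;
    }
    if (len >= 2) {
        uint16_t v;
        memcpy(&v, p, 2);
        acc += v;
        p += 2;
        len -= 2;
    }
    if (len)
        acc += *p;                 /* trailing odd byte, low half */
    return csum_fold64(acc);
}
```

Because one's-complement addition is associative and commutative, folding the 64-bit accumulator at the end gives the same 16-bit result as summing 16-bit words directly, which is what makes the wide adcq chains legal.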
Tom Herbert writes:
> Also, we don't do anything special for alignment, unaligned
> accesses on x86 do not appear to be a performance issue.
This is not true on Atom CPUs.
Also, on most CPUs there is still a larger penalty when crossing
cache lines.
> Verified correctness by testing arbitrary l
On Wed, 2016-01-06 at 14:49 +, David Laight wrote:
> Someone also pointed out that the code is memory limited (dual add
> chains making no difference), so why is it unrolled at all?
Because it matters if the data is already present in CPU caches.
So why not unroll if it helps in some situ
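Eric's point about unrolling can be illustrated with a sketch. The names below are mine, not the patch's: a 4-way unrolled accumulation loop retires more loads per branch when the buffer is already in L1, while for cold data memory latency dominates either way, which is why the benchmark result depends on cache residency.

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit add with end-around carry, the C analogue of adcq. */
static uint64_t add64_carry(uint64_t acc, uint64_t v)
{
    acc += v;
    return acc + (acc < v);
}

/* Illustrative 4-way unrolled accumulation loop (names mine). */
static uint64_t sum_words_unrolled(const uint64_t *w, size_t n, uint64_t acc)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc = add64_carry(acc, w[i]);
        acc = add64_carry(acc, w[i + 1]);
        acc = add64_carry(acc, w[i + 2]);
        acc = add64_carry(acc, w[i + 3]);
    }
    for (; i < n; i++)              /* remainder, one word at a time */
        acc = add64_carry(acc, w[i]);
    return acc;
}
```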
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and th
On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> From: Eric Dumazet
> > Sent: 05 January 2016 22:19
> > To: Tom Herbert
> > You might add a comment telling the '4' comes from length of 'adcq
> > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > 'adcq 0*8(%rdi),%ra
From: Eric Dumazet
> Sent: 05 January 2016 22:19
> To: Tom Herbert
> You might add a comment telling the '4' comes from length of 'adcq
> 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> 'adcq 0*8(%rdi),%rax' is using 3 bytes instead.
>
> We also could use .byte 0x48, 0x1
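The instruction-length discussion above is about computing a jump target into the middle of an unrolled adcq chain, which only works if every entry in the chain has the same byte length (hence padding the 3-byte `adcq 0*8(%rdi),%rax` with a nop). A C analogue of "jump into the middle and fall through" is a Duff's-device-style switch; this sketch and its names are mine, not the patch's.

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit add with end-around carry, the C analogue of adcq. */
static uint64_t add64_carry(uint64_t acc, uint64_t v)
{
    acc += v;
    return acc + (acc < v);
}

/* Sum the n (< 8) trailing 64-bit words of a buffer by selecting an
 * entry point and falling through the remaining adds, mirroring the
 * assembly's computed jump into a fixed-stride unrolled chain. */
static uint64_t sum_tail_words(const uint64_t *w, size_t n, uint64_t acc)
{
    switch (n) {
    case 7: acc = add64_carry(acc, w[n - 7]); /* fall through */
    case 6: acc = add64_carry(acc, w[n - 6]); /* fall through */
    case 5: acc = add64_carry(acc, w[n - 5]); /* fall through */
    case 4: acc = add64_carry(acc, w[n - 4]); /* fall through */
    case 3: acc = add64_carry(acc, w[n - 3]); /* fall through */
    case 2: acc = add64_carry(acc, w[n - 2]); /* fall through */
    case 1: acc = add64_carry(acc, w[n - 1]); /* fall through */
    case 0: break;
    }
    return acc;
}
```

In the assembly version the dispatch is just `base + n * entry_size`, which is why a single odd-sized instruction breaks the table.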
On Wed, 2016-01-06 at 00:35 +0100, Hannes Frederic Sowa wrote:
>
> Tom, did you have a look if it makes sense to add a second carry
> addition train with the adcx instruction, which does not signal carry
> via the carry flag but with the overflow flag? This instruction should
> not have any de
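The adcx/adox idea Hannes raises is that the two instructions carry through different flags (adcx uses CF, adox uses OF), so two addition chains can be interleaved in one loop body without a flag dependency between them. The sketch below, with names of my own, only shows the dependency-breaking structure in portable C with two independent accumulators; the actual throughput win requires the ADX instructions themselves.

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit add with end-around carry, the C analogue of adcq. */
static uint64_t add64_carry(uint64_t acc, uint64_t v)
{
    acc += v;
    return acc + (acc < v);
}

/* Two independent carry chains, merged at the end (names mine).
 * Chain 1 models the adcx/CF chain, chain 2 the adox/OF chain. */
static uint64_t sum_words_two_chains(const uint64_t *w, size_t n, uint64_t seed)
{
    uint64_t a = seed, b = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        a = add64_carry(a, w[i]);       /* chain 1 */
        b = add64_carry(b, w[i + 1]);   /* chain 2 */
    }
    if (i < n)
        a = add64_carry(a, w[i]);       /* odd leftover word */
    return add64_carry(a, b);           /* fold the chains together */
}
```

Associativity of one's-complement addition is what makes splitting and re-merging the chains legal.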
On Tue, 2016-01-05 at 17:10 -0800, H. Peter Anvin wrote:
> Apparently "adcq.d8" will do The Right Thing for this.
Nice trick ;)
On 01/05/2016 02:18 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum whe
Hi,
On 05.01.2016 19:41, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNE
On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or afte
Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
conversion.
This implementation is sim
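The `skb_postpull_rcsum` case the changelog mentions relies on the fact that one's-complement checksums can be updated incrementally: when header bytes are pulled off the front of a CHECKSUM_COMPLETE packet, the checksum of just the pulled bytes is subtracted from `skb->csum` instead of recomputing the whole packet. A hedged sketch of that arithmetic, with function names of my own rather than the kernel's helpers:

```c
#include <stdint.h>

/* 32-bit one's-complement add with end-around carry (name mine). */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < b);        /* wrap the carry back in */
}

/* One's-complement subtract: adding the bitwise complement of b
 * undoes a prior csum_add32 of b (name mine). */
static uint32_t csum_sub32(uint32_t a, uint32_t b)
{
    return csum_add32(a, ~b);
}
```

This is why the patch targets small lengths: the common fast path checksums only a short pulled header, not the full frame.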