From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10: adcq 0(%rdi,%rcx,8),%rax
> > inc %rcx
> > jnz 10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.
I presume you
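For reference, the quoted loop can be written out as GNU C inline assembly roughly as below. This is only a sketch with illustrative names, not code from the thread: the index starts at -words and counts up to zero, so inc/jnz both advances the index and terminates the loop, and since inc does not touch CF the adc carry chain survives across iterations.

#include <stddef.h>

/*
 * Sketch only: sum 64-bit words with a single adcq per iteration.
 * The buffer is addressed from its end with a negative index, so
 * "inc; jnz" advances the index and ends the loop in one go; inc
 * leaves CF alone, so the carry chain through adcq is preserved.
 */
static unsigned long csum_words(const unsigned long *buf, size_t words)
{
	const unsigned long *end = buf + words;
	long i = -(long)words;
	unsigned long sum = 0;

	if (!words)
		return 0;

	asm("clc\n"
	    "1:	adcq 0(%[end],%[i],8), %[sum]\n"
	    "	incq %[i]\n"
	    "	jnz 1b\n"
	    "	adcq $0, %[sum]"	/* fold the final carry back in */
	    : [sum] "+r" (sum), [i] "+r" (i)
	    : [end] "r" (end)
	    : "cc", "memory");
	return sum;
}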
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
> Useful for IE and DIR.
I don't quite follow. It allows the value to be tested without waiting
for pending updates *of other bits* to complete.
Obviously, the update of th
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-ker...@vger.kernel.org; li...@horizon.com;
> netdev@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register wo
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.
The issue is with flags register bits that are *not* modified by
an instruc
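For context, adcx reads and writes only CF while adox reads and writes only OF, which is what allows two add chains to be interleaved without serializing on a single flag bit. A minimal sketch of the idea follows (illustrative only, not from the thread; it needs an ADX-capable CPU, Broadwell or later, and an assembler that knows the mnemonics). Note that almost any ordinary arithmetic instruction placed between these would clobber CF and/or OF, which is exactly the sort of flags dependency under discussion.

/*
 * Sketch: sum four 64-bit words using two interleaved carry chains,
 * one through CF (adcx) and one through OF (adox), then combine them.
 * xor clears both CF and OF up front; each chain's final carry is
 * folded into its own accumulator before the two are merged.
 */
static unsigned long sum4_adx(const unsigned long *p)
{
	unsigned long a, b;
	unsigned long zero = 0;

	asm("xorq %[a], %[a]\n\t"	/* a = 0, CF = 0, OF = 0 */
	    "movq %[a], %[b]\n\t"	/* b = 0, flags untouched */
	    "adcxq 0(%[p]), %[a]\n\t"	/* chain A: uses only CF */
	    "adoxq 8(%[p]), %[b]\n\t"	/* chain B: uses only OF */
	    "adcxq 16(%[p]), %[a]\n\t"
	    "adoxq 24(%[p]), %[b]\n\t"
	    "adcxq %[zero], %[a]\n\t"	/* close chain A: fold its last CF */
	    "adoxq %[zero], %[b]\n\t"	/* close chain B: fold its last OF */
	    "addq %[b], %[a]\n\t"	/* merge the two partial sums */
	    "adcq $0, %[a]"
	    : [a] "=&r" (a), [b] "=&r" (b)
	    : [p] "r" (p), [zero] "r" (zero),
	      "m" (*(const unsigned long (*)[4])p)
	    : "cc");
	return a;
}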
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> > instructions (
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
If the carry chain
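The serialization being discussed is inherent to a single adc chain: each adcq has to wait for the CF produced by the previous one, regardless of unrolling. One way to sidestep it without adc at all is to accumulate into wider integers so the adds form independent dependency chains; a sketch using the GCC/Clang __int128 extension (illustrative names, assumes an even word count):

#include <stddef.h>

/*
 * Sketch: accumulate 64-bit words into two 128-bit accumulators so the
 * additions are independent and can overlap, then fold the carries that
 * collected in the high halves back in (end-around carry), which is
 * congruent to what an adcq chain computes.
 */
static unsigned long csum_words_wide(const unsigned long *buf, size_t words)
{
	unsigned __int128 s0 = 0, s1 = 0;
	unsigned long lo, hi;
	size_t i;

	for (i = 0; i < words; i += 2) {
		s0 += buf[i];		/* two independent dependency chains */
		s1 += buf[i + 1];
	}
	s0 += s1;
	lo = (unsigned long)s0;
	hi = (unsigned long)(s0 >> 64);
	lo += hi;			/* fold accumulated carries back in */
	if (lo < hi)
		lo++;			/* end-around carry */
	return lo;
}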
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics unit
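To make the "arithmetic instead of control flow" point concrete, here is a sketch (not the code from the thread) of the usual trick for the trailing 1-7 bytes of a buffer: load a full 8-byte word and mask it by length, with the mask coming from a small data table rather than a switch. As discussed elsewhere in the thread, the full-word load itself must still be known not to run into an unmapped page.

#include <stddef.h>

/*
 * Sketch, little-endian only: select the low "len" bytes of an 8-byte
 * load with a table lookup instead of branching on len (0..7).  Note
 * the load always reads 8 bytes, so the bytes past "len" must at least
 * be readable.
 */
static const unsigned long tail_mask[8] = {
	0x0000000000000000UL, 0x00000000000000ffUL,
	0x000000000000ffffUL, 0x0000000000ffffffUL,
	0x00000000ffffffffUL, 0x000000ffffffffffUL,
	0x0000ffffffffffffUL, 0x00ffffffffffffffUL,
};

static unsigned long load_tail(const void *buff, size_t len)
{
	return *(const unsigned long *)buff & tail_mask[len];
}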
* Tom Herbert wrote:
> Thanks for the explanation and sample code. Expanding on your example, I added a
> switch statement to perform the function (code below).
So I think your new switch() based testcase is broken in a subtle way.
The problem is that in your added testcase GCC effectively o
* Tom Herbert wrote:
> [] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]
So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual fun
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);
Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is
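The conservative alternative, when the bytes past "len" cannot be guaranteed readable, is to assemble the tail byte by byte; a small sketch with illustrative names (little-endian byte positions, same value a masked 8-byte load would produce for len <= 8, but no access beyond buff[len-1]):

#include <stddef.h>

/* Sketch: build the trailing partial word without reading past the end. */
static unsigned long load_tail_bytes(const unsigned char *buff, size_t len)
{
	unsigned long v = 0;
	size_t i;

	for (i = 0; i < len; i++)
		v |= (unsigned long)buff[i] << (8 * i);
	return v;
}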
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.
Ok
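One way to read that idea in C terms (a sketch, not Linus' actual code): for a buffer starting at an odd address, load 8 bytes from one byte before the buffer and mask off the stray low byte. On little-endian the buffer's first bytes then arrive already shifted up by 8 bits, courtesy of the load unit's byte shifter, which is what the odd-alignment case of a 16-bit ones'-complement sum wants; the byte at buff[-1] must of course be readable.

/*
 * Sketch: load starting one byte early so the hardware effectively does
 * the "shift up by 8 bits" for the odd-alignment case; the byte that
 * does not belong to the buffer is masked off.  Little-endian only.
 */
static unsigned long load_odd_start(const unsigned char *buff)
{
	return *(const unsigned long *)(buff - 1) & ~0xffUL;
}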
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which using switch() to handle alignment, excess
> length less than 8 bytes, and th
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in assembly
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
>
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:
Why wasn't that done with C code instead of asm with odd numerical targets?
It seems likely that the real issue is avoiding the short loops (that
will cause branch pre
On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed to have a loop buffer so that it can
>>> > basic
On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote:
> From: Tom Herbert
> ...
>> > If nothing else reducing the size of this main loop may be desirable.
>> > I know the newer x86 is supposed to have a loop buffer so that it can
>> > basically loop on already decoded instructions. Normally it is o
On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert wrote:
> On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck wrote:
>> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>>> Implement assembly routine for csum_partial for 64 bit x86. This
>>> primarily speeds up checksum calculation for smaller len
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpu
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpu
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > + /* Check length */
>> > +10: cmpl $8, %esi
>> > + jg 30f
>> > + jl 20f
>> > +
>> > + /* Exactly 8 bytes length */
>> > + addl
On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after
From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions. Normally it is only
> > something like 64 or 128 bytes in size though. You
On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> + /* Main loop */
>>> +50: adcq 0*8(%rdi),%rax
>>> + adcq 1*8(%rdi),%rax
>>> + adcq 2*8(%rdi),%rax
>>
On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote:
> From: Tom Herbert
>> Sent: 03 February 2016 19:19
> ...
>> + /* Main loop */
>> +50: adcq 0*8(%rdi),%rax
>> + adcq 1*8(%rdi),%rax
>> + adcq 2*8(%rdi),%rax
>> + adcq 3*8(%rdi),%rax
>> + adcq 4*8(%rdi),%rax
>>
From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> + /* Main loop */
> +50: adcq 0*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> +
* Ingo Molnar wrote:
> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> > +
> > + /* Check length */
> > +10: cmpl $8, %esi
> > + jg 30f
> > + jl 20f
> > +
> > + /* Exactly 8 bytes length */
> > + addl (%rdi), %eax
> > + adcl 4(%rdi), %eax
> > + RETURN
> > +
* Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
conversion.
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
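For context on where those small lengths come from: with CHECKSUM_COMPLETE, pulling headers off an skb means the checksum of the pulled bytes has to be subtracted from the device-supplied total, which is a csum_partial() call over just a few tens of bytes. Roughly, as a paraphrase of the CHECKSUM_COMPLETE path of skb_postpull_rcsum() rather than verbatim kernel code:

#include <linux/skbuff.h>
#include <net/checksum.h>

/*
 * Paraphrase of what skb_postpull_rcsum() does for CHECKSUM_COMPLETE:
 * "start"/"len" describe the headers just pulled, so csum_partial()
 * runs over a short buffer, which is the case the patch targets.
 */
static inline void postpull_rcsum_complete(struct sk_buff *skb,
					   const void *start, unsigned int len)
{
	skb->csum = csum_sub(skb->csum, csum_partial(start, len, 0));
}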
29 matches