Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

Alexander Duyck Thu, 04 Feb 2016 13:10:42 -0800

On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert <t...@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight <david.lai...@aculab.com> wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed to have a loop buffer so that it can
>>> > basically loop on already decoded instructions.  Normally it is only
>>> > something like 64 or 128 bytes in size though.  You might find that
>>> > reducing this loop to that smaller size may improve the performance
>>> > for larger payloads.
>>>
>>> I saw 128 to be better in my testing. For large packets this loop does
>>> all the work. I see performance dependent on the amount of loop
>>> overhead, i.e. we got it down to two non-adcq instructions but it is
>>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>>> since we only need to do a single call in the jump table and no trip
>>> through the loop.
>>
>> But one of your 'loop overhead' instructions is 'loop'.
>> Look at http://www.agner.org/optimize/instruction_tables.pdf
>> you don't want to be using 'loop' on Intel CPUs.
>>
> I'm not following. We can replace loop with decl %ecx and jg, but why
> is that better?


Because the 'loop' instruction takes something like 7 cycles, whereas
the decl/jg approach takes 2 or 3.  That is probably one of the reasons
things look so much better with the loop unrolled.
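
To make the comparison concrete, here is a rough sketch of the two
loop-control styles around a single adcq per iteration.  This is not
the code from the patch: the function names (sum_ref, sum_loop_insn,
sum_dec_jg, check_variants_agree) are made up for illustration, the
loop is deliberately not unrolled, and it assumes GCC or clang inline
asm on x86_64 (other targets fall back to plain C).  Both variants rely
on lea and dec leaving the carry flag alone so the adcq chain survives
across iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable reference: 64-bit sum with end-around carry. */
static uint64_t sum_ref(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        if (sum < p[i])
            sum++;              /* fold the carry back in */
    }
    return sum;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Variant 1: loop control via the 'loop' instruction (count in %rcx). */
static uint64_t sum_loop_insn(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    if (n == 0)
        return 0;
    __asm__ volatile(
        "clc\n"
        "1:\n\t"
        "adcq (%[p]), %[sum]\n\t"
        "leaq 8(%[p]), %[p]\n\t"   /* lea does not touch flags */
        "loop 1b\n\t"              /* dec %rcx, branch if non-zero */
        "adcq $0, %[sum]\n\t"      /* fold in the final carry */
        : [sum] "+r"(sum), [p] "+r"(p), "+c"(n)
        :
        : "cc", "memory");
    return sum;
}

/* Variant 2: loop control via dec + jg (dec preserves CF). */
static uint64_t sum_dec_jg(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    if (n == 0)
        return 0;
    __asm__ volatile(
        "clc\n"
        "1:\n\t"
        "adcq (%[p]), %[sum]\n\t"
        "leaq 8(%[p]), %[p]\n\t"
        "decq %[n]\n\t"
        "jg 1b\n\t"
        "adcq $0, %[sum]\n\t"
        : [sum] "+r"(sum), [p] "+r"(p), [n] "+r"(n)
        :
        : "cc", "memory");
    return sum;
}
#else
/* Non-x86_64 fallback so the sketch still builds. */
static uint64_t sum_loop_insn(const uint64_t *p, size_t n) { return sum_ref(p, n); }
static uint64_t sum_dec_jg(const uint64_t *p, size_t n) { return sum_ref(p, n); }
#endif

/* Sanity check: both loop styles must match the reference sum. */
static void check_variants_agree(const uint64_t *p, size_t n)
{
    uint64_t want = sum_ref(p, n);
    assert(sum_loop_insn(p, n) == want);
    assert(sum_dec_jg(p, n) == want);
}
```

The two asm bodies differ only in the loop-control instructions, so
timing them against each other over a large buffer should isolate the
per-iteration overhead being discussed here.  Actual cycle counts will
depend on the microarchitecture.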

- Alex
