RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

David Laight Wed, 10 Feb 2016 03:43:37 -0800

From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-ker...@vger.kernel.org; li...@horizon.com; netdev@vger.kernel.org
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register work. I'm sure I remember
> > issues with 'false dependencies' against the flags.
> 
> The issue is with flags register bits that are *not* modified by
> an instruction.  If the register is treated as a monolithic entity,
> then the previous values of those bits must be considered an *input*
> to the instruction, forcing serialization.
> 
> The first step in avoiding this problem is to consider the rarely-modified
> bits (interrupt, direction, trap, etc.) to be a separate logical register
> from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
> which are updated by almost every instruction.
> 
> An arithmetic instruction overwrites the arithmetic flags (so it's only
> a WAW dependency which can be broken by renaming) and doesn't touch the
> rarely-modified control flags (so no dependency).
> 
> However, on x86 even the arithmetic flags aren't updated consistently.
> The biggest offenders are the (very common!) INC/DEC instructions,
> which update all of the arithmetic flags *except* the carry flag.
> 
> Thus, the carry flag is also renamed separately on every superscalar
> x86 implementation I've ever heard of.


Ah, that is the little fact I'd forgotten.
...
> Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
> it was reasonable to commit to always renaming CF and OF separately.

Separate renaming allows:
1) The value to be tested without waiting for pending updates to complete.
   Useful for IE and DIR.
2) Instructions that modify almost all the flags to execute without
   waiting for a previous instruction to complete.
   So separating 'carry' allows inc/dec to execute without waiting
   for previous arithmetic to complete.

The latter should remove the dependency (both ways) between 'adc' and
'dec, jnz' in a checksum loop.
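
As a rough illustration of why that works architecturally, here is a small C model of the two flags involved. It is only a sketch: the names (struct flags, model_adc, model_dec, model_csum) are made up for this example and come from neither the patch nor the kernel, and it models the architectural flag semantics only, not renaming or timing. The point is that 'dec' writes Z but never C, so the carry chain between successive 'adc' instructions passes straight through the loop counter update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two flags that matter for this loop. */
struct flags {
    bool cf;    /* carry flag */
    bool zf;    /* zero flag */
};

/* adc: dst + src + CF; CF becomes the carry out, ZF reflects the result. */
static uint64_t model_adc(struct flags *f, uint64_t dst, uint64_t src)
{
    uint64_t t = dst + src;
    bool c1 = t < dst;                  /* carry out of dst + src */
    uint64_t r = t + (f->cf ? 1 : 0);
    bool c2 = r < t;                    /* carry out of adding CF */

    f->cf = c1 || c2;
    f->zf = (r == 0);
    return r;
}

/* dec: updates ZF (and the other arithmetic flags) but leaves CF alone. */
static uint64_t model_dec(struct flags *f, uint64_t v)
{
    v -= 1;
    f->zf = (v == 0);
    return v;
}

/* Models an 'adc; dec; jnz' loop followed by a final 'adc $0'. */
static uint64_t model_csum(const uint64_t *buf, uint64_t n)
{
    struct flags f = { false, false };
    uint64_t sum = 0;

    assert(buf != NULL && n > 0);       /* like the asm, needs n > 0 */
    do {
        sum = model_adc(&f, sum, *buf++);
        n = model_dec(&f, n);           /* CF survives the counter update */
    } while (!f.zf);                    /* jnz */

    return model_adc(&f, sum, 0);       /* fold in the last carry */
}
```

If 'dec' were modelled as clearing C, the carry out of each 'adc' would be lost before the next iteration consumed it.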

I can't see any obvious gain from separating out O or Z (even with
adcx and adox). You'd need some other instructions that don't set O (or Z)
but set some other useful flags.
(A decrement that only set Z for instance.)

> > However you still need a loop construct that doesn't modify 'o' or 'c'.
> > Using leal, jcxz, jmp might work.
> > (Unless broadwell actually has a fast 'loop' instruction.)
> 
> According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
> JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
> to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
> uops on those processors.  But 64 bit in general sucked on Prescott,
> so how much do we care?
> 
> AMD:  LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
>       JCXZ is acceptable on all of them.
>       LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
> Intel: LOOP is slow (7+ uops) on all processors up to and including Skylake.
>       JCXZ is 2 uops on everything from P6 to Skylake except for:
>       - Prescott (JCXZ & loop both 4 uops)
>       - 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
>       I can't find any Intel processor that LOOP is fast on.

While LOOP could be used on Bulldozer+, an equivalently fast loop
can be done with inc/dec and jnz.
So you only care about LOOP/JCXZ when ADOX is supported.

I think the fastest loop is:
10:     adc     0(%rdi,%rcx,8),%rax     # sum += word + carry
        inc     %rcx                    # inc leaves carry alone
        jnz     10b
but check if any cpu adds an extra clock for the 'scaled' offset
(it might be faster if %rdi is incremented instead).
That loop looks like it will have no overhead on recent cpus.
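
For anyone wanting to sanity-check the indexing, a rough C model of that loop follows. It is a sketch under assumptions: the function name csum_words_model is invented here, the buffer is taken to be a whole number of 64-bit words, and the final 'adc $0' to fold in the last carry is assumed to follow the loop. The base pointer plays the role of %rdi and points just past the end of the buffer, while the index plays the role of %rcx and counts up from -n to zero, so a single increment both advances the address and provides the termination test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the 'adc (base,index,8); inc index; jnz' loop. */
static uint64_t csum_words_model(const uint64_t *buf, size_t n)
{
    const uint64_t *end;
    ptrdiff_t i = -(ptrdiff_t)n;        /* %rcx: counts up to zero */
    uint64_t sum = 0;                   /* %rax */
    unsigned carry = 0;                 /* the carry flag */

    assert(buf != NULL);
    end = buf + n;                      /* %rdi: one past the last word */

    while (i != 0) {
        uint64_t v = end[i];            /* 0(%rdi,%rcx,8) */
        uint64_t t = sum + v;
        unsigned c = t < sum;           /* carry out of sum + v */

        sum = t + carry;                /* adc: add the incoming carry */
        carry = c | (sum < t);          /* carry out of the whole adc */
        i++;                            /* inc: does not touch carry */
    }

    return sum + carry;                 /* assumed trailing 'adc $0' */
}
```

Unlike the asm, the model tests the index before the first iteration, so n == 0 is harmless here; the three-instruction loop itself would need a guard for an empty buffer.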

        David
