>
> > > -static inline unsigned
> > > +static inline unsigned int
> > >  rte_ring_count(const struct rte_ring *r)
> > >  {
> > >   uint32_t prod_tail = r->prod.tail;
> > >   uint32_t cons_tail = r->cons.tail;
> > >   uint32_t count = (prod_tail - cons_tail) & r->mask;
> > > - return (count > r->capacity) ? r->capacity : count;
> > > + return likely(count <= r->capacity) ? count : r->capacity;
> >
> > Honestly, I don't see there is any point of that change:
> > I think it wouldn't change anything in terms of functionality
> > or performance.
> 
> Chapter 3.4.1 "Branch Prediction Optimization" in the Intel 64 and IA-32 
> Architectures Optimization Reference Manual recommends this
> kind of optimization as Assembly/Compiler Coding Rule 3, which is why I 
> rearranged the trigraph. Essentially, there is a limit to the number
> of BTB (Branch Target Buffer) entries, so they should be conserved if 
> possible.
> 
> In addition to that, I have added the likely() because I consider it nearly 
> impossible that the count will exceed the capacity.
> 
> However, it's not the first time I see this kind of response to a suggested 
> branch optimization on the DPDK mailing list. Everyone seem to
> think that branch prediction is infinite and always works. It may seem as if 
> infinite on trivial applications, but BTB entries may be a scarce
> resource on complex applications. I assume Intel's recommendations are not 
> just for the fun of it.

I think it is better to leave such level of micro-optimizations to the compiler.
BTW, in that particular case, compiler most likely will generate a code
without any branches at all (at least for IA).
Let say on my box with gcc 7.3:

$ cat trc1.c
#include <stdint.h>
#include <rte_config.h>
#include <rte_ring.h>

uint32_t
fffx1(const struct rte_ring *r)
{
        uint32_t prod_tail = r->prod.tail;
        uint32_t cons_tail = r->cons.tail;
        uint32_t count = (prod_tail - cons_tail) & r->mask;
        return (count > r->capacity) ? r->capacity : count;
}

uint32_t
fffx2(const struct rte_ring *r)
{
        uint32_t prod_tail = r->prod.tail;
        uint32_t cons_tail = r->cons.tail;
        uint32_t count = (prod_tail - cons_tail) & r->mask;
        return likely(count <= r->capacity) ? count : r->capacity;
}
   
$ gcc -m64 -O3 -march=native -I${RTE_SDK}/x86_64-native-linuxapp-gcc/include -c 
trc1.c

$ objdump -d trc1.o

0000000000000000 <fffx1>:
   0:   8b 87 84 00 00 00       mov    0x84(%rdi),%eax
   6:   8b 97 04 01 00 00       mov    0x104(%rdi),%edx
   c:   29 d0                   sub    %edx,%eax
   e:   8b 57 38                mov    0x38(%rdi),%edx
  11:   23 47 34                and    0x34(%rdi),%eax
  14:   39 d0                   cmp    %edx,%eax
  16:   0f 47 c2                cmova  %edx,%eax
  19:   c3                      retq
  1a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000020 <fffx2>:
  20:   8b 87 84 00 00 00       mov    0x84(%rdi),%eax
  26:   8b 97 04 01 00 00       mov    0x104(%rdi),%edx
  2c:   29 d0                   sub    %edx,%eax
  2e:   8b 57 38                mov    0x38(%rdi),%edx
  31:   23 47 34                and    0x34(%rdi),%eax
  34:   39 d0                   cmp    %edx,%eax
  36:   0f 47 c2                cmova  %edx,%eax
  39:   c3                      retq

As you can see, there is no difference.

> 
> Konstantin, please note that I'm letting out my frustration about the general 
> misconception about branch prediction here. You are doing a
> great job, so I feel bad about responding like this to you.

No worries, in fact I am glad to know that DPDK contributors
read IA optimization manual that thoughtfully 😊

Konstantin

Reply via email to