> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Ananyev, > Konstantin > Sent: Thursday, May 14, 2020 6:47 PM > > > > > > -static inline unsigned > > > > +static inline unsigned int > > > > rte_ring_count(const struct rte_ring *r) > > > > { > > > > uint32_t prod_tail = r->prod.tail; > > > > uint32_t cons_tail = r->cons.tail; > > > > uint32_t count = (prod_tail - cons_tail) & r->mask; > > > > - return (count > r->capacity) ? r->capacity : count; > > > > + return likely(count <= r->capacity) ? count : r->capacity; > > > > > > Honestly, I don't see there is any point of that change: > > > I think it wouldn't change anything in terms of functionality > > > or performance. > > > > Chapter 3.4.1 "Branch Prediction Optimization" in the Intel 64 and > IA-32 Architectures Optimization Reference Manual recommends this > > kind of optimization as Assembly/Compiler Coding Rule 3, which is why > I rearranged the trigraph. Essentially, there is a limit to the number > > of BTB (Branch Target Buffer) entries, so they should be conserved if > possible. > > > > In addition to that, I have added the likely() because I consider it > nearly impossible that the count will exceed the capacity. > > > > However, it's not the first time I see this kind of response to a > suggested branch optimization on the DPDK mailing list. Everyone seem > to > > think that branch prediction is infinite and always works. It may > seem as if infinite on trivial applications, but BTB entries may be a > scarce > > resource on complex applications. I assume Intel's recommendations > are not just for the fun of it. > > I think it is better to leave such level of micro-optimizations to the > compiler. > BTW, in that particular case, compiler most likely will generate a code > without any branches at all (at least for IA). > Let say on my box with gcc 7.3: > > $ cat trc1.c > #include <stdint.h> > #include <rte_config.h> > #include <rte_ring.h> > > uint32_t > fffx1(const struct rte_ring *r) > { > uint32_t prod_tail = r->prod.tail; > uint32_t cons_tail = r->cons.tail; > uint32_t count = (prod_tail - cons_tail) & r->mask; > return (count > r->capacity) ? r->capacity : count; > } > > uint32_t > fffx2(const struct rte_ring *r) > { > uint32_t prod_tail = r->prod.tail; > uint32_t cons_tail = r->cons.tail; > uint32_t count = (prod_tail - cons_tail) & r->mask; > return likely(count <= r->capacity) ? count : r->capacity; > } > > $ gcc -m64 -O3 -march=native -I${RTE_SDK}/x86_64-native-linuxapp- > gcc/include -c trc1.c > > $ objdump -d trc1.o > > 0000000000000000 <fffx1>: > 0: 8b 87 84 00 00 00 mov 0x84(%rdi),%eax > 6: 8b 97 04 01 00 00 mov 0x104(%rdi),%edx > c: 29 d0 sub %edx,%eax > e: 8b 57 38 mov 0x38(%rdi),%edx > 11: 23 47 34 and 0x34(%rdi),%eax > 14: 39 d0 cmp %edx,%eax > 16: 0f 47 c2 cmova %edx,%eax > 19: c3 retq > 1a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) > > 0000000000000020 <fffx2>: > 20: 8b 87 84 00 00 00 mov 0x84(%rdi),%eax > 26: 8b 97 04 01 00 00 mov 0x104(%rdi),%edx > 2c: 29 d0 sub %edx,%eax > 2e: 8b 57 38 mov 0x38(%rdi),%edx > 31: 23 47 34 and 0x34(%rdi),%eax > 34: 39 d0 cmp %edx,%eax > 36: 0f 47 c2 cmova %edx,%eax > 39: c3 retq > > As you can see, there is no difference. >
Thank you for the detailed feedback. Reality trumps theory, so I will leave the count function as is. :-) > > > > Konstantin, please note that I'm letting out my frustration about the > general misconception about branch prediction here. You are doing a > > great job, so I feel bad about responding like this to you. > > No worries, in fact I am glad to know that DPDK contributors > read IA optimization manual that thoughtfully 😊 > > Konstantin