Hi Ashwin,

On 2 May 2017 at 19:47, Sekhar, Ashwin <ashwin.sek...@cavium.com> wrote:
> Hi Jianbo,
>
> I tested your neon changes on thunderx. I am seeing a performance
> regression of ~10% for LPM case and ~20% for EM case with your changes.
> Did you see improvement on any arm64 platform with these changes. If
> yes, how much was the improvement?

Thanks for reviewing and testing. For some reason, I have not done much
performance testing yet. I'll send a new version later, after tuning the
performance.

Thanks!
Jianbo

>
> FYI, I had also tried vectorizing the l3fwd app with neon. Few of the
> optimizations that I can suggest that helped in my case.
>
> * Packet data prefetch is missing in the x86 sse version compared to
> the scalar version (l3fwd_lpm_send_packets vs
> l3fwd_lpm_no_opt_send_packets). I couldn't understand why this was not
> done in x86. But adding the prefetch was improving performance for
> thunderx.
>
> * Offsets to some packet elements like eth_hdr, ip header, packet type
> etc. are recalculated in different functions. Calculating them once,
> caching them and passing them directly to different functions was
> improving performance.
>
> * There are 3 different loops in l3fwd_lpm_send_packets where we
> iterate over the packets. One each for processx4_step1 and
> processx4_step2 and one in send_packets_multi. Unifying these loops
> were also helping.
>
> Thanks and Regards
> Ashwin
>
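
For reference, the three suggestions quoted above (prefetch packet data,
compute header offsets once, walk the burst in a single loop) could be
combined roughly as in the sketch below. This is only an illustration and
not the l3fwd code: struct pkt, struct pkt_hdrs, lookup_port() and
forward_burst() are hypothetical names, the lookup is a placeholder
rather than a real LPM or EM lookup, and __builtin_prefetch is the
GCC/Clang builtin, assumed here instead of any DPDK helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14  /* untagged Ethernet header length */
#define PREFETCH_AHEAD 2   /* assumed prefetch distance; needs tuning per CPU */

/* Hypothetical stand-in for a packet buffer; not the rte_mbuf layout. */
struct pkt {
    uint8_t  data[64];  /* Ethernet header followed by the IPv4 header */
    uint16_t port;      /* output port chosen by the lookup */
};

/* Header pointers computed once per packet and passed to every helper. */
struct pkt_hdrs {
    const uint8_t *eth;
    const uint8_t *ip;
};

/* Placeholder lookup: derives a port from the last byte of the IPv4
 * destination address (byte 19 of the IPv4 header). */
static uint16_t lookup_port(const struct pkt_hdrs *h)
{
    return (uint16_t)(h->ip[19] & 0x3);
}

/* Single pass over the burst: prefetch a later packet, locate the
 * headers once, then do the lookup with the cached pointers. */
static void forward_burst(struct pkt **pkts, size_t n)
{
    assert(pkts != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Start pulling a later packet into cache while this one is handled. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(pkts[i + PREFETCH_AHEAD]->data, 0, 3);

        /* Offsets are computed here once instead of in each helper. */
        struct pkt_hdrs h;
        h.eth = pkts[i]->data;
        h.ip  = h.eth + ETH_HDR_LEN;

        pkts[i]->port = lookup_port(&h);
    }
}
```

Whether any of this helps would still have to be measured per platform;
the prefetch distance in particular is a guess.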