Hi Jianbo,

I tested your NEON changes on ThunderX. With your changes I am seeing a performance regression of ~10% for the LPM case and ~20% for the EM case. Did you see an improvement on any arm64 platform with these changes? If yes, how much was the improvement?

FYI, I had also tried vectorizing the l3fwd app with NEON. Here are a few optimizations that helped in my case:

* Packet data prefetch is missing in the x86 SSE version compared to the scalar version (l3fwd_lpm_send_packets vs l3fwd_lpm_no_opt_send_packets). I couldn't understand why this was not done for x86, but adding the prefetch improved performance on ThunderX.
* Offsets to some packet elements such as eth_hdr, the IP header, packet type etc. are recalculated in different functions. Calculating them once, caching them and passing them directly to the different functions improved performance.
* There are 3 different loops in l3fwd_lpm_send_packets where we iterate over the packets: one each for processx4_step1 and processx4_step2, and one in send_packets_multi. Unifying these loops also helped.

Thanks and Regards
Ashwin
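
For illustration, the three suggestions (prefetching packet data, computing header offsets once, and iterating over the burst in a single loop) could be combined roughly as in the sketch below. This is not the actual l3fwd code. struct pkt, struct pkt_hdrs, lookup_port and forward_burst are hypothetical names made up for this example, the lookup is a dummy, and __builtin_prefetch (a GCC/Clang builtin) stands in for the prefetch helper that DPDK would normally provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; NOT the DPDK rte_mbuf layout. */
struct pkt {
    uint8_t *data;   /* start of the Ethernet frame */
    uint16_t l2_len; /* length of the Ethernet header in bytes */
};

/* Header pointers computed once per packet and passed to later steps. */
struct pkt_hdrs {
    const uint8_t *eth; /* Ethernet header */
    const uint8_t *ip;  /* IPv4 header */
};

/*
 * Hypothetical lookup: derives a port from the last byte of the IPv4
 * destination address (bytes 16..19 of the IPv4 header). A real
 * application would do an LPM or exact-match lookup here.
 */
static uint16_t lookup_port(const struct pkt_hdrs *h)
{
    return (uint16_t)(h->ip[19] & 0x3);
}

/*
 * One pass over the burst instead of separate loops per step:
 *  - prefetch the data of the next packet,
 *  - compute the header pointers once,
 *  - do the lookup using the cached pointers.
 */
static void forward_burst(struct pkt *pkts, uint16_t n, uint16_t *dst_port)
{
    for (uint16_t i = 0; i < n; i++) {
        /* Pull the next packet's headers into cache while we work on this one. */
        if (i + 1 < n)
            __builtin_prefetch(pkts[i + 1].data);

        /* Offsets are calculated once and reused by every later step. */
        struct pkt_hdrs h;
        h.eth = pkts[i].data;
        h.ip = h.eth + pkts[i].l2_len;

        dst_port[i] = lookup_port(&h);
    }
}
```

Whether fusing the loops like this helps will depend on the platform and on how well the compiler can still vectorize the loop body, so it would need to be measured per target.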