Quad loops (4+4) are sometimes effective. See ip4_lookup_inline(...) and dpdk_device_input(...) for examples. The limiting factor: gcc runs out of registers.
I've yet to discover a case where a sextuple loop (6+6) would seem likely to be effective. Please let folks know if you discover one.

Sometimes changing the prefetch stride to prefetch two loop iterations ahead is helpful: instead of prefetching buffers[2] and buffers[3], prefetch buffers[4] and buffers[5]. The n_left_from constraint in the dual loop must then be adjusted to avoid indexing off the end of the array.

HTH... Dave

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Mli
Sent: Friday, November 18, 2016 2:55 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] vpp node implementation

We plan to implement a new node in VPP. I have the following two small questions.

(1) 2 + 2 principle

Each VPP node first prefetches the next 2 packets (see the following code) and then processes the 2 current packets. Must we follow this programming style? Can we do 4+4 or 6+6?

{
  vlib_prefetch_buffer_header(...);
  vlib_prefetch_buffer_header(...);
  CLIB_PREFETCH(...);
  CLIB_PREFETCH(...);
}

(2) Code size for one node

I know that one node's instructions should fit into the ICache for performance reasons. But current x86 has a 32K ICache; do we need to consider the code size of one node?

Thx
Ming
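
For readers following the thread, the stride change described in the reply can be sketched roughly as below. This is a toy illustration, not VPP code: toy_buffer_t, process_one(), process_frame() and PREFETCH_AHEAD are hypothetical names, and GCC/Clang's __builtin_prefetch stands in for vlib_prefetch_buffer_header() and CLIB_PREFETCH(). The point is only that prefetching buffers[4] and buffers[5] means the dual loop may run only while at least 6 buffers remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; NOT the real vlib_buffer_t. */
typedef struct
{
  uint32_t len;
  uint8_t data[60];
} toy_buffer_t;

/* Hypothetical per-packet work: here it just reads the length. */
static inline uint32_t
process_one (const toy_buffer_t * b)
{
  return b->len;
}

/* Loop iterations to prefetch ahead (assumption for this sketch).
   1 prefetches buffers[2], buffers[3]; 2 prefetches buffers[4], buffers[5]. */
#define PREFETCH_AHEAD 2

uint32_t
process_frame (toy_buffer_t ** buffers, size_t n_left_from)
{
  uint32_t total = 0;
  /* Index of the first buffer to prefetch: 2 packets per iteration. */
  const size_t off = 2 * PREFETCH_AHEAD;

  assert (buffers != NULL || n_left_from == 0);

  /* Dual loop: 2 packets to process now, plus everything up to
     buffers[off + 1] must exist, hence n_left_from >= off + 2. */
  while (n_left_from >= off + 2)
    {
      __builtin_prefetch (buffers[off], 0 /* read */ , 3 /* keep in cache */ );
      __builtin_prefetch (buffers[off + 1], 0, 3);

      total += process_one (buffers[0]);
      total += process_one (buffers[1]);

      buffers += 2;
      n_left_from -= 2;
    }

  /* Single loop: handle whatever the dual loop left behind. */
  while (n_left_from > 0)
    {
      total += process_one (buffers[0]);
      buffers += 1;
      n_left_from -= 1;
    }

  return total;
}
```

With PREFETCH_AHEAD set to 1 the guard becomes n_left_from >= 4, which corresponds to the usual buffers[2] / buffers[3] prefetch. Whether the larger stride actually helps depends on the node and has to be measured.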