Quad loops (4+4) are sometimes effective. See ip4_lookup_inline(...) and 
dpdk_device_input(...) for examples. The limiting factor: gcc runs out of 
registers.
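
For anyone who has not seen the shape of a quad loop, here is a minimal sketch. It is not taken from ip4_lookup_inline(...) or dpdk_device_input(...). The toy_pkt_t type and the sum_lengths_quad function are hypothetical stand-ins, and __builtin_prefetch is the gcc/clang builtin, used here in place of the VPP prefetch macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not the real vlib_buffer_t. */
typedef struct
{
  uint32_t len;
  uint8_t data[60];
} toy_pkt_t;

/* Sum buffer lengths with a quad loop: process 4, prefetch the next 4. */
static uint64_t
sum_lengths_quad (toy_pkt_t **b, size_t n_left)
{
  uint64_t total = 0;

  assert (b != NULL || n_left == 0);

  /* Need 8 valid entries: 4 to process now, 4 to prefetch. */
  while (n_left >= 8)
    {
      __builtin_prefetch (b[4]);
      __builtin_prefetch (b[5]);
      __builtin_prefetch (b[6]);
      __builtin_prefetch (b[7]);

      /* Four packets in flight at once means four sets of live locals,
         which is where register pressure starts to bite. */
      uint32_t l0 = b[0]->len;
      uint32_t l1 = b[1]->len;
      uint32_t l2 = b[2]->len;
      uint32_t l3 = b[3]->len;
      total += (uint64_t) l0 + l1 + l2 + l3;

      b += 4;
      n_left -= 4;
    }

  /* Single loop drains the remaining buffers one at a time. */
  while (n_left > 0)
    {
      total += b[0]->len;
      b += 1;
      n_left -= 1;
    }

  return total;
}
```

A real node does far more per packet than read one field, so the number of live variables per iteration grows quickly with the loop width.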

I've yet to discover a case where a sextuple loop (6+6) would seem likely to be 
effective. Please let folks know if you discover one.

Sometimes changing the prefetch stride to prefetch two loop iterations ahead is 
helpful. Instead of prefetching buffers[2] and buffers[3], prefetch buffers[4] 
and buffers[5]. The n_left_from constraint in the dual loop must be adjusted to 
avoid indexing off the end of the array.
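
A minimal sketch of the adjusted constraint, assuming a simplified buffer type. toy_buffer_t and sum_lengths_stride2 are hypothetical names, not VPP code, and __builtin_prefetch is the gcc/clang builtin standing in for the VPP prefetch macros. The point is the loop guard: prefetching buffers[4] and buffers[5] requires at least 6 valid entries, so the usual "n_left_from >= 4" test becomes "n_left_from >= 6".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not the real vlib_buffer_t. */
typedef struct
{
  uint32_t len;
  uint8_t data[60];
} toy_buffer_t;

/* Sum buffer lengths with a dual loop that prefetches two loop
   iterations (four buffers) ahead of the pair being processed. */
static uint64_t
sum_lengths_stride2 (toy_buffer_t **buffers, size_t n_left_from)
{
  uint64_t total = 0;

  assert (buffers != NULL || n_left_from == 0);

  /* buffers[5] must be a valid entry, hence >= 6 rather than >= 4. */
  while (n_left_from >= 6)
    {
      __builtin_prefetch (buffers[4]);
      __builtin_prefetch (buffers[5]);

      total += buffers[0]->len;
      total += buffers[1]->len;

      buffers += 2;
      n_left_from -= 2;
    }

  /* Single loop handles the tail, now up to 5 buffers instead of 3. */
  while (n_left_from > 0)
    {
      total += buffers[0]->len;
      buffers += 1;
      n_left_from -= 1;
    }

  return total;
}
```

One side effect of the longer stride: the first four buffers of each frame are processed without having been prefetched, and a longer tail falls through to the single loop.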

HTH... Dave

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On 
Behalf Of Mli
Sent: Friday, November 18, 2016 2:55 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] vpp node implementation

We plan to implement a new node in VPP. I have the following 2 small questions:


(1)   2 + 2 principle



For each VPP node, it first prefetches the next 2 packets (see the following 
code) and then processes the 2 current packets. Must we follow this 
programming style? Can we do 4+4 or 6+6?

{
    vlib_prefetch_buffer_header(...);
    vlib_prefetch_buffer_header(...);

    CLIB_PREFETCH(...);
    CLIB_PREFETCH(...);
}



(2)   Code size for one node

I know that one node's instructions should fit into the ICache for performance 
reasons. But current x86 CPUs have a 32K ICache, so do we need to consider the 
code size of one node?



Thx



Ming

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
