> On 12 Jul 2019, at 06:14, Christian Hopps <cho...@chopps.org> wrote:
>
> [re-sending w/o extra email addresses]
>
> Hi vpp-dev,
>
> So I'm writing a plugin to support IPTFS
> (https://tools.ietf.org/html/draft-hopps-ipsecme-iptfs-01), and during this
> process I've written a macro to support prefetching the second half of N
> (where N is a power of 2) buffers while acting on the first half (i.e., the
> standard pattern of prefetch code in VPP).
>
> The macro allows one to simply write the "single buffer" case code, and not
> have to have copies of the code for each power of 2 to be supported.
>
> One slight variation on standard VPP style is that the macro and code use
> beginning and end pointers for buffer index arrays instead of an array index
> and array length.
>
> I'm sending this mail b/c it may be a little while before I can submit my
> code, and the arm guys are currently adding prefetch code, so I thought this
> could be useful for that. Also I might get some useful feedback on the code
> as well. :)
>
> Here are the prefetch macros:
>
> static inline int
> half_range (vlib_buffer_t **start, vlib_buffer_t **end, int limit)
> {
>   u32 range = (end - start) / 2;
>   return range < limit ? range : limit;
> }
>
> #define FOREACH_PREFETCH(b, eb) \
>   do \
>     { \
>       vlib_buffer_t **bp, **ebp; \
>       for (u32 half = half_range ((b), (eb), CLIB_N_PREFETCHES); (b) < (eb); \
>            half = half_range ((b), (eb), CLIB_N_PREFETCHES)) \
>         { \
>           /* prefetch second half */ \
>           if (half) \
>             for (bp = (b) + half, ebp = bp + half; bp < ebp; bp++) \
>               vlib_prefetch_buffer_header (*bp, LOAD); \
>           else \
>             half = 1; /* Process at least one next */ \
>                                                        \
>           /* process first half or last one */ \
>           for (ebp = (b) + half; (b) < ebp && to < eto; (b)++)
>
> #define END_FOREACH_PREFETCH \
>         } \
>     } \
>   while (0)
>
> Here are the macros I use to handle filling up next frames in parallel
> and working with arrays using end pointers vs. indices and lengths.
>
> /* in node_funcs.h */
>
> #define vlib_get_next_frame_macro_p(vm, node, next_index, vectors, evectors, alloc_new_frame) \
>   do \
>     { \
>       vlib_frame_t *_f \
>           = vlib_get_next_frame_internal ((vm), (node), (next_index), \
>                                           (alloc_new_frame)); \
>       u32 _n = _f->n_vectors; \
>       (vectors) = vlib_frame_vector_args (_f) + _n * sizeof ((vectors)[0]); \
>       (evectors) = (vectors) + (VLIB_FRAME_SIZE - _n); \
>     } \
>   while (0)
>
> #define vlib_get_next_frame_p(vm, node, next_index, vectors, evectors) \
>   vlib_get_next_frame_macro_p (vm, node, next_index, \
>                                vectors, evectors, \
>                                /* alloc new frame */ 0)
>
> /* in iptfs code.. */
>
> #define vlib_put_get_next_frame(vm, node, ni, to, eto) \
>   do \
>     { \
>       /* Put the frame if it is full */ \
>       if ((to) && (eto) != (to)) \
>         ; \
>       else \
>         { \
>           if ((to)) \
>             vlib_put_next_frame ((vm), (node), (ni), (eto) - (to)); \
>           vlib_get_next_frame_p ((vm), (node), (ni), (to), (eto)); \
>         } \
>     } \
>   while (0)
>
> #define vlib_put_get_next_frame_a(vm, node, ni, toa, etoa) \
>   vlib_put_get_next_frame (vm, node, ni, (toa)[(ni)], (etoa)[(ni)])
>
> Finally, here's the payoff: a slightly trimmed-down use of the above using
> only a single copy of the functional code. In this case it's going to start
> with prefetching 8 buffers and handling 8, then 4, then 2, then 1, then none
> (it starts with 8 b/c CLIB_N_PREFETCHES is 16, but any power of 2 would
> work). Anyway, the normal pattern in VPP would require 5 copies of the code
> in this case instead of the single copy used here.
> iptfs_encap_node_inline (vlib_main_t *vm, vlib_node_runtime_t *node,
>                          vlib_frame_t *frame, int is_tun)
> {
>   vlib_buffer_t *bufs[VLIB_FRAME_SIZE];
>   vlib_buffer_t **b = bufs, **eb = bufs + frame->n_vectors;
>   vlib_get_buffers (vm, vlib_frame_vector_args (frame), bufs,
>                     frame->n_vectors);
>   u32 *to[IPTFS_ENCAP_N_NEXT] = {};
>   u32 *eto[IPTFS_ENCAP_N_NEXT] = {};
>
>   while (b < eb)
>     {
>       FOREACH_PREFETCH (b, eb)
>       {
>         /*
>          * Handle single buffer found in (*b)
>          */
>
>         u8 *data = vlib_buffer_get_current (*b);
>
>         /* ... do stuff with packet in buffer (*b) ... */
>
>         if (send to SOME_NEXT_NODE condition)
>           {
>             ni = SOME_NEXT_NODE_INDEX;
>             vlib_put_get_next_frame_a (vm, node, ni, to, eto);
>             *to[ni]++ = vlib_get_buffer_index (vm, *b);
>             continue;
>           }
>         if (drop condition)
>           {
>           dropit:
>             vlib_put_get_next_frame_a (vm, node, IPTFS_ENCAP_NEXT_DROP,
>                                        to, eto);
>             *to[IPTFS_ENCAP_NEXT_DROP]++ = vlib_get_buffer_index (vm, *b);
>             (*b)->error = node->errors[IPTFS_ENCAP_ERROR_Q_FULL];
>             continue;
>           }
>       }
>       END_FOREACH_PREFETCH;
>     }
>
>   /* Put any next frames we've used */
>   for (uint i = 0; i < IPTFS_ENCAP_N_NEXT; i++)
>     if (to[i])
>       vlib_put_next_frame (vm, node, i, eto[i] - to[i]);
>
>   return frame->n_vectors;
> }
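For context, the "standard pattern of prefetch code in vpp" referred to above is the hand-unrolled quad/single loop found throughout VPP nodes. The following is a minimal sketch only, not taken from the iptfs code: the per-packet work is folded into a hypothetical process_one() helper and the next-frame enqueueing is omitted.

  #include <vlib/vlib.h>

  /* Hypothetical stand-in for the real per-packet work. */
  static_always_inline void
  process_one (vlib_main_t *vm, vlib_buffer_t *b)
  {
    /* ... parse/rewrite the packet in b ... */
  }

  static uword
  example_quad_single_loop (vlib_main_t *vm, vlib_node_runtime_t *node,
                            vlib_frame_t *frame)
  {
    vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;
    u32 n_left = frame->n_vectors;

    vlib_get_buffers (vm, vlib_frame_vector_args (frame), bufs, n_left);

    /* quad loop: work on b[0..3] while prefetching the headers of b[4..7] */
    while (n_left >= 4)
      {
        if (n_left >= 8)
          {
            vlib_prefetch_buffer_header (b[4], LOAD);
            vlib_prefetch_buffer_header (b[5], LOAD);
            vlib_prefetch_buffer_header (b[6], LOAD);
            vlib_prefetch_buffer_header (b[7], LOAD);
          }

        process_one (vm, b[0]);
        process_one (vm, b[1]);
        process_one (vm, b[2]);
        process_one (vm, b[3]);

        b += 4;
        n_left -= 4;
      }

    /* single loop: finish the tail one buffer at a time */
    while (n_left > 0)
      {
        process_one (vm, b[0]);
        b += 1;
        n_left -= 1;
      }

    /* enqueueing to next nodes omitted for brevity */
    return frame->n_vectors;
  }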
Hi Christian,

You are not the first one trying to do that; a few of us did and we all failed :). A few comments:

1. It is going to be painful to use "perf top" on this code, as annotation will be useless when those macros are used.

2. While most of our nodes are just mechanically duplicated, to get the best performance you need to manually interleave prefetches with the rest of the code (a good example is the dpdk-input node).

3. I don't see how your macro will work with double prefetch strides, i.e. prefetching data which depends on b->current_data.

4. CLIB_N_PREFETCHES can hardly be 16, as the number of outstanding prefetches on x86 is 12 and that number has not changed for many years. Besides that, those 12 slots are shared with 2 hardware prefetchers, so it is hard to expect that you will have more than 8 available. On arm I think it is even less, especially on low-end CPUs.

5. You should consider using vlib_buffer_enqueue_to_next() instead of the vlib_get_next_frame/vlib_put_next_frame scheme (a rough sketch follows below).

6. The use of vlib_get_buffer_index() in your code is sub-optimal; there is no sense in converting a buffer pointer back to an index if you already have the buffer index. vlib_get_buffer_index() will probably eat 1-2 clocks for no reason (it does a shift + sub).

My personal opinion is that manual duplication of the quad, dual and single loops is, in the worst case, a lesser evil than introducing that kind of macro. You will save a few lines of code but you will also lose the freedom of fine-tuning the code.

Hope this helps :)

— Damjan
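A rough sketch of the vlib_buffer_enqueue_to_next() scheme mentioned in point 5, assuming a hypothetical example_next_t enum and a placeholder drop test. Note that the buffer indices from the frame are passed through unchanged, so no vlib_get_buffer_index() conversion is needed (point 6).

  #include <vlib/vlib.h>
  #include <vlib/buffer_node.h> /* vlib_buffer_enqueue_to_next */

  /* Hypothetical next indices, for illustration only. */
  typedef enum
  {
    EXAMPLE_NEXT_DROP,
    EXAMPLE_NEXT_OUTPUT,
    EXAMPLE_N_NEXT,
  } example_next_t;

  static uword
  example_node_inline (vlib_main_t *vm, vlib_node_runtime_t *node,
                       vlib_frame_t *frame)
  {
    u32 *from = vlib_frame_vector_args (frame);
    u32 n_left = frame->n_vectors;
    vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;
    u16 nexts[VLIB_FRAME_SIZE], *next = nexts;

    vlib_get_buffers (vm, from, bufs, n_left);

    while (n_left > 0)
      {
        /* placeholder decision; real code would classify the packet here */
        if (b[0]->current_length == 0)
          next[0] = EXAMPLE_NEXT_DROP;
        else
          next[0] = EXAMPLE_NEXT_OUTPUT;

        b += 1;
        next += 1;
        n_left -= 1;
      }

    /* hand the original buffer indices plus per-packet next indices to vlib;
       it batches them into next frames internally */
    vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
    return frame->n_vectors;
  }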