On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> 	__u32 frame_headroom; /* Frame head room */
> };
> 
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_QUEUE 0x100000000
> +
> +struct xdp_queue {
> +	__u32 head_idx __attribute__((aligned(64)));
> +	__u32 tail_idx __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_queue {
> +	struct xdp_queue ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
> #endif /* _LINUX_IF_XDP_H */
So IIUC it's a head/tail ring of 32-bit descriptors.

In my experience (from implementing ptr_ring) this implies that the
head/tail cache lines bounce a lot between CPUs. Caching will help
some. You are also forced to use barriers to check validity, which is
slow on some architectures.

If instead you can use a special descriptor value (e.g. 0) as a valid
signal, things work much better:

- you read a descriptor atomically; if it's not 0, it's valid
- same with writes
- write 0 to pass the slot back to the other side
- there is a data dependency, so no need for barriers (except on DEC Alpha)
- no power-of-2 size limitation: you can make the ring any size you like
- easy to resize, too

The architecture (if not the implementation) would be shared with
ptr_ring, so some of the optimization ideas, like batched updates,
could be lifted from there. A sketch of the scheme follows below.

When I was building ptr_ring, any head/tail design underperformed
storing the valid flag with the data itself. YMMV.
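To make the idea concrete, here is a rough single-producer/single-consumer
sketch of such a ring. All names are made up for illustration, and it is
written with C11 atomics for portability; ptr_ring itself uses the kernel's
READ_ONCE()/smp_store_release() primitives, and I use acquire/release below
rather than relying on the data dependency, since C11's consume ordering is
not usefully implemented by compilers:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1000	/* any size works; no power-of-2 requirement */

struct val_ring {
	/* Each side keeps its cursor private; only slot contents are shared. */
	unsigned int prod;			/* producer-local cursor */
	unsigned int cons;			/* consumer-local cursor */
	_Atomic uint32_t slot[RING_SIZE];	/* 0 == empty slot */
};

/* Producer side: returns false if the ring is full. desc must be != 0. */
static bool ring_produce(struct val_ring *r, uint32_t desc)
{
	/* A non-zero slot means the consumer has not released it yet. */
	if (atomic_load_explicit(&r->slot[r->prod], memory_order_acquire))
		return false;
	atomic_store_explicit(&r->slot[r->prod], desc, memory_order_release);
	if (++r->prod == RING_SIZE)
		r->prod = 0;
	return true;
}

/* Consumer side: returns 0 if the ring is empty. */
static uint32_t ring_consume(struct val_ring *r)
{
	uint32_t desc = atomic_load_explicit(&r->slot[r->cons],
					     memory_order_acquire);
	if (!desc)
		return 0;
	/* Storing 0 hands the slot back to the producer. */
	atomic_store_explicit(&r->slot[r->cons], 0, memory_order_release);
	if (++r->cons == RING_SIZE)
		r->cons = 0;
	return desc;
}

The cost is that one descriptor value (0 here) becomes unusable, and
each slot is written twice per trip around the ring; in exchange,
neither side ever touches the other side's index, so the only cache
line that moves between CPUs is the slot itself.

-- 
MST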