On Thu, 12 Apr 2018 07:38:25 +0000 "Karlsson, Magnus" <magnus.karls...@intel.com> wrote:
> > -----Original Message----- > > From: Michael S. Tsirkin [mailto:m...@redhat.com] > > Sent: Thursday, April 12, 2018 4:16 AM > > To: Björn Töpel <bjorn.to...@gmail.com> > > Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H > > <alexander.h.du...@intel.com>; alexander.du...@gmail.com; > > john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com; > > willemdebruijn.ker...@gmail.com; dan...@iogearbox.net; > > netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg, > > Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali > > <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>; > > ravineet.si...@ericsson.com > > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and > > mmap > > > > On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote: > > > @@ -30,4 +31,18 @@ struct xdp_umem_reg { > > > __u32 frame_headroom; /* Frame head room */ }; > > > > > > +/* Pgoff for mmaping the rings */ > > > +#define XDP_UMEM_PGOFF_FILL_QUEUE 0x100000000 > > > + > > > +struct xdp_queue { > > > + __u32 head_idx __attribute__((aligned(64))); > > > + __u32 tail_idx __attribute__((aligned(64))); }; > > > + > > > +/* Used for the fill and completion queues for buffers */ struct > > > +xdp_umem_queue { > > > + struct xdp_queue ptrs; > > > + __u32 desc[0] __attribute__((aligned(64))); }; > > > + > > > #endif /* _LINUX_IF_XDP_H */ > > > > So IIUC it's a head/tail ring of 32 bit descriptors. > > > > In my experience (from implementing ptr_ring) this implies that head/tail > > cache lines bounce a lot between CPUs. Caching will help some. You are also > > forced to use barriers to check validity which is slow on some > > architectures. > > > > If instead you can use a special descriptor value (e.g. 0) as a valid > > signal, > > things work much better: > > > > - you read descriptor atomically, if it's not 0 it's fine > > - same with write - write 0 to pass it to the other side > > - there is a data dependency so no need for barriers (except on dec alpha) > > - no need for power of 2 limitations, you can make it any size you like > > - easy to resize too > > > > architecture (if not implementation) would be shared with ptr_ring so some > > of the optimization ideas like batched updates could be lifted from there. > > > > When I was building ptr_ring, any head/tail design underperformed storing > > valid flag with data itself. YMMV. I fully agree with MST here. This is also my experience. I even dropped my own Array-based Lock-Free (ALF) queue implementation[1] in favor of ptr_ring. (Where I try to amortize this cost by bulking, but this cause the queue to become non-wait-free) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h > I think you are definitely right in that there are ways in which > we can improve performance here. That said, the current queue > performs slightly better than the previous one we had that was > more or less a copy of one of your first virtio 1.1 proposals > from little over a year ago. It had bidirectional queues and a > valid flag in the descriptor itself. The reason we abandoned this > was not poor performance (it was good), but a need to go to > unidirectional queues. Maybe I should have only changed that > aspect and kept the valid flag. > > Anyway, I will take a look at ptr_ring and run some experiments > along the lines of what you propose to get some > numbers. Considering your experience with these kind of > structures, you are likely right. I just need to convince > myself :-). When benchmarking, be careful that you don't measure the "wrong" queue situation. When doing this kind of "overload" benchmarking, you will likely create a situation where the queue is always full (which hopefully isn't a production use-case). In the almost/always full queue situation, using the element values to sync-on (like MST propose) will still cause the cache-line bouncing (that we want to avoid). MST explain and have addressed this situation for ptr_ring in: commit fb9de9704775 ("ptr_ring: batch ring zeroing") https://git.kernel.org/torvalds/c/fb9de9704775 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer