> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Tuesday, 5 November 2024 19.59
> 
> On Sat, 2 Nov 2024 23:28:49 +0100
> Morten Brørup <m...@smartsharesystems.com> wrote:
> 
> > > > >
> > > > > Probably the hardest part of using io_uring is figuring out
> > > > > how to collect completions. The simplest way would be to
> > > > > handle all completions rx and tx in the rx_burst function.
> > > >
> > > > Please don't mix RX and TX, unless explicitly requested by the
> > > > application through the recently introduced "mbuf recycle"
> > > > feature.
> > >
> > > The issue is Rx and Tx share a single fd and ioring for
> > > completion is per fd.
> > > The implementation for ioring came from the storage side so
> > > initially it was for fixing the broken Linux AIO support.
> > >
> > > Some other devices only have single interrupt or ring shared
> > > with rx/tx so not unique.
> > > Virtio, netvsc, and some NIC's.
> > >
> > > The problem is that if Tx completes descriptors then there needs
> > > to be locking to prevent Rx thread and Tx thread overlapping.
> > > And a spin lock is a performance buzz kill.
> >
> > Brainstorming a bit here...
> > What if the new TAP io_uring PMD is designed to use two io_urings per
> > port, one for RX and another one for TX on the same TAP interface?
> > This requires that a TAP interface can be referenced via two file
> > descriptors (one fd for the RX io_uring and another fd for the TX
> > io_uring), e.g. by using dup() to create the additional file
> > descriptor. I don't know if this is possible, and if it works with
> > io_uring.
> 
> There a couple of problems with multiple fd's.
>   - multiple fds pointing to same internal tap queue are not going to
> get completed separately.
>   - when multi-proc is supported, limit of 253 fd's in Unix domain IPC
> comes into play
>   - tap does not support tx only fd for queues. If fd is queue of tap,
> receive fan out will go to it.
> 
> If DPDK was more flexible, harvesting of completion could be done via
> another thread but that is not general enough
> to work transparently with all applications.  Existing TAP device plays
> with SIGIO, but signals are slower.

I have now read up a bit about io_uring, so here are some thoughts and ideas...

To avoid locking, there should be only one writer of the io_uring Submission 
Queue (i.e. of Submission Queue Entries, SQEs) and only one reader of the 
io_uring Completion Queue (i.e. of Completion Queue Entries, CQEs) per TAP 
interface.

From what I understand, the TAP io_uring PMD only supports one RX queue per 
port and one TX queue per port (i.e. per TAP interface). We can take advantage 
of this:

We can use rte_tx() as the Submission Queue writer and rte_rx() as the 
Completion Queue reader.

The PMD must have two internal rte_rings: one for RX refill events and one 
for TX completion events.
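
For illustration, the per-port private data could look roughly like this (an 
untested sketch; the struct and field names are made up, not taken from any 
existing driver):

#include <liburing.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

/* Illustrative per-port private data for the first design: one io_uring per
 * TAP interface, shared by the single RX queue and the single TX queue, plus
 * two SP/SC rte_rings for handing work from rte_rx() to rte_tx(). */
struct tap_ioring_port {
        int                 tap_fd;        /* TAP interface file descriptor */
        struct io_uring     ring;          /* shared SQ/CQ for this interface */
        struct rte_ring    *rx_refill;     /* RX refill requests: rte_rx() -> rte_tx() */
        struct rte_ring    *tx_completion; /* completed TX buffers: rte_rx() -> rte_tx() */
        struct rte_mempool *mb_pool;       /* mbuf pool for RX buffers */
};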

rte_rx() does the following:
- Read the Completion Queue.
- If RX CQE: pass the data to the next RX MBUF, convert the RX CQE to an RX 
  Refill SQE, and enqueue it in the RX Refill rte_ring.
- If TX CQE: enqueue it in the TX Completion rte_ring.
- Repeat until nb_pkts RX CQEs have been received, or no more CQEs are 
  available. (This complies with the rte_rx() API, which says that fewer than 
  nb_pkts packets are returned only when no more packets are available for 
  receiving.)
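
As a rough sketch (untested, building on the illustrative struct above; the 
function name, the low-bit tagging of the CQE user_data, and allocating the 
refill mbuf on the RX side are all just assumptions made for illustration):

/* Sketch of the first design's rte_rx(). Each SQE is assumed to carry the
 * mbuf pointer as user_data, with the low bit set for TX requests. */
static uint16_t
tap_ioring_rx_burst(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
        struct tap_ioring_port *port = rxq;
        struct io_uring_cqe *cqe;
        uint16_t nb_rx = 0;

        while (nb_rx < nb_pkts && io_uring_peek_cqe(&port->ring, &cqe) == 0) {
                uintptr_t data = (uintptr_t)io_uring_cqe_get_data(cqe);
                struct rte_mbuf *m = (struct rte_mbuf *)(data & ~(uintptr_t)1);

                if (data & 1) {
                        /* TX CQE: hand the finished mbuf back to rte_tx()
                         * via the TX Completion ring. */
                        rte_ring_enqueue(port->tx_completion, m);
                } else {
                        /* RX CQE: cqe->res is the number of bytes read.
                         * (Error handling omitted in this sketch.) */
                        m->pkt_len = m->data_len = (uint16_t)cqe->res;
                        rx_pkts[nb_rx++] = m;

                        /* Queue a refill request; rte_tx() turns it into a
                         * new read SQE. Here the "refill SQE" is simply a
                         * freshly allocated mbuf. */
                        struct rte_mbuf *refill = rte_pktmbuf_alloc(port->mb_pool);
                        if (refill != NULL)
                                rte_ring_enqueue(port->rx_refill, refill);
                }
                io_uring_cqe_seen(&port->ring, cqe);
        }
        return nb_rx;
}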

rte_tx() does the following:
- Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs in 
  the TX Completion rte_ring, and write them to the io_uring Submission Queue.
- Dequeue any RX Refill SQEs from the RX Refill rte_ring and write them to the 
  io_uring Submission Queue.
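
A matching sketch for the TX side (again untested and illustrative; the TX 
CQEs taken from the TX Completion rte_ring are interpreted here as "this mbuf 
has been written, free it", and the RX Refill entries are the mbufs allocated 
by rte_rx() above):

/* Sketch of the first design's rte_tx(): the sole writer of the submission
 * queue. It frees completed TX mbufs, submits new TX packets as write SQEs,
 * and flushes pending RX refills as read SQEs. */
static uint16_t
tap_ioring_tx_burst(void *txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
        struct tap_ioring_port *port = txq;
        struct io_uring_sqe *sqe;
        struct rte_mbuf *m;
        uint16_t nb_tx;

        /* TX completions reported by rte_rx(): free the transmitted mbufs. */
        while (rte_ring_dequeue(port->tx_completion, (void **)&m) == 0)
                rte_pktmbuf_free(m);

        /* New TX packets become write SQEs, tagged with the low bit. */
        for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
                sqe = io_uring_get_sqe(&port->ring);
                if (sqe == NULL)
                        break;  /* submission queue full */
                m = tx_pkts[nb_tx];
                io_uring_prep_write(sqe, port->tap_fd,
                                    rte_pktmbuf_mtod(m, void *),
                                    rte_pktmbuf_data_len(m), 0);
                io_uring_sqe_set_data(sqe, (void *)((uintptr_t)m | 1));
        }

        /* RX refills queued by rte_rx() become read SQEs. */
        while (rte_ring_dequeue(port->rx_refill, (void **)&m) == 0) {
                sqe = io_uring_get_sqe(&port->ring);
                if (sqe == NULL) {
                        rte_ring_enqueue(port->rx_refill, m); /* retry later */
                        break;
                }
                io_uring_prep_read(sqe, port->tap_fd,
                                   rte_pktmbuf_mtod(m, void *),
                                   rte_pktmbuf_tailroom(m), 0);
                io_uring_sqe_set_data(sqe, m);
        }

        io_uring_submit(&port->ring);
        return nb_tx;
}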

This means that the application must call both rte_rx() and rte_tx(); but it 
would be allowed to call rte_tx() with zero MBUFs.
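
E.g. a hypothetical application polling loop for the first design (port_id 
and the burst size of 32 are placeholders):

#include <rte_ethdev.h>

static void
app_poll_loop(uint16_t port_id)
{
        struct rte_mbuf *pkts[32];
        uint16_t nb_rx;

        for (;;) {
                nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);
                /* ... process/forward the nb_rx packets ... */
                (void)nb_rx;

                /* Call rte_tx() even with zero MBUFs, so that RX refills
                 * and TX completions keep being processed. */
                rte_eth_tx_burst(port_id, 0, pkts, 0);
        }
}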

The internal rte_rings are Single-Producer, Single-Consumer, and large enough 
to hold all TX+RX descriptors.
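
Setting up the two rings could look something like this (sketch; 
RING_F_EXACT_SZ is used because rte_ring sizes are otherwise rounded up to a 
power of two, and per-port unique ring names are omitted):

/* Illustrative ring setup: both rings are single-producer/single-consumer
 * and sized to hold all RX + TX descriptors, so enqueues never fail. */
static int
tap_ioring_rings_init(struct tap_ioring_port *port, unsigned int nb_rx_desc,
                      unsigned int nb_tx_desc, int socket_id)
{
        const unsigned int size = nb_rx_desc + nb_tx_desc;
        const unsigned int flags = RING_F_SP_ENQ | RING_F_SC_DEQ |
                                   RING_F_EXACT_SZ;

        port->rx_refill = rte_ring_create("tap_rx_refill", size,
                                          socket_id, flags);
        port->tx_completion = rte_ring_create("tap_tx_compl", size,
                                              socket_id, flags);
        return (port->rx_refill != NULL && port->tx_completion != NULL) ?
                0 : -1;
}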


Alternatively, we can let rte_rx() do all the work and use an rte_ring in the 
opposite direction...

The PMD must have two internal rte_rings, one for TX MBUFs and one for TX CQEs. 
(The latter can be a stack, or any other type of container.)

rte_tx() only does the following:
Enqueue the TX MBUFs to the TX MBUF rte_ring.
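
I.e. something as simple as this (sketch; it assumes an additional tx_mbufs 
rte_ring in the port structure, created without RING_F_SP_ENQ so that it is 
multi-producer):

/* Sketch of the second design's rte_tx(): just hand the mbufs to the
 * internal TX MBUF ring; rte_rx() does the actual submission. */
static uint16_t
tap_ioring_tx_burst_v2(void *txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
        struct tap_ioring_port *port = txq;

        return (uint16_t)rte_ring_enqueue_burst(port->tx_mbufs,
                                                (void **)tx_pkts, nb_pkts,
                                                NULL);
}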

rte_rx() does the following:
- Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX SQEs, 
  using the TX CQEs in the TX Completion rte_ring, and write them to the 
  io_uring Submission Queue.
- Read the Completion Queue.
- If TX CQE: enqueue it in the TX Completion rte_ring.
- If RX CQE: pass the data to the next RX MBUF, convert the RX CQE to an RX 
  Refill SQE, and write it to the io_uring Submission Queue.
- Repeat until nb_pkts RX CQEs have been received, or no more CQEs are 
  available. (This complies with the rte_rx() API, which says that fewer than 
  nb_pkts packets are returned only when no more packets are available for 
  receiving.)
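
A compact sketch of this variant (untested; the two submit helpers are 
hypothetical stand-ins for the submission code shown for the first design, 
and the TX Completion container is simplified to an immediate free of the 
transmitted mbuf):

/* Sketch of the second design's rte_rx(): it is both the only SQ writer
 * and the only CQ reader. */
static uint16_t
tap_ioring_rx_burst_v2(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
        struct tap_ioring_port *port = rxq;
        struct io_uring_cqe *cqe;
        struct rte_mbuf *m;
        uint16_t nb_rx = 0;

        /* 1. Drain the TX MBUF ring and turn the mbufs into write SQEs. */
        while (rte_ring_dequeue(port->tx_mbufs, (void **)&m) == 0)
                tap_ioring_submit_tx(port, m);          /* hypothetical */

        /* 2. Reap CQEs: free completed TX mbufs, deliver RX packets and
         *    immediately resubmit refill read SQEs. */
        while (nb_rx < nb_pkts && io_uring_peek_cqe(&port->ring, &cqe) == 0) {
                uintptr_t data = (uintptr_t)io_uring_cqe_get_data(cqe);

                m = (struct rte_mbuf *)(data & ~(uintptr_t)1);
                if (data & 1) {
                        rte_pktmbuf_free(m);            /* TX completion */
                } else {
                        m->pkt_len = m->data_len = (uint16_t)cqe->res;
                        rx_pkts[nb_rx++] = m;
                        tap_ioring_submit_rx_refill(port); /* hypothetical */
                }
                io_uring_cqe_seen(&port->ring, cqe);
        }

        io_uring_submit(&port->ring);
        return nb_rx;
}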

With the second design, the PMD can support multiple TX queues by using a 
Multi-Producer rte_ring for the TX MBUFs.
But it postpones all transmits until rte_rx() is called, so I don't really like 
it.

Of the two designs, the first feels more natural to me.
And if some application absolutely needs multiple TX queues, it can implement a 
Multi-Producer, Single-Consumer rte_ring as an intermediate step in front of 
the PMD's single TX queue.
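
Such an application-side workaround could look roughly like this (sketch; 
names and sizes are placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define STAGE_RING_SIZE 4096
#define TX_BURST 32

static struct rte_ring *tx_stage;

static int
app_tx_stage_init(void)
{
        /* Multi-producer enqueue (default), single-consumer dequeue. */
        tx_stage = rte_ring_create("app_tx_stage", STAGE_RING_SIZE,
                                   SOCKET_ID_ANY, RING_F_SC_DEQ);
        return tx_stage != NULL ? 0 : -1;
}

/* Any worker thread may call this instead of rte_eth_tx_burst(). */
static inline unsigned int
app_tx_stage_burst(struct rte_mbuf **pkts, unsigned int n)
{
        return rte_ring_enqueue_burst(tx_stage, (void **)pkts, n, NULL);
}

/* Exactly one thread drains the staging ring into the PMD's single TX queue. */
static inline void
app_tx_drain(uint16_t port_id)
{
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int n, sent;

        n = rte_ring_dequeue_burst(tx_stage, (void **)pkts, TX_BURST, NULL);
        sent = rte_eth_tx_burst(port_id, 0, pkts, n);
        while (sent < n)        /* packets not accepted are dropped here */
                rte_pktmbuf_free(pkts[sent++]);
}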
