> > > > > > Probably the hardest part of using io_uring is figuring out how
> > > > > > to collect completions. The simplest way would be to handle all
> > > > > > completions, rx and tx, in the rx_burst function.
> > > > >
> > > > > Please don't mix RX and TX, unless explicitly requested by the
> > > > > application through the recently introduced "mbuf recycle" feature.
> > > >
> > > > The issue is that Rx and Tx share a single fd, and io_uring
> > > > completion is per fd.
> > > > The implementation for io_uring came from the storage side, so
> > > > initially it was for fixing the broken Linux AIO support.
> > > >
> > > > Some other devices only have a single interrupt or ring shared
> > > > between rx and tx, so this is not unique.
> > > > Virtio, netvsc, and some NICs.
> > > >
> > > > The problem is that if Tx completes descriptors then there needs to
> > > > be locking to prevent the Rx thread and Tx thread overlapping. And a
> > > > spin lock is a performance buzz kill.
> > >
> > > Brainstorming a bit here...
> > > What if the new TAP io_uring PMD is designed to use two io_urings per
> > > port, one for RX and another one for TX on the same TAP interface?
> > > This requires that a TAP interface can be referenced via two file
> > > descriptors (one fd for the RX io_uring and another fd for the TX
> > > io_uring), e.g. by using dup() to create the additional file
> > > descriptor. I don't know if this is possible, and if it works with
> > > io_uring.
> >
> > There are a couple of problems with multiple fds.
> > - multiple fds pointing to the same internal tap queue are not going to
> >   get completed separately.
> > - when multi-proc is supported, the limit of 253 fds in Unix domain IPC
> >   comes into play.
> > - tap does not support a tx-only fd for queues. If the fd is a queue of
> >   the tap, receive fan-out will go to it.
> >
> > If DPDK was more flexible, harvesting of completions could be done via
> > another thread, but that is not general enough to work transparently
> > with all applications. The existing TAP device plays with SIGIO, but
> > signals are slower.
>
> I have now read up a bit about io_uring, so here are some thoughts and
> ideas...
>
> To avoid locking, there should only be one writer of io_uring Submission
> Queue Entries (SQE), and only one reader of io_uring Completion Queue
> Events (CQE) per TAP interface.
>
> From what I understand, the TAP io_uring PMD only supports one RX queue
> per port and one TX queue per port (i.e. per TAP interface). We can take
> advantage of this:
>
> We can use rte_tx() as the Submission Queue writer and rte_rx() as the
> Completion Queue reader.
>
> The PMD must have two internal rte_rings, for RX refill and TX completion
> events respectively.
>
> rte_rx() does the following:
> - Read the Completion Queue;
> - If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an
>   RX Refill SQE and enqueue it in the RX Refill rte_ring;
> - If TX CQE, enqueue it in the TX Completion rte_ring;
> - Repeat until nb_pkts RX CQEs have been received, or no more CQEs are
>   available. (This complies with the rte_rx() API, which says that fewer
>   than nb_pkts is only returned if no more packets are available for
>   receiving.)
>
> rte_tx() does the following:
> - Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs
>   in the TX Completion rte_ring, and write them to the io_uring
>   Submission Queue.
> - Dequeue any RX Refill SQEs from the RX Refill rte_ring and write them
>   to the io_uring Submission Queue.
>
> This means that the application must call both rte_rx() and rte_tx(); but
> it would be allowed to call rte_tx() with zero MBUFs.
>
> The internal rte_rings are Single-Producer, Single-Consumer, and large
> enough to hold all TX+RX descriptors.
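
To check that I read the first design correctly, here is a toy model of it in
plain C. It is only a sketch under loud assumptions: there is no real io_uring
or DPDK in it, the SQ and CQ are modelled as simple FIFOs, model_kernel()
pretends every submitted SQE completes at once, and all names (struct port,
port_rx, port_tx, rx_refill, tx_done) are made up for illustration. The rings
are single-threaded stand-ins; a real SPSC rte_ring adds the memory ordering.

```c
#include <assert.h>

#define RING_SZ 8u /* power of two; must hold all TX + RX descriptors */

enum op { OP_RX, OP_TX };
struct entry { enum op op; int buf; }; /* stands in for an SQE or a CQE */

/* Toy FIFO standing in for an SPSC rte_ring (hypothetical, no barriers). */
struct ring { struct entry e[RING_SZ]; unsigned head, tail; };

int ring_put(struct ring *r, struct entry v)
{
    if (r->head - r->tail == RING_SZ)
        return -1; /* full */
    r->e[r->head++ % RING_SZ] = v;
    return 0;
}

int ring_get(struct ring *r, struct entry *v)
{
    if (r->head == r->tail)
        return -1; /* empty */
    *v = r->e[r->tail++ % RING_SZ];
    return 0;
}

struct port {
    struct ring sq, cq;    /* model of the io_uring SQ and CQ */
    struct ring rx_refill; /* rx -> tx: RX refill SQEs */
    struct ring tx_done;   /* rx -> tx: completed (free) TX descriptors */
};

void port_init(struct port *p, unsigned n_tx, unsigned n_rx)
{
    assert(n_tx + n_rx <= RING_SZ);
    *p = (struct port){0};
    for (unsigned i = 0; i < n_tx; i++) /* all TX descriptors start free */
        ring_put(&p->tx_done, (struct entry){ OP_TX, -1 });
    for (unsigned i = 0; i < n_rx; i++) /* RX buffers are posted up front */
        ring_put(&p->sq, (struct entry){ OP_RX, 100 + (int)i });
}

/* Pretend kernel: completes every submitted SQE, in order. */
void model_kernel(struct port *p)
{
    struct entry e;
    while (ring_get(&p->sq, &e) == 0)
        ring_put(&p->cq, e);
}

/* The only CQ reader. Never touches the SQ. */
unsigned port_rx(struct port *p, int *pkts, unsigned nb_pkts)
{
    unsigned n = 0;
    struct entry c;
    while (n < nb_pkts && ring_get(&p->cq, &c) == 0) {
        if (c.op == OP_RX) {
            pkts[n++] = c.buf;            /* hand the data to the caller */
            ring_put(&p->rx_refill, c);   /* leave the refill to port_tx */
        } else {
            ring_put(&p->tx_done, c);     /* TX descriptor is free again */
        }
    }
    return n;
}

/* The only SQ writer. Never touches the CQ. */
unsigned port_tx(struct port *p, const int *pkts, unsigned nb_pkts)
{
    unsigned n = 0;
    struct entry d;
    while (n < nb_pkts && ring_get(&p->tx_done, &d) == 0) {
        d.op = OP_TX;
        d.buf = pkts[n++];
        ring_put(&p->sq, d);
    }
    while (ring_get(&p->rx_refill, &d) == 0) /* also with nb_pkts == 0 */
        ring_put(&p->sq, d);
    return n;
}
```

With two TX and two RX descriptors, port_tx() accepts at most two packets until
port_rx() has harvested the TX completions, and RX buffers only get reposted on
the next port_tx() call, which is why the application has to keep calling both.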
>
> Alternatively, we can let rte_rx() do all the work and use an rte_ring in
> the opposite direction...
>
> The PMD must have two internal rte_rings, one for TX MBUFs and one for TX
> CQEs. (The latter can be a stack, or any other type of container.)
>
> rte_tx() only does the following:
> - Enqueue the TX MBUFs to the TX MBUF rte_ring.
>
> rte_rx() does the following:
> - Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX
>   SQEs, using the TX CQEs in the TX Completion rte_ring, and write them
>   to the io_uring Submission Queue.
> - Read the Completion Queue;
> - If TX CQE, enqueue it in the TX Completion rte_ring;
> - If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an
>   RX Refill SQE and write it to the io_uring Submission Queue;
> - Repeat until nb_pkts RX CQEs have been received, or no more CQEs are
>   available. (This complies with the rte_rx() API, which says that fewer
>   than nb_pkts is only returned if no more packets are available for
>   receiving.)
>
> With the second design, the PMD can support multiple TX queues by using a
> Multi-Producer rte_ring for the TX MBUFs.
> But it postpones all transmits until rte_rx() is called, so I don't
> really like it.
>
> Of the two designs, the first feels more natural to me.
> And if some application absolutely needs multiple TX queues, it can
> implement a Multi-Producer, Single-Consumer rte_ring as an intermediate
> step in front of the PMD's single TX queue.

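The second design can be modelled the same way. Again this is only a sketch
in plain C with no real io_uring or DPDK behind it: the queues are toy FIFOs,
model_kernel() completes everything that was submitted, and the names (struct
port, port_rx, port_tx, tx_mbufs, tx_done) are hypothetical. What it is meant
to show is that port_rx() becomes the only code touching both the SQ and the
CQ, and that nothing is submitted for transmit until port_rx() runs.

```c
#include <assert.h>

#define RING_SZ 8u /* power of two; must hold all TX + RX descriptors */

enum op { OP_RX, OP_TX };
struct entry { enum op op; int buf; }; /* SQE, CQE or queued TX packet */

/* Toy FIFO standing in for an rte_ring (hypothetical, single-threaded). */
struct ring { struct entry e[RING_SZ]; unsigned head, tail; };

unsigned ring_count(const struct ring *r) { return r->head - r->tail; }

int ring_put(struct ring *r, struct entry v)
{
    if (ring_count(r) == RING_SZ)
        return -1; /* full */
    r->e[r->head++ % RING_SZ] = v;
    return 0;
}

int ring_get(struct ring *r, struct entry *v)
{
    if (ring_count(r) == 0)
        return -1; /* empty */
    *v = r->e[r->tail++ % RING_SZ];
    return 0;
}

struct port {
    struct ring sq, cq;   /* model of the io_uring SQ and CQ */
    struct ring tx_mbufs; /* tx -> rx: packets waiting to be submitted */
    struct ring tx_done;  /* private to rx: free TX descriptors */
};

void port_init(struct port *p, unsigned n_tx, unsigned n_rx)
{
    assert(n_tx + n_rx <= RING_SZ);
    *p = (struct port){0};
    for (unsigned i = 0; i < n_tx; i++)
        ring_put(&p->tx_done, (struct entry){ OP_TX, -1 });
    for (unsigned i = 0; i < n_rx; i++)
        ring_put(&p->sq, (struct entry){ OP_RX, 100 + (int)i });
}

/* Pretend kernel: completes every submitted SQE, in order. */
void model_kernel(struct port *p)
{
    struct entry e;
    while (ring_get(&p->sq, &e) == 0)
        ring_put(&p->cq, e);
}

/* TX only queues packets; it touches neither the SQ nor the CQ. */
unsigned port_tx(struct port *p, const int *pkts, unsigned nb_pkts)
{
    unsigned n = 0;
    while (n < nb_pkts &&
           ring_put(&p->tx_mbufs, (struct entry){ OP_TX, pkts[n] }) == 0)
        n++;
    return n;
}

/* RX is the only SQ writer and the only CQ reader. */
unsigned port_rx(struct port *p, int *pkts, unsigned nb_pkts)
{
    unsigned n = 0;
    struct entry m, d;
    /* Submit queued TX packets while free TX descriptors remain. */
    while (ring_count(&p->tx_done) > 0 && ring_get(&p->tx_mbufs, &m) == 0) {
        ring_get(&p->tx_done, &d);
        d.buf = m.buf;
        ring_put(&p->sq, d);
    }
    while (n < nb_pkts && ring_get(&p->cq, &d) == 0) {
        if (d.op == OP_RX) {
            pkts[n++] = d.buf;
            ring_put(&p->sq, d);      /* refill goes straight to the SQ */
        } else {
            ring_put(&p->tx_done, d); /* TX descriptor is free again */
        }
    }
    return n;
}
```

In this model the SQ still holds only the two RX entries right after port_tx()
returns; the TX entries appear on the next port_rx() call. That is the
postponed transmit the quoted text objects to.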
And why can't we simply have two io_urings: one for RX ops and a second for TX ops?