On 04/07/2024 17:12, Johannes Berg wrote:
On Thu, 2024-07-04 at 13:22 +0100, anton.iva...@cambridgegreys.com
wrote:
@@ -675,11 +657,20 @@ static void prep_queue_for_rx(struct vector_queue *qi)
struct vector_private *vp = netdev_priv(qi->dev);
struct mmsghdr *mmsg_vector = qi->mmsg_vector;
void **skbuff_vector = qi->skbuff_vector;
- int i;
+ int i, queue_depth;
+
+ queue_depth = atomic_read(&qi->queue_depth);
- if (qi->queue_depth == 0)
+ if (queue_depth == 0)
return;
- for (i = 0; i < qi->queue_depth; i++) {
+
+ /* RX is always emptied 100% during each cycle, so we do not
+ * want to do the tail wraparound math for it
+ */
+
+ qi->head = qi->tail = 0;
+
+ for (i = 0; i < queue_depth; i++) {
/* it is OK if allocation fails - recvmmsg with NULL data in
* iov argument still performs an RX, just drops the packet
* This allows us stop faffing around with a "drop buffer"
@@ -689,7 +680,8 @@ static void prep_queue_for_rx(struct vector_queue *qi)
skbuff_vector++;
mmsg_vector++;
}
- qi->queue_depth = 0;
+ wmb(); /* Ensure that RX processing elsewhere sees the changes */
+ atomic_set(&qi->queue_depth, 0);
}
I don't understand this.
prep_queue_for_rx() is called by vector_mmsg_rx(), not in parallel or in
a different thread or something, inside the NAPI polling. All it does is
reset the queue to empty with all the SKBs allocated.
After that, prep_queue_for_rx() calls uml_vector_recvmmsg() [1] which
fills the SKBs, and then vector_mmsg_rx() itself consumes them by going
over it and calling napi_gro_receive() (or whatever else).
There's no parallelism here? The RX queue wouldn't even need locking or
atomic instructions at all at this point, since it's just "refill
buffers, fill buffers, release buffers to network stack".
What am I missing?
You are not missing anything.
The rx and tx are using the same infra to map vectors to skbs arrays.
I can make the RX use a set of lockless and atomicless functions, but
this means duplicating some of the code.
[1] btw, it should pass 'budget' to uml_vector_recvmmsg(), not the depth
(qi->max_depth), and possibly even pass that to prep_queue_for_rx(), but
that's a different patch.
Good point. Let's sort out the locking and parallelism first.
static struct vector_device *find_device(int n)
@@ -987,7 +979,7 @@ static int vector_mmsg_rx(struct vector_private *vp, int
budget)
* many do we need to prep the next time prep_queue_for_rx() is called.
*/
- qi->queue_depth = packet_count;
+ atomic_add(packet_count, &qi->queue_depth);
and that also means my earlier analysis was just confused - there's no
need for _add here, _set is fine - in fact even non-atomic would be
fine.
@@ -1176,6 +1168,7 @@ static int vector_poll(struct napi_struct *napi, int
budget)
if ((vp->options & VECTOR_TX) != 0)
tx_enqueued = (vector_send(vp->tx_queue) > 0);
+ spin_lock(&vp->rx_queue->head_lock);
if ((vp->options & VECTOR_RX) > 0)
err = vector_mmsg_rx(vp, budget);
else {
@@ -1183,6 +1176,7 @@ static int vector_poll(struct napi_struct *napi, int
budget)
if (err > 0)
err = 1;
}
+ spin_unlock(&vp->rx_queue->head_lock);
that's needed for ... stats?
Yes, unfortunately.
I am reusing the lock.
And that one may be accessed in parallel :(
johannes
--
Anton R. Ivanov
https://www.kot-begemot.co.uk/