Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx

Ferruh Yigit Mon, 20 Sep 2021 10:11:54 -0700

On 9/6/2021 11:23 AM, Tudor Cornea wrote:
> Hi Ferruh,
> 
> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
> 
> 
> Agreed. There are two issues solved by this patch. We will break it in two
> different patches.
> 
> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
> 
> 
> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
> 
> 
> We've seen the timestamping issue only when running af_packet over
> veth interfaces. We have a particular use-case internally, in which we need
> to run inside a Kubernetes cluster.
> We've found the following resources [1] , [2] related to this behavior in
> the kernel.
> 
> We believe that issue #2 (the ring getting full), can theoretically occur
> on any type of NIC.
> We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3
> interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the
> interface on / off
> ifconfig $iface_name down; sleep 10; ifconfig $iface_name up
> 
> We will attempt to give more context on the issue below, about what we
> think happens:
> - we have a 2048 queue shared between the kernel and the dpdk socket,
> there's an index the queue in both the kernel and the dpdk driver
> - the dpdk driver writes a packet or a burst, advances its idx and tells
> the kernel to send the packets via a call to sendto() and the kernel sends
> the packets and advances its idx
> - once the interface is down the kernel can no longer send packets, but it
> doesn't drop them, it just doesn't advance its idx
> - for each packet there is header and in the header there is a status
> integer which, among others, indicates the owner of the packet: the
> userspace or the kernel - the userspace (dpdk driver) sets the status as
> owned by the kernel when it adds another packet ; the kernel sets the
> status back to owned by the userspace once it sends a packet
> - the dpdk driver was ignoring this status bit and,  even after the queue
> was full, it would continue to put packets in the queue - its idx would be
> "after" that of the kernel
> - once the interface is brought up, the kernel would send all the packets
> in the queue (since they have the status of being owned by the kernel) on
> the next call to sendto() and the idx would be back to where it was before
> the interface was brought up (let's call it k1)
> - the dpdk driver idx into the queue would point somewhere in the queue
> (let's call it d1) and would continue to add packets at that point, but the
> kernel wouldn't send any packet anymore since there is now a gap of packets
> owned by the userspace between the kernel index (k1) and the dpdk driver
> idx (d1)
> - the dpdk idx would eventually reach k1 and packets would be transferred
> at a normal rate until both the dpdk idx and the kernel idx would reach d1
> again
> - between d1 and k1 there are only packets with the status as owned by the
> kernel - which where added by the dpdk driver while its index was between
> d1 and k1 ; thus the kernel would burst all the packets till k1, while the
> dpdk idx is at d1
> - the cycle repeats
> 
> If a new traffic config comes (in our application) while this cycle is
> happening, it could be that some of the packets of the old config are still
> in queue (between d1 and k1) and will be bursted when the dpdk and kernel
> idx reach d1 ; this would explain seeing packets from an old config, but
> only in the first 2048 packets (which is the queue size)
> 
>


Hi Tudor,

If there is an usage on of veth, OK to fix the timestamps issue.

What you described above looks like a ring buffer with single producer and
single consumer, and producer overwrites the not consumed items.

I assume this happens because af_packet (consumer) can't send the packets
because of the timestamp defect. (Also producer (dpdk app) should have checks to
prevent overwrite, but that is a different issue.)

I will comment to the new versions of the patches.

Our of curiosity, are you using an modified af_packet implementation in kernel
for above described usage?

> [1] https://www.spinics.net/lists/kernel/msg3959391.html
> [2] https://www.spinics.net/lists/netdev/msg739372.html
> 
> On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <[email protected]> wrote:
> 
>> On 8/20/2021 2:39 PM, Tudor Cornea wrote:
>>> The poll call can return POLLERR which is ignored, or it can return
>>> POLLOUT, even if there are no free frames in the mmap-ed area.
>>>
>>> We can account for both of these cases by re-checking if the next
>>> frame is empty before writing into it.
>>>
>>> We also now eliminate the timestamp status from the frame status.
>>>
>>
>> Hi Tudor,
>>
>> Would you mind separate timestamp status fix to its own patch? I think
>> better to
>> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp
>> patch.
>>
>>> Signed-off-by: Mihai Pogonaru <[email protected]>
>>> Signed-off-by: Tudor Cornea <[email protected]>
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 47
>> +++++++++++++++++++++++++++++--
>>>  1 file changed, 45 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index b73b211..3845df5 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>       return num_rx;
>>>  }
>>>
>>> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status)
>>> +{
>>> +     return *tp_status &
>>> +             ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
>>
>> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the
>> kernel
>> versions the bug exists, this flag is not set, but can you please confirm?
>>
>>> +}
>>> +
>>>  /*
>>>   * Callback to handle sending packets through a real NIC.
>>>   */
>>> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>                       }
>>>               }
>>>
>>> +             /*
>>> +              * We must eliminate the timestamp status from the packet
>>> +              * status. This should only matter if timestamping is
>> enabled
>>> +              * on the socket, but there is a BUG in the kernel which is
>>> +              * fixed in newer releases.
>>> +
>>> +              * For interfaces of type 'veth', the sent skb is forwarded
>>> +              * to the peer and back into the network stack which
>> timestamps
>>> +              * it on the RX path if timestamping is enabled globally
>>> +              * (which happens if any socket enables timestamping).
>>> +
>>> +              * When the skb is destructed, tpacket_destruct_skb() is
>> called
>>> +              * and it calls __packet_set_timestamp() which doesn't
>> check
>>> +              * the flags on the socket and returns the timestamp if it
>> is
>>> +              * set in the skb (and for veth it is, as mentioned above).
>>> +              */
>>> +
>>
>> Can you give some more details on this bug, any link etc..
>>
>> And does it only seen with veth, if so I wonder if we can ignore it, not
>> sure
>> how common to use af_packet PMD over veth interface, do you have this
>> usecase?
>>
>> And if only specific kernel versions impacted from this bug, what do you
>> think
>> about adding kernel version check to reduce the scope of the fix in
>> af_packet PMD.
>>
>>>               /* point at the next incoming frame */
>>> -             if ((ppd->tp_status != TP_STATUS_AVAILABLE) &&
>>> -                 (poll(&pfd, 1, -1) < 0))
>>> +             if ((tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0))
>>> +                     break;
>>> +
>>> +             /*
>>> +              * Poll can return POLLERR if the interface is down or
>> POLLOUT,
>>> +              * even if there are no extra buffers available.
>>> +              * This happens, because packet_poll() calls
>> datagram_poll()
>>> +              * which checks the space left in the socket buffer and in
>> the
>>> +              * case of packet_mmap the default socket buffer length
>>> +              * doesn't match the requested size for the tx_ring so
>> there
>>> +              * is always space left in socket buffer, which doesn't
>> seem
>>> +              * to be correlated to the requested size for the tx_ring
>>> +              * in packet_mmap.
>>> +              */
>>> +             if (tx_ring_status_remove_ts(&ppd->tp_status)
>>> +                     != TP_STATUS_AVAILABLE)
>>>                       break;
>>
>> In this case should we break or poll again?
>>
>> If 'POLLERR' is received when interface is down, it makes sense to break.
>> Do you
>> know if is there any other case that 'POLLERR' is returned?
>>
>> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received,
>> can we
>> poll again to wait Tx ring available to send more packets?
>>
>>>
>>>               /* copy the tx frame data */
>>> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf
>> **bufs, uint16_t nb_pkts)
>>>               rte_pktmbuf_free(mbuf);
>>>       }
>>>
>>> +     /*
>>> +      * We might have to ignore a few more errnos here since the packets
>>> +      * remain in the mmap-ed queue and will be sent later, presumably.
>>> +      */
>>> +
>>
>> Can you please describe above comment more? What errors ignored?
>> And won't all packets sent will below sendto() call?
>>
>>>       /* kick-off transmits */
>>>       if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
>>>                       errno != ENOBUFS && errno != EAGAIN) {
>>>
>>
>>

Re: [dpdk-dev] [PATCH] net/af_packet: fix ignoring full ring on tx

Reply via email to