On 9/6/2021 11:23 AM, Tudor Cornea wrote: > Hi Ferruh, > > Would you mind separate timestamp status fix to its own patch? I think >> better to >> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp >> patch. > > > Agreed. There are two issues solved by this patch. We will break it in two > different patches. > > I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the >> kernel >> versions the bug exists, this flag is not set, but can you please confirm? > > > And does it only seen with veth, if so I wonder if we can ignore it, not >> sure >> how common to use af_packet PMD over veth interface, do you have this >> usecase? > > > We've seen the timestamping issue only when running af_packet over > veth interfaces. We have a particular use-case internally, in which we need > to run inside a Kubernetes cluster. > We've found the following resources [1] , [2] related to this behavior in > the kernel. > > We believe that issue #2 (the ring getting full), can theoretically occur > on any type of NIC. > We managed to reproduce the bursty behavior on af_packet PMD over vmxnet3 > interface, by Tx-ing packets at a low rate (e.g ~340 pps), and toggling the > interface on / off > ifconfig $iface_name down; sleep 10; ifconfig $iface_name up > > We will attempt to give more context on the issue below, about what we > think happens: > - we have a 2048 queue shared between the kernel and the dpdk socket, > there's an index the queue in both the kernel and the dpdk driver > - the dpdk driver writes a packet or a burst, advances its idx and tells > the kernel to send the packets via a call to sendto() and the kernel sends > the packets and advances its idx > - once the interface is down the kernel can no longer send packets, but it > doesn't drop them, it just doesn't advance its idx > - for each packet there is header and in the header there is a status > integer which, among others, indicates the owner of the packet: the > userspace or the kernel - the userspace (dpdk driver) sets the status as > owned by the kernel when it adds another packet ; the kernel sets the > status back to owned by the userspace once it sends a packet > - the dpdk driver was ignoring this status bit and, even after the queue > was full, it would continue to put packets in the queue - its idx would be > "after" that of the kernel > - once the interface is brought up, the kernel would send all the packets > in the queue (since they have the status of being owned by the kernel) on > the next call to sendto() and the idx would be back to where it was before > the interface was brought up (let's call it k1) > - the dpdk driver idx into the queue would point somewhere in the queue > (let's call it d1) and would continue to add packets at that point, but the > kernel wouldn't send any packet anymore since there is now a gap of packets > owned by the userspace between the kernel index (k1) and the dpdk driver > idx (d1) > - the dpdk idx would eventually reach k1 and packets would be transferred > at a normal rate until both the dpdk idx and the kernel idx would reach d1 > again > - between d1 and k1 there are only packets with the status as owned by the > kernel - which where added by the dpdk driver while its index was between > d1 and k1 ; thus the kernel would burst all the packets till k1, while the > dpdk idx is at d1 > - the cycle repeats > > If a new traffic config comes (in our application) while this cycle is > happening, it could be that some of the packets of the old config are still > in queue (between d1 and k1) and will be bursted when the dpdk and kernel > idx reach d1 ; this would explain seeing packets from an old config, but > only in the first 2048 packets (which is the queue size) > >
Hi Tudor, If there is an usage on of veth, OK to fix the timestamps issue. What you described above looks like a ring buffer with single producer and single consumer, and producer overwrites the not consumed items. I assume this happens because af_packet (consumer) can't send the packets because of the timestamp defect. (Also producer (dpdk app) should have checks to prevent overwrite, but that is a different issue.) I will comment to the new versions of the patches. Our of curiosity, are you using an modified af_packet implementation in kernel for above described usage? > [1] https://www.spinics.net/lists/kernel/msg3959391.html > [2] https://www.spinics.net/lists/netdev/msg739372.html > > On Wed, 1 Sept 2021 at 19:34, Ferruh Yigit <ferruh.yi...@intel.com> wrote: > >> On 8/20/2021 2:39 PM, Tudor Cornea wrote: >>> The poll call can return POLLERR which is ignored, or it can return >>> POLLOUT, even if there are no free frames in the mmap-ed area. >>> >>> We can account for both of these cases by re-checking if the next >>> frame is empty before writing into it. >>> >>> We also now eliminate the timestamp status from the frame status. >>> >> >> Hi Tudor, >> >> Would you mind separate timestamp status fix to its own patch? I think >> better to >> fix 'ignoring full Tx ring' first, to not make it dependent to timestamp >> patch. >> >>> Signed-off-by: Mihai Pogonaru <pogonarumi...@gmail.com> >>> Signed-off-by: Tudor Cornea <tudor.cor...@gmail.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 47 >> +++++++++++++++++++++++++++++-- >>> 1 file changed, 45 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >> b/drivers/net/af_packet/rte_eth_af_packet.c >>> index b73b211..3845df5 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -167,6 +167,12 @@ eth_af_packet_rx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> return num_rx; >>> } >>> >>> +static inline __u32 tx_ring_status_remove_ts(volatile __u32 *tp_status) >>> +{ >>> + return *tp_status & >>> + ~(TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE); >> >> I can see 'TP_STATUS_TS_SYS_HARDWARE' is deprecated, and I assume in the >> kernel >> versions the bug exists, this flag is not set, but can you please confirm? >> >>> +} >>> + >>> /* >>> * Callback to handle sending packets through a real NIC. >>> */ >>> @@ -211,9 +217,41 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> } >>> } >>> >>> + /* >>> + * We must eliminate the timestamp status from the packet >>> + * status. This should only matter if timestamping is >> enabled >>> + * on the socket, but there is a BUG in the kernel which is >>> + * fixed in newer releases. >>> + >>> + * For interfaces of type 'veth', the sent skb is forwarded >>> + * to the peer and back into the network stack which >> timestamps >>> + * it on the RX path if timestamping is enabled globally >>> + * (which happens if any socket enables timestamping). >>> + >>> + * When the skb is destructed, tpacket_destruct_skb() is >> called >>> + * and it calls __packet_set_timestamp() which doesn't >> check >>> + * the flags on the socket and returns the timestamp if it >> is >>> + * set in the skb (and for veth it is, as mentioned above). >>> + */ >>> + >> >> Can you give some more details on this bug, any link etc.. >> >> And does it only seen with veth, if so I wonder if we can ignore it, not >> sure >> how common to use af_packet PMD over veth interface, do you have this >> usecase? >> >> And if only specific kernel versions impacted from this bug, what do you >> think >> about adding kernel version check to reduce the scope of the fix in >> af_packet PMD. >> >>> /* point at the next incoming frame */ >>> - if ((ppd->tp_status != TP_STATUS_AVAILABLE) && >>> - (poll(&pfd, 1, -1) < 0)) >>> + if ((tx_ring_status_remove_ts(&ppd->tp_status) >>> + != TP_STATUS_AVAILABLE) && (poll(&pfd, 1, -1) < 0)) >>> + break; >>> + >>> + /* >>> + * Poll can return POLLERR if the interface is down or >> POLLOUT, >>> + * even if there are no extra buffers available. >>> + * This happens, because packet_poll() calls >> datagram_poll() >>> + * which checks the space left in the socket buffer and in >> the >>> + * case of packet_mmap the default socket buffer length >>> + * doesn't match the requested size for the tx_ring so >> there >>> + * is always space left in socket buffer, which doesn't >> seem >>> + * to be correlated to the requested size for the tx_ring >>> + * in packet_mmap. >>> + */ >>> + if (tx_ring_status_remove_ts(&ppd->tp_status) >>> + != TP_STATUS_AVAILABLE) >>> break; >> >> In this case should we break or poll again? >> >> If 'POLLERR' is received when interface is down, it makes sense to break. >> Do you >> know if is there any other case that 'POLLERR' is returned? >> >> And for 'POLLOUT', when exactly this event sent? If 'POLLOUT' received, >> can we >> poll again to wait Tx ring available to send more packets? >> >>> >>> /* copy the tx frame data */ >>> @@ -242,6 +280,11 @@ eth_af_packet_tx(void *queue, struct rte_mbuf >> **bufs, uint16_t nb_pkts) >>> rte_pktmbuf_free(mbuf); >>> } >>> >>> + /* >>> + * We might have to ignore a few more errnos here since the packets >>> + * remain in the mmap-ed queue and will be sent later, presumably. >>> + */ >>> + >> >> Can you please describe above comment more? What errors ignored? >> And won't all packets sent will below sendto() call? >> >>> /* kick-off transmits */ >>> if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && >>> errno != ENOBUFS && errno != EAGAIN) { >>> >> >>