Hi, Can someone help me figure out a performance issue that seems to be caused by an RTL8168g/8111g NIC that seems to be batching up transmissions - or, at least, not starting immediately that it's given something to transmit?
The setup that I'm dealing with is an AFS filesystem client (the test machine with the troublesome NIC) and an AFS server machine connected by GigE through a switch by 2m cables. The network is pretty much uncontended and the server is only serving files to the test machine. AFS uses the RxRPC protocol which runs over UDP, one RxRPC packet per UDP packet. RxRPC has two packet types that are of relevance to this issue: DATA and ACK. Linux contains an implementation of RxRPC as the AF_RXRPC socket family and an AFS filesystem that can be found in fs/afs/. The symptoms I'm seeing are that whilst the client is doing a multi-megabyte direct-I/O reads over AFS, the server is occasionally stalling in its sending of DATA packets because the ACK packets being sent by the client are appearing in batches on the server. I stuck some tracepoints into the r8169 driver to investigate the issue (see attached patch). Here's an excerpt from the trace (note that I didn't enable the rx tracepoint): dd-3179 [000] .N.3 37.095116: net_rtl8169_tx: enp3s0 p=178-179 skb=ba987d36 nd-2961 [001] d.h1 37.095250: net_rtl8169_interrupt: enp3s0 st=81 nd-2961 [001] ..s1 37.095253: net_rtl8169_poll: enp3s0 st=81 dd-3179 [000] ...3 37.095286: net_rtl8169_tx: enp3s0 p=178-180 skb=44d352cf dd-3179 [000] .N.3 37.095307: net_rtl8169_tx: enp3s0 p=178-181 skb=53ac88f1 dd-3179 [000] .N.3 37.095315: net_rtl8169_tx: enp3s0 p=178-182 skb=4f3eedd9 dd-3179 [000] .N.3 37.095328: net_rtl8169_tx: enp3s0 p=178-183 skb=38c81784 dd-3179 [000] .N.3 37.095338: net_rtl8169_tx: enp3s0 p=178-184 skb=1f0a8fc3 dd-3179 [000] .N.3 37.095345: net_rtl8169_tx: enp3s0 p=178-185 skb=281a484d dd-3179 [000] .N.3 37.095362: net_rtl8169_tx: enp3s0 p=178-186 skb=13d8a01a dd-3179 [000] .N.3 37.095370: net_rtl8169_tx: enp3s0 p=178-187 skb=68fe3d70 dd-3179 [000] .N.3 37.095382: net_rtl8169_tx: enp3s0 p=178-188 skb=3cf64dd8 dd-3179 [000] .N.3 37.095390: net_rtl8169_tx: enp3s0 p=178-189 skb=da35591a dd-3179 [000] .N.3 37.095403: net_rtl8169_tx: enp3s0 p=178-190 skb=3013974a dd-3179 [000] .N.3 37.095410: net_rtl8169_tx: enp3s0 p=178-191 skb=7ecddd8e dd-3179 [000] .N.3 37.095427: net_rtl8169_tx: enp3s0 p=178-192 skb=aa30c686 nd-2961 [001] d.h1 37.095433: net_rtl8169_interrupt: enp3s0 st=85 nd-2961 [001] ..s1 37.095434: net_rtl8169_poll: enp3s0 st=85 dd-3179 [000] .N.3 37.095435: net_rtl8169_tx: enp3s0 p=178-193 skb=006b0947 nd-2961 [001] ..s1 37.095439: net_rtl8169_tx_done: enp3s0 p=178 skb=ba987d36 nd-2961 [001] ..s1 37.095440: net_rtl8169_tx_done: enp3s0 p=179 skb=44d352cf nd-2961 [001] ..s1 37.095441: net_rtl8169_tx_done: enp3s0 p=180 skb=53ac88f1 nd-2961 [001] ..s1 37.095441: net_rtl8169_tx_done: enp3s0 p=181 skb=4f3eedd9 nd-2961 [001] ..s1 37.095442: net_rtl8169_tx_done: enp3s0 p=182 skb=38c81784 nd-2961 [001] ..s1 37.095443: net_rtl8169_tx_done: enp3s0 p=183 skb=1f0a8fc3 nd-2961 [001] ..s1 37.095443: net_rtl8169_tx_done: enp3s0 p=184 skb=281a484d nd-2961 [001] ..s1 37.095444: net_rtl8169_tx_done: enp3s0 p=185 skb=13d8a01a nd-2961 [001] ..s1 37.095445: net_rtl8169_tx_done: enp3s0 p=186 skb=68fe3d70 nd-2961 [001] ..s1 37.095445: net_rtl8169_tx_done: enp3s0 p=187 skb=3cf64dd8 nd-2961 [001] ..s1 37.095446: net_rtl8169_tx_done: enp3s0 p=188 skb=da35591a nd-2961 [001] ..s1 37.095446: net_rtl8169_tx_done: enp3s0 p=189 skb=3013974a nd-2961 [001] ..s1 37.095447: net_rtl8169_tx_done: enp3s0 p=190 skb=7ecddd8e As can be seen, there are a number of packets being transmitted by AF_RXRPC from the dd command. Each of these is an ACK packet. ACK packets are around 850-900 bits in size on the wire according to wireshark. The first Tx interrupt (status 0x85: TxOk is 0x04) occurs 317uS after the packet 178 is added to an empty buffer and is deleted 6uS after that. Packet 190 is deleted 37uS after being added. What surprises me is that it would seem that an ACK packet should take about ~1uS to transmit and the average interval between ACK packets being added to the queue is ~12uS, but the ring is not obviously making progress in being transmitted. Now, I understand that the dirty_tx and cur_tx ring indices are private to the driver and not seen by the device, but I added an extra bit of code to count up the number of descriptors with DescOwn still set, and it always appears to match the number of packets in the buffer. So I'm guessing that the NIC does one of a number of things: (1) It delays starting transmission until it's got a large enough batch. (2) It delays starting transmission until some time has passed since being poked. (3) It takes time to get the transmitter going for some reason. (4) It's batching the update of the Tx ring descriptors and does this right before interrupting the CPU. (5) Reading/writing the Tx descriptor ring from the device is slow. But again note that tshark on the server shows the ACKs arriving as a batch - I've also instrumented the NIC driver for that machine and observed the NIC apparently receiving ACK packets in batches. Also, what does the TxDescUnavail status flag signify? The NIC is setting it, but nothing in the driver takes any notice of it or clears it. Thanks, David --- commit 4ce09fa309d8fb4b1b910091607c2290c45541f1 Author: David Howells <dhowe...@redhat.com> Date: Tue Oct 2 21:14:04 2018 +0100 Add tracepoints for Realtek r8169 diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index ab30aaeac6d3..8d3a86a674b7 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -31,6 +31,9 @@ #include <linux/ipv6.h> #include <net/ip6_checksum.h> +#define CREATE_TRACE_POINTS +#include <trace/events/net_rtl8169.h> + #define MODULENAME "r8169" #define FIRMWARE_8168D_1 "rtl_nic/rtl8168d-1.fw" @@ -6197,6 +6200,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, RTL_W8(tp, TxPoll, NPQ); mmiowb(); + trace_net_rtl8169_tx(tp->dev, tp->dirty_tx, tp->cur_tx, skb); if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { /* Avoid wrongly optimistic queue wake-up: rtl_tx thread must @@ -6302,6 +6306,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp) rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb, tp->TxDescArray + entry); if (status & LastFrag) { + trace_net_rtl8169_tx_done(dev, dirty_tx, tx_skb->skb); u64_stats_update_begin(&tp->tx_stats.syncp); tp->tx_stats.packets++; tp->tx_stats.bytes += tx_skb->skb->len; @@ -6453,6 +6458,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget dev->stats.multicast++; napi_gro_receive(&tp->napi, skb); + trace_net_rtl8169_napi_rx(dev, skb); u64_stats_update_begin(&tp->rx_stats.syncp); tp->rx_stats.packets++; @@ -6466,7 +6472,6 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget count = cur_rx - tp->cur_rx; tp->cur_rx = cur_rx; - return count; } @@ -6478,6 +6483,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) if (status == 0xffff || !(status & (RTL_EVENT_NAPI | tp->event_slow))) return IRQ_NONE; + trace_net_rtl8169_interrupt(tp->napi.dev, status); + rtl_irq_disable(tp); napi_schedule_irqoff(&tp->napi); @@ -6559,6 +6566,7 @@ static int rtl8169_poll(struct napi_struct *napi, int budget) status = rtl_get_events(tp); rtl_ack_events(tp, status & ~tp->event_slow); + trace_net_rtl8169_poll(dev, status); if (status & RTL_EVENT_NAPI_RX) work_done = rtl_rx(dev, tp, (u32) budget); diff --git a/include/trace/events/net_rtl8169.h b/include/trace/events/net_rtl8169.h new file mode 100644 index 000000000000..92049078b3fa --- /dev/null +++ b/include/trace/events/net_rtl8169.h @@ -0,0 +1,125 @@ +/* Realtek RTL8169 tracepoints + * + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowe...@redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. + */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM net_rtl8169 + +#if !defined(_TRACE_NET_RTL8169_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_NET_RTL8169_H + +#include <linux/tracepoint.h> +#include <linux/errqueue.h> + + +TRACE_EVENT(net_rtl8169_interrupt, + TP_PROTO(struct net_device *netdev, u16 status), + + TP_ARGS(netdev, status), + + TP_STRUCT__entry( + __field(u16, status ) + __array(char, name, IFNAMSIZ ) + ), + + TP_fast_assign( + __entry->status = status; + memcpy(__entry->name, netdev->name, IFNAMSIZ); + ), + + TP_printk("%s st=%x", __entry->name, __entry->status) + ); + +TRACE_EVENT(net_rtl8169_poll, + TP_PROTO(struct net_device *netdev, u16 status), + + TP_ARGS(netdev, status), + + TP_STRUCT__entry( + __field(u16, status ) + __array(char, name, IFNAMSIZ ) + ), + + TP_fast_assign( + __entry->status = status; + memcpy(__entry->name, netdev->name, IFNAMSIZ); + ), + + TP_printk("%s st=%x", __entry->name, __entry->status) + ); + +TRACE_EVENT(net_rtl8169_napi_rx, + TP_PROTO(struct net_device *netdev, struct sk_buff *skb), + + TP_ARGS(netdev, skb), + + TP_STRUCT__entry( + __field(struct sk_buff *, skb ) + __array(char, name, IFNAMSIZ ) + ), + + TP_fast_assign( + __entry->skb = skb; + memcpy(__entry->name, netdev->name, IFNAMSIZ); + ), + + TP_printk("%s skb=%p", __entry->name, __entry->skb) + ); + +TRACE_EVENT(net_rtl8169_tx_done, + TP_PROTO(struct net_device *netdev, unsigned int dirty_tx, + struct sk_buff *skb), + + TP_ARGS(netdev, dirty_tx, skb), + + TP_STRUCT__entry( + __field(struct sk_buff *, skb) + __field(unsigned int, dirty_tx) + __array(char, name, IFNAMSIZ ) + ), + + TP_fast_assign( + __entry->skb = skb; + __entry->dirty_tx = dirty_tx; + memcpy(__entry->name, netdev->name, IFNAMSIZ); + ), + + TP_printk("%s p=%u skb=%p", + __entry->name, __entry->dirty_tx, __entry->skb) + ); + +TRACE_EVENT(net_rtl8169_tx, + TP_PROTO(struct net_device *netdev, unsigned int dirty_tx, + unsigned int cur_tx, struct sk_buff *skb), + + TP_ARGS(netdev, dirty_tx, cur_tx, skb), + + TP_STRUCT__entry( + __field(struct sk_buff *, skb ) + __field(unsigned int, dirty_tx ) + __field(unsigned int, cur_tx ) + __array(char, name, IFNAMSIZ ) + ), + + TP_fast_assign( + __entry->skb = skb; + __entry->dirty_tx = dirty_tx; + __entry->cur_tx = cur_tx; + memcpy(__entry->name, netdev->name, IFNAMSIZ); + ), + + TP_printk("%s p=%u-%u skb=%p", + __entry->name, __entry->dirty_tx, __entry->cur_tx, + __entry->skb) + ); + +#endif /* _TRACE_NET_RTL8169_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h>