Neil Horman <nhor...@tuxdriver.com> writes:

> On Fri, Jul 12, 2019 at 02:33:29PM +0200, Toke Høiland-Jørgensen wrote:
>> Neil Horman <nhor...@tuxdriver.com> writes:
>>
>> > On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
>> >> Neil Horman <nhor...@tuxdriver.com> writes:
>> >>
>> >> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
>> >> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> >> >> > From: Ido Schimmel <ido...@idosch.org>
>> >> >> > Date: Sun, 7 Jul 2019 10:58:17 +0300
>> >> >> >
>> >> >> > > Users have several ways to debug the kernel and understand why a
>> >> >> > > packet was dropped. For example, using "drop monitor" and "perf".
>> >> >> > > Both utilities trace kfree_skb(), which is the function called
>> >> >> > > when a packet is freed as part of a failure. The information
>> >> >> > > provided by these tools is invaluable when trying to understand
>> >> >> > > the cause of a packet loss.
>> >> >> > >
>> >> >> > > In recent years, large portions of the kernel data path were
>> >> >> > > offloaded to capable devices. Today, it is possible to perform
>> >> >> > > L2 and L3 forwarding in hardware, as well as tunneling (IP-in-IP
>> >> >> > > and VXLAN). Different TC classifiers and actions are also
>> >> >> > > offloaded to capable devices, at both ingress and egress.
>> >> >> > >
>> >> >> > > However, when the data path is offloaded it is not possible to
>> >> >> > > achieve the same level of introspection, as tools such as "perf"
>> >> >> > > and "drop monitor" become irrelevant.
>> >> >> > >
>> >> >> > > This patchset aims to solve this by allowing users to monitor
>> >> >> > > packets that the underlying device decided to drop along with
>> >> >> > > relevant metadata such as the drop reason and ingress port.
>> >> >> >
>> >> >> > We are now going to have 5 or so ways to capture packets passing
>> >> >> > through the system, this is nonsense.
>> >> >> >
>> >> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now
>> >> >> > this devlink thing.
>> >> >> >
>> >> >> > This is insanity, too many ways to do the same thing and therefore
>> >> >> > the worst possible user experience.
>> >> >> >
>> >> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb
>> >> >> > events, XDP perf events, and these taps there too.
>> >> >> >
>> >> >> > I mean really, think about it from the average user's perspective.
>> >> >> > To see all drops/pkts I have to attach a kfree_skb tracepoint, and
>> >> >> > not just listen on devlink but configure a special tap thing
>> >> >> > beforehand, and then if someone is using XDP I gotta set up
>> >> >> > another perf event buffer capture thing too.
>> >> >>
>> >> >> Dave,
>> >> >>
>> >> >> Before I start working on v2, I would like to get your feedback on
>> >> >> the high level plan. Also adding Neil, who is the maintainer of
>> >> >> drop_monitor (and the counterpart DropWatch tool [1]).
>> >> >>
>> >> >> IIUC, the problem you point out is that users need to use different
>> >> >> tools to monitor packet drops based on where these drops occur
>> >> >> (SW/HW/XDP).
>> >> >>
>> >> >> Therefore, my plan is to extend the existing drop_monitor netlink
>> >> >> channel to also cover HW drops. I will add a new message type and a
>> >> >> new multicast group for HW drops and encode in the message what is
>> >> >> currently encoded in the devlink events.
>> >> >>
>> >> > A few things here:
>> >> > IIRC we don't announce individual hardware drops; drivers record
>> >> > them in internal structures, and they are retrieved on demand via
>> >> > ethtool calls, so you will either need to include some polling
>> >> > (probably not a very performant idea), or some sort of flagging
>> >> > mechanism to indicate that on the next message sent to user space
>> >> > you should go retrieve hw stats from a given interface. I certainly
>> >> > wouldn't mind seeing this happen, but it's more work than just
>> >> > adding a new netlink message.
>> >> >
>> >> > Also, regarding XDP drops, we won't see them if the xdp program is
>> >> > offloaded to hardware (you'll need your hw drop gathering mechanism
>> >> > for that), but for xdp programs run on the cpu, dropwatch should
>> >> > already catch those. I.e. if the xdp program returns a DROP result
>> >> > for a packet being processed, the OS will call kfree_skb on its
>> >> > behalf, and dropwatch will catch that.
>> >>
>> >> There is no skb by the time an XDP program runs, so this is not true.
>> >> As I mentioned upthread, there's a tracepoint that will get called if
>> >> an error occurs (or the program returns XDP_ABORTED), but in most
>> >> cases, XDP_DROP just means that the packet silently disappears...
>> >>
>> > As I noted, that's only true for xdp programs that are offloaded to
>> > hardware; I was only speaking of XDP programs that run on the cpu. For
>> > the former case, we obviously need some other mechanism to detect
>> > drops, but for cpu-executed xdp programs, the OS is responsible for
>> > freeing skbs associated with programs that return XDP_DROP.
>>
>> Ah, I think maybe you're thinking of generic XDP (also referred to as
>> skb mode)? That is a separate mode; an XDP program loaded in "native
> Yes, was I not clear about that?
No, not really. "Generic XDP" is not the same as "XDP"; the generic mode
is more of a debug mode (as far as I'm concerned, at least). So in the
common case, it is absolutely not the case that the kernel will end up
calling kfree_skb after an XDP_DROP; so I got somewhat thrown off by
your insistence that it would... :)

-Toke