On Mon, Jul 10, 2017 at 1:23 PM, John Fastabend <john.fastab...@gmail.com> wrote:
> On 07/09/2017 06:37 AM, Saeed Mahameed wrote:
>>
>>
>> On 7/7/2017 8:35 PM, John Fastabend wrote:
>>> This adds support for a bpf_redirect helper function to the XDP
>>> infrastructure. For now this only supports redirecting to the egress
>>> path of a port.
>>>
>>> In order to support drivers handling a xdp_buff natively this patch
>>> uses a new ndo operation ndo_xdp_xmit() that pushes a xdp_buff
>>> to the specified device.
>>>
>>> If the program specifies either (a) an unknown device or (b) a device
>>> that does not support the operation, a BPF warning is thrown and the
>>> XDP_ABORTED error code is returned.
>>>
>>> Signed-off-by: John Fastabend <john.fastab...@gmail.com>
>>> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
>>> ---
>
> [...]
>
>>>
>>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>>> +{
>>> +	if (dev->netdev_ops->ndo_xdp_xmit) {
>>> +		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>>
>> Hi John,
>>
>> I have some concern here regarding synchronization between the
>> redirecting device and the target device:
>>
>> If the target device's NAPI is also doing XDP_TX on the same XDP TX
>> ring that this NDO might be redirecting xdp packets into, there would
>> be a race accessing this ring's resources (buffers and descriptors).
>> Maybe you addressed this issue in the device driver implementation of
>> this ndo or with some NAPI tricks/assumptions. I guess we have the
>> same issue if you run the same program to redirect traffic from
>> multiple netdevices into one netdevice; how do you synchronize access
>> to this TX ring?
>
> The implementation uses a per-cpu TX ring to resolve these races. And
> the pair of driver interface API calls, xdp_do_redirect() and
> xdp_do_flush_map(), must be completed in a single poll() handler.
>
> This comment was included in the header file to document this,
>
> /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>  * same cpu context. Further for best results no more than a single map
>  * for the do_redirect/do_flush pair should be used. This limitation is
>  * because we only track one map and force a flush when the map changes.
>  * This does not appear to be a real limitation for existing software.
>  */
>
> In general some documentation about implementing XDP would probably be
> useful to add in Documentation/networking, but this IMO goes beyond just
> this patch series.
>
>>
>> Maybe we need some clear guidelines in this ndo documentation stating
>> how to implement this ndo, what the assumptions are on those XDP TX
>> redirect rings, and from which context this ndo can run.
>>
>> Can you please elaborate?
>
> I think the best implementation is to use a per-cpu TX ring as I did in
> this series. If your device is limited in the number of queues for some
> reason, some other scheme would need to be devised. Unfortunately, the
> only thing I've come up with for this case (using only this series) would
> both impact performance and make the code complex.
>
> A nice solution might be to constrain networking "tasks" to only a subset
> of cores. For 64+ core systems this might be a good idea. It would allow
> avoiding locking by using per_cpu logic, but also avoid networking
> consuming slices of every core in the system. As core count goes up I
> think we will eventually need to address this. I believe Eric was
> thinking along these lines with his netconf talk, iirc. Obviously this
> work is way outside the scope of this series though.
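
Before getting to that, let me make sure I am reading the redirect/flush
contract correctly, since it is exactly what I need to get right in other
drivers. The sketch below is only my interpretation of the comment you
quoted; the mydrv_*() helpers and the ring struct are placeholders I made
up, and I am going from memory on the exact xdp_do_redirect() arguments,
so only the pairing inside one poll is meant to reflect your series:

#include <linux/netdevice.h>
#include <linux/filter.h>

/* Placeholder ring; a real driver has its own rx ring structure. */
struct mydrv_rx_ring {
	struct napi_struct napi;
	struct net_device *netdev;
	struct bpf_prog *xdp_prog;
};

/* Placeholders for the driver-specific descriptor handling. */
static bool mydrv_rx_pending(struct mydrv_rx_ring *ring);
static void mydrv_fill_xdp_buff(struct mydrv_rx_ring *ring, struct xdp_buff *xdp);

static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_rx_ring *ring = container_of(napi, struct mydrv_rx_ring, napi);
	struct bpf_prog *xdp_prog = READ_ONCE(ring->xdp_prog);
	int done = 0;

	/* assumes an XDP program is attached; the non-XDP path is elided */
	while (done < budget && mydrv_rx_pending(ring)) {
		struct xdp_buff xdp;

		mydrv_fill_xdp_buff(ring, &xdp);

		switch (bpf_prog_run_xdp(xdp_prog, &xdp)) {
		case XDP_REDIRECT:
			/* queues the buffer on the target device's per-cpu
			 * TX ring; stays on this cpu, inside this poll.  A
			 * real driver checks the return value and recycles
			 * the buffer on error.
			 */
			xdp_do_redirect(ring->netdev, &xdp, xdp_prog);
			break;
		default:
			/* XDP_PASS/XDP_TX/XDP_DROP handling elided */
			break;
		}
		done++;
	}

	/* flush everything queued via xdp_do_redirect() before leaving
	 * this cpu/poll context, per the comment quoted above
	 */
	xdp_do_flush_map();

	if (done < budget)
		napi_complete_done(napi, done);
	return done;
}

If that matches your intent, then spelling exactly this out in the
Documentation/networking text you mention would probably answer most of
Saeed's questions.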
I agree that constraining networking tasks to a subset of cores is outside
the scope of this series, but I think it is important to consider the
impact of output queue selection in both heterogeneous and homogeneous
driver setups, and how tx could be optimized or even made more reliable.
I think that was part of Saeed's point.

I got base redirect support for bnxt_en working yesterday, but for it and
other drivers that do not necessarily create a ring/queue per core the way
ixgbe does, there is probably a bit more work to do in each driver to
properly track output tx rings/queues than what you have done with ixgbe
(rough sketch of what I mean below).

>
>
>> Thanks,
>> Saeed.
>>
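
To make that last point concrete, this is roughly the direction I am
looking at for a ndo_xdp_xmit() in a driver that has fewer XDP TX rings
than cpus. None of these names are actual bnxt_en code; they are just
placeholders to show the cpu-to-ring mapping and the per-ring lock that
becomes necessary once rings are shared:

#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/spinlock.h>
#include <linux/smp.h>

struct mydrv_tx_ring {
	spinlock_t lock;
	/* descriptor ring state elided */
};

struct mydrv_priv {
	unsigned int num_xdp_rings;	/* may be < nr_cpu_ids */
	struct mydrv_tx_ring *xdp_rings;
};

/* Placeholders for the driver-specific descriptor handling. */
static bool mydrv_tx_ring_has_room(struct mydrv_tx_ring *ring);
static void mydrv_tx_xdp_frame(struct mydrv_tx_ring *ring, struct xdp_buff *xdp);

/* Called from the redirecting device's poll handler, so we are in
 * softirq context and smp_processor_id() is stable here.
 */
static int mydrv_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	unsigned int qidx = smp_processor_id() % priv->num_xdp_rings;
	struct mydrv_tx_ring *ring = &priv->xdp_rings[qidx];
	int err = 0;

	/* Two cpus, or a redirect racing with this device's own XDP_TX,
	 * can now land on the same ring, so the ring needs a lock.  With
	 * a true ring-per-cpu layout the lock is always uncontended.
	 */
	spin_lock(&ring->lock);
	if (unlikely(!mydrv_tx_ring_has_room(ring)))
		err = -ENOSPC;
	else
		mydrv_tx_xdp_frame(ring, xdp);
	spin_unlock(&ring->lock);

	return err;
}

The obvious downside is that the lock sits in the hot path even when it is
uncontended, which is why I agree the per-cpu ring model in your series is
the preferred case and anything else is going to cost some performance.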