On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer <bro...@redhat.com> wrote:
> On Wed, 7 Sep 2016 23:55:42 +0300
> Or Gerlitz <gerlitz...@gmail.com> wrote:
>
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <sae...@mellanox.com> wrote:
>> > From: Rana Shahout <ra...@mellanox.com>
>> >
>> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in the mlx5e driver.
>> >
>> > When XDP is on, we make sure to change the channels' RQ type to
>> > MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
>> > ensure "page per packet".
>> >
>> > On XDP set, we fail if HW LRO is set and ask the user to turn it
>> > off. Since on ConnectX4-LX HW LRO is always on by default, this will be
>> > annoying, but we prefer not to enforce LRO off from the XDP set function.
>> >
>> > A full channels reset (close/open) is required only when turning XDP
>> > on/off.
>> >
>> > When XDP set is called just to exchange programs, we update each RQ's
>> > xdp program on the fly. To synchronize with that RQ's current RX
>> > data path activity, we temporarily disable the RQ and ensure the RX
>> > path is not running, then quickly update and re-enable the RQ.
>> > For that we do:
>> > - rq.state = disabled
>> > - napi_synchronize
>> > - xchg(rq->xdp_prog)
>> > - rq.state = enabled
>> > - napi_schedule // Just in case we've missed an IRQ
>> >
>> > Packet rate performance testing was done with pktgen 64B packets on
>> > the TX side, comparing TC drop action on the RX side to XDP fast drop.
>> >
>> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>> >
>> > Comparison is done between:
>> > 1. Baseline, before this patch, with TC drop action
>> > 2. This patch with TC drop action
>> > 3. This patch with XDP RX fast drop
>> >
>> > Streams    Baseline (TC drop)    TC drop     XDP fast drop
>> > ----------------------------------------------------------
>> > 1          5.51Mpps              5.14Mpps    13.5Mpps
>>
>> This (13.5 Mpps) is less than 50% of the result we presented at the
>> XDP summit, which was obtained by Rana. Please see if/how much this
>> grows if you use more sender threads, but have all of them xmit the
>> same stream/flows, so we're on one ring. That (XDP with a single RX
>> ring getting packets from N remote TX rings) would be your canonical
>> baseline for any further numbers.
>
> Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
> that you should be able to reach 23Mpps on a single CPU. This is
> an XDP-drop simulation with order-0 pages being recycled through my
> page_pool code, plus avoiding the cache misses (notice you are using a
> CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).
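
As an aside on the program-exchange sequence quoted in the commit
message above, here is a minimal C sketch of how such a
disable/synchronize/swap/re-enable helper could look. The stand-in
struct layout, the MLX5E_RQ_STATE_ENABLED bit name, and the helper
name are assumptions for illustration, not the literal mlx5e code:

/* Minimal sketch of the on-the-fly XDP program exchange described
 * above; simplified stand-in types, not the literal mlx5e code. */
#include <linux/bitops.h>
#include <linux/bpf.h>
#include <linux/netdevice.h>

struct mlx5e_rq {                       /* assumed, illustration only */
	unsigned long       state;      /* RQ enabled/disabled bit */
	struct bpf_prog    *xdp_prog;   /* program run per packet */
	struct napi_struct *napi;       /* NAPI context for this RQ */
};
#define MLX5E_RQ_STATE_ENABLED 0        /* assumed bit index */

static void rq_exchange_xdp_prog(struct mlx5e_rq *rq, struct bpf_prog *prog)
{
	struct bpf_prog *old_prog;

	/* rq.state = disabled: the RX poll loop stops taking new work */
	clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);

	/* Wait for any in-flight NAPI poll to finish, so no CPU is
	 * still dereferencing the old program. */
	napi_synchronize(rq->napi);

	/* Atomically publish the new program, drop the old one. */
	old_prog = xchg(&rq->xdp_prog, prog);
	if (old_prog)
		bpf_prog_put(old_prog);

	/* rq.state = enabled */
	set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);

	/* Just in case we missed an IRQ while disabled. */
	napi_schedule(rq->napi);
}

The napi_synchronize() is what makes the plain xchg() safe here: once
it returns, any poll that started before the enable bit was cleared
has completed, so no RX path can still hold the old bpf_prog pointer.
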
so this takes us from 13M to 23M, good. Could you explain why the move
from order-3 to order-0 is hurting the performance so much (drop from
32M to 23M)? Any way we can overcome that?

> The 23Mpps number looks like some HW limitation, as the increase was

not HW, I think. As I said, Rana got 32M with striding RQ when she was
using order-3 (or did we use order-5?)

> is not proportional to page-allocator overhead I removed (and CPU freq
> starts to decrease). I also did scaling tests to more CPUs, which
> showed it scaled up to 40Mpps (you reported 45M). And at the Phy RX
> level I see 60Mpps (50G max is 74Mpps).
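
To put that last figure in perspective, 74Mpps is simply
minimum-size-frame line-rate arithmetic: a 64B Ethernet frame
occupies 64 + 8 (preamble/SFD) + 12 (inter-frame gap) = 84 bytes
= 672 bits on the wire, so at 50Gbit/s:

    50e9 bits/s / 672 bits/frame ~= 74.4 Mpps

which also means the 60Mpps observed at the PHY RX level is roughly
80% of theoretical line rate.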