(Removed bloat-lists to avoid cross ML-posting)

On Mon, 4 Dec 2017 18:19:09 +0100 Matthias Tafelmeier <matthias.tafelme...@gmx.net> wrote:

> Hello,
>
> > Scaling up to more CPUs and TCP-stream, Tariq[1] and I have shown the
> > Linux kernel network stack scales to 94Gbit/s (linerate minus overhead).
> > But when the driver's page-recycler fails, we hit bottlenecks in the
> > page-allocator, that cause negative scaling to around 43Gbit/s.
> >
> > [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418f...@mellanox.com
> >
> > Linux has for a _long_ time been doing 10Gbit/s TCP-stream easily, on
> > a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> > but over the last couple of years the network stack has been optimized
> > (with UDP workloads), and as a result we can do 10G without TSO/GRO on
> > a single CPU. This is "only" 812Kpps with MTU size frames.
>
> Cannot find the reference anymore, but there was once some workshop held
> by you during some netdev where you were stating that you're practically
> in rigorous exchange with NIC vendors as to having them tremendously
> increase the RX/TX rings(queues) numbers.

You are misquoting me. I have not recommended tremendously increasing
the RX/TX rings(queues) numbers.

Actually, we should likely decrease the number of RX-rings, per
recommendation of Eric Dumazet[1], to increase the chance of packet
aggregation/bulking during the NAPI-loop. And use something like
CPUMAP[2] to re-distribute load across CPUs.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
[2] https://git.kernel.org/torvalds/c/452606d6c9cd

You might have heard/seen me talk about increasing the ring queue size,
that is, the frames/pages available per RX-ring queue[3][4]. I generally
don't recommend increasing that too much, as it hurts cache-usage. The
real reason it sometimes helps to increase the RX-ring size on the Intel
based NICs is that they intermix page-recycling into their RX-ring; I
have now added a counter for when that recycling fails[5].
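
As a side note on the quoted "812Kpps with MTU size frames" figure: it can be reproduced from standard Ethernet framing overhead. The sketch below is only a back-of-the-envelope check, assuming a 1500-byte MTU and the usual per-frame overhead on the wire (Ethernet header, FCS, preamble and inter-frame gap); the function and constant names are made up for illustration.

```python
# Per-frame overhead on the wire for standard Ethernet, in bytes.
ETH_HEADER = 14        # dst MAC + src MAC + EtherType
FCS = 4                # frame check sequence
PREAMBLE_SFD = 8       # preamble + start-of-frame delimiter
INTER_FRAME_GAP = 12   # minimum idle gap between frames, in byte times


def packets_per_second(link_bits_per_sec: float, mtu: int = 1500) -> float:
    """Packet rate needed to fill the link with MTU-sized frames."""
    wire_bytes = mtu + ETH_HEADER + FCS + PREAMBLE_SFD + INTER_FRAME_GAP
    return link_bits_per_sec / (wire_bytes * 8)


pps_10g = packets_per_second(10e9)
print(f"10Gbit/s with 1500-byte MTU: {int(pps_10g):,} packets/sec")
```

Each MTU-sized frame occupies 1538 bytes on the wire under these assumptions, which is where a rate of roughly 812 thousand packets per second comes from.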
[3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
[4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
[5] https://git.kernel.org/torvalds/c/86e23494222f3

> Further, that there are hardly
> any limits to the number other than FPGA magic/physical HW - up to
> millions is viable was coined back then. May I ask where this ended up?
> Wouldn't that be key for massive parallelization either - With having a
> queue(producer), a CPU (consumer) - vice versa - per flow at the
> extreme? Did this end up in this SMART-NIC thingummy? The latter is
> rather targeted at XDP, no?

I do have future plans for (wanting drivers to support) dynamically
adding more RX-TX-queue-pairs. The general idea is to have the NIC HW
filter packets per application into a specific NIC queue number, which
can be mapped directly into an application (and I want a queue-pair so
that the app can also TX).

I actually imagine that we can do the application steering via
XDP_REDIRECT. And by having the application register user-pages, like
AF_PACKET-V4, we can achieve zero-copy into userspace from XDP. A subtle
trick here is that zero-copy only occurs if the RX-queue number matches
(XDP operating at driver ring level could know), meaning that NIC HW
filter setup could happen async (but premapping userspace pages still
has to happen upfront, before starting the app/socket).

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
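
A toy model of the "zero-copy only if the RX-queue number matches" idea described above may help. This is not kernel code and not any existing API: the class and function names are hypothetical, and the fallback to a copy when the queue does not match is an assumption read into the remark that the HW filter setup could happen asynchronously.

```python
from dataclasses import dataclass


@dataclass
class AppSocket:
    """Hypothetical application socket whose pages are premapped to one RX queue."""
    bound_rx_queue: int
    zero_copy_frames: int = 0
    copied_frames: int = 0


def deliver(sock: AppSocket, frame_rx_queue: int) -> str:
    """Pick the delivery path for a frame that arrived on frame_rx_queue."""
    if frame_rx_queue == sock.bound_rx_queue:
        # The frame already sits in pages the application registered.
        sock.zero_copy_frames += 1
        return "zero-copy"
    # Assumed fallback: the frame landed in another ring, so it is copied.
    sock.copied_frames += 1
    return "copy"


sock = AppSocket(bound_rx_queue=3)

# Before the NIC HW filter is installed, the flow may arrive on any queue.
paths_before = [deliver(sock, q) for q in (0, 1, 3)]

# Once the filter is active, the flow is steered to the bound queue.
paths_after = [deliver(sock, 3) for _ in range(3)]

print(paths_before, paths_after)
```

In this model the application works correctly from the start and simply gets faster once the filter takes effect, which is one way to read why the filter setup need not be synchronous while the page premapping must be.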