Hi Kim,
I'll share the key lessons I have learned since I started working on
high-speed software networking in 2006, when everyone was laughing at me
because I claimed to achieve 10Gbps networking with a CPU.
CPU is less important than memory/QPI
On x86, the memory subsystem includes things like Cache Boxes, the Home
Agent, DRAM controllers... The Home Agent is responsible for knowing
which CPU node holds a given cacheline, so it can become a centralized
bottleneck. DRAM controllers have a queue of pending DRAM requests
(instruction pipeline, data prefetch, data...). QPI routing may also
severely impact performance. I remember using a 4-socket system that had
half the performance of a 2-socket system because of either bad QPI
routing programming by the BIOS or a hardware issue.
An order of magnitude to keep in mind: at 100Gbps with 64-byte packets,
the packet data and each 64-byte metadata cacheline used per packet each
consume roughly a full DRAM channel. As an example, and not counting the
application data to be used (FIB, DNS database...), a 100Gbps DPDK
bridging application requires 3 memory channels per port (to reach line
rate, if the IO allows it)... There is a lot more to say, but I'll let
you do your own research ;-)
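As a rough check of that order of magnitude, here is a minimal sketch
of the arithmetic. The DDR4-2400 channel bandwidth and the "each
cacheline crosses the memory bus twice" figure are my assumptions, not
measurements, and the sketch ignores caches entirely.

```python
# Back-of-the-envelope memory budget for minimum-size packets at line rate.
# Values marked "Assumption" are illustrative, not measured.

LINE_RATE_BPS = 100e9        # 100Gbps port
FRAME_BYTES = 64             # minimum Ethernet frame
WIRE_OVERHEAD_BYTES = 20     # preamble + SFD (8 bytes) and inter-frame gap (12)
CACHELINE_BYTES = 64

# Assumption: peak bandwidth of one DDR4-2400 channel (2400 MT/s * 8 bytes).
DRAM_CHANNEL_BYTES_PER_S = 2400e6 * 8


def packets_per_second(line_rate_bps, frame_bytes):
    """Packet rate on the wire, including per-frame Ethernet overhead."""
    return line_rate_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def cacheline_stream_bytes_per_s(pps, touches):
    """Bytes/s moved when every packet moves one cacheline `touches` times."""
    return pps * CACHELINE_BYTES * touches


pps = packets_per_second(LINE_RATE_BPS, FRAME_BYTES)
# Assumption: each cacheline crosses the memory bus twice (one write, one read).
stream = cacheline_stream_bytes_per_s(pps, touches=2)
channel_fraction = stream / DRAM_CHANNEL_BYTES_PER_S

print(f"{pps / 1e6:.1f} Mpps")
print(f"{stream / 1e9:.1f} GB/s per cacheline stream")
print(f"{channel_fraction:.2f} of one assumed DRAM channel")
```

Under these assumptions a single cacheline stream (packet data, or one
metadata line) lands close to one channel, which is why the channel
count adds up so quickly per port.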
BTW, why would you want to do 100Gbps line rate (or very close to it)?
To ensure that each node has the capacity to resist a DDoS attack
powered by DPDK/ODP/native "applications".
PCI is your enemy (or not that good a friend)
PCI chipset behavior is complex. The typical payload size on x86 is 256
bytes. So I assumed that using a 1KB max payload to support the average
670-byte internet packet size would give better results... But no: early
DMA transaction acknowledgement is disabled if the payload is not 256
bytes, so performance dropped significantly.
You may have an embedded switch on the NIC, so you think that offloading
will give you a benefit. Yes at low speed, but you can't build a 50Gbps
service chain because most NICs are in PCI x8 Gen3 slots, which are
limited to 50Gbps of bandwidth.
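To see where a figure like 50Gbps comes from, here is a minimal sketch
of the link arithmetic. The 8 GT/s lane rate and 128b/130b encoding are
PCIe Gen3 facts; the 24-byte per-TLP overhead is an assumption of mine
(it varies with header size and optional fields), and the sketch leaves
out descriptor fetches, doorbells and flow-control traffic, all of
which push the real number lower.

```python
# Rough usable bandwidth of a PCIe Gen3 link, per direction.

GEN3_TRANSFERS_PER_S = 8e9   # 8 GT/s per lane (PCIe 3.0)
ENCODING = 128 / 130         # 128b/130b line encoding (PCIe 3.0)

# Assumption: ~24 bytes of framing, header and CRC around each TLP payload.
TLP_OVERHEAD_BYTES = 24


def link_bps(lanes):
    """Bit rate of the link after line encoding, before packet overhead."""
    return GEN3_TRANSFERS_PER_S * ENCODING * lanes


def payload_bps(lanes, max_payload_bytes):
    """Bit rate left for payload when every TLP carries a full payload."""
    efficiency = max_payload_bytes / (max_payload_bytes + TLP_OVERHEAD_BYTES)
    return link_bps(lanes) * efficiency


raw = link_bps(8)
usable = payload_bps(8, 256)

print(f"x8 Gen3 after encoding: {raw / 1e9:.1f} Gbps")
print(f"x8 Gen3 payload at 256-byte TLPs: {usable / 1e9:.1f} Gbps")
```

This is an upper bound under the stated assumptions, not a prediction:
it only shows that an x8 Gen3 slot has little headroom above 50Gbps
even before the NIC's own control traffic is counted.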
So the conclusion is: don't try to understand those limits, create a
testbed that really mimics the target "size" and topology of your use
case and measure.
Don't do tests at 10Gbps if your target is 100Gbps.
Starting at 50Gbps you will be bumping into the PCI DMA transaction rate
barrier. Unless you have a smart IO model (multiple packets per DMA
transaction, see Netcope for instance) that the SDK architecture
supports in zero-copy, you won't reach line rate or have any capacity
left for an application (zero-copy of data or metadata reduction can
save a DRAM channel for the application at this "speed"). I think (but
am not sure) you can squeeze two packets in a buffer with Mellanox
cards: that can be instrumental in reaching 50Gbps line rate, but I
don't know if DPDK supports this feature.
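The effect of packing several packets into one DMA transaction can be
sketched as simple arithmetic. This is an illustrative model only: it
assumes one DMA transaction per batch of packets and ignores descriptor
and completion traffic, and the batch sizes are hypothetical.

```python
# DMA transaction rate needed to sustain line rate, as a function of how
# many packets the IO model packs into one DMA transaction.

WIRE_OVERHEAD_BYTES = 20     # preamble + SFD (8 bytes) and inter-frame gap (12)


def packets_per_second(line_rate_bps, frame_bytes):
    """Packet rate on the wire, including per-frame Ethernet overhead."""
    return line_rate_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def dma_transactions_per_second(line_rate_bps, frame_bytes, packets_per_dma):
    """Assumption: exactly one DMA transaction per batch of packets."""
    return packets_per_second(line_rate_bps, frame_bytes) / packets_per_dma


for batch in (1, 2, 4, 8):   # hypothetical batch sizes
    rate = dma_transactions_per_second(50e9, 64, batch)
    print(f"{batch} packet(s) per DMA: {rate / 1e6:.1f} M transactions/s")
```

With one packet per transaction, 64-byte packets at 50Gbps need about
74 million transactions per second; two packets per buffer halve that,
which is why that kind of feature matters at this speed.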
Don't do pps at the switch level if your target is fast VM application
behavior.
Measuring that a software switch can do 10Gbps line rate with 64-byte
packets does not help at all to predict TCP application performance in a
VM. Factors such as GRO/GSO support are more important, because the
limiting factor is TCP window opening.
I measured web traffic over IPsec links between VMs. The key performance
factor was the latency of the switching/IPsec combo: if latency is above
a certain level, the TCP window of the endpoints does not open and the
in-between software switches become under-utilized.
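The link between latency and under-utilized switches follows from the
usual window/RTT bound on a single TCP flow. The sketch below is a
simplified model with a fixed window (real stacks grow and shrink it),
and the window size, RTTs and link speed are hypothetical values chosen
only for illustration.

```python
# Upper bound on single-flow TCP throughput: at most one window of data
# can be in flight per round trip.

def window_limited_bps(window_bytes, rtt_seconds):
    """Throughput ceiling imposed by the window, in bits per second."""
    return window_bytes * 8 / rtt_seconds


def achievable_bps(link_bps, window_bytes, rtt_seconds):
    """The flow is capped by the slower of the link and the window bound."""
    return min(link_bps, window_limited_bps(window_bytes, rtt_seconds))


LINK_BPS = 10e9              # hypothetical 10Gbps virtual link
WINDOW_BYTES = 64 * 1024     # hypothetical 64 KiB window

for rtt_us in (10, 100, 1000):   # hypothetical end-to-end round-trip times
    bps = achievable_bps(LINK_BPS, WINDOW_BYTES, rtt_us * 1e-6)
    print(f"RTT {rtt_us:>4} us: {bps / 1e9:.2f} Gbps")
```

In this model, adding latency in the switching path lowers what the
endpoints can push no matter how many packets per second the switch
could forward.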
My view is that if you use a hardware-specific SDK to build your
hardware-specific application, you will get the best out of the
hardware. The gains can range from 30% to 100% depending on the HW, so
it is not negligible (you may have to prove this assertion ;-). One
major reason is the ability to use exactly the software metadata you
need, which may shrink to a single cache line or even to no software
metadata at all, as you could use the hardware descriptor directly. The
other reason is the ability to use the device's native IO model, which
DPDK may not support. The price to pay is hardware or vendor dependence.
FF
PS1: You may want to clarify your search: you haven't stated whether
your interest is an L2 switch or an L3 switch, or whether you are
considering bare-metal, container or VM switching.
If you want L3 then you probably want to focus on VPP, Contrail or Snabb
rather than the low-level packet IO frameworks. With the latest Intel
AVF technology, DPDK is almost irrelevant for VPP and actually slows
things down on the same hardware (Intel XL710 card).
Additionally, the kernel community is working on AF_XDP which may be
relevant for your case.
PS2: I am not sure NANOG is the best list to discuss the technical
details you want. That said, it may be the best place to discuss the use
cases or realistic testbed setup.
On 04.06.2018 07:41, Kasper Adel wrote:
Hello
I’m asked to evaluate switching platforms that have different forwarding
chips but the same OS.
Assuming these vendors give the same SDK and similar
documentation/support, then what would be comparison points to consider,
other than the obvious (price, features, bps, pps)?
I’m thinking, how do I validate their claims about capability to do
leaf/spine arch, ToR/Gateways, telemetry, serviceability, facilities to
troubleshoot packet drops or FIB programming misses, hidden tools... etc.
It would be great if anyone can give some thoughts around it, especially
if you have tried one or both.
Thanks
Kim