Hi Roni,

Thanks for your fast answer; good to see you liked the post - at least
it was worth not keeping it to myself in a notebook :)

Yes, thanks for confirming my hypothesis around the offloading part.

As for the rest, that's why I used only iperf3 with the same workload
in both cases. And the flow table on OvS - not mentioned in my email,
but I suppose you saw it in the blog post - is just a simple IP
forwarding ruleset: two rules matching on the src and dst IP addresses
(one per direction) plus ARP flooding, for brevity.
So there is no effect from any complex pipeline, header matching, or
tunneling.
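
For reference, the ruleset is roughly something like the following
(the bridge name, port numbers and addresses are placeholders here,
not the exact values from the post):

    # two unidirectional IP forwarding rules between the endpoints
    ovs-ofctl add-flow br0 "ip,nw_src=10.0.0.1,nw_dst=10.0.0.2,actions=output:2"
    ovs-ofctl add-flow br0 "ip,nw_src=10.0.0.2,nw_dst=10.0.0.1,actions=output:1"
    # ARP is simply flooded so the endpoints can resolve each other
    ovs-ofctl add-flow br0 "arp,actions=FLOOD"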

Regarding the cache, just to be on the same page: previously (~4-5
years ago), NICs supporting OVS offload (like the Mellanox
ConnectX-4/5) could offload the Megaflow cache to the HW. So the NIC
could be considered an additional flow cache; megaflows that were used
frequently got offloaded to the NIC.
I know that is not quite the same as what you or I said before about
offloading the cache... so yes, the cache is not in play when you
offload the whole DP to the hardware; previously, however, only cache
entries were offloaded to the hardware.
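
By the way, if I recall correctly, both models are enabled with the
same knob on the OVS side - only the mechanism behind it differs (TC
flower for the kernel datapath, rte_flow for OVS-DPDK):

    # enable HW offload; ovs-vswitchd needs a restart to pick it up
    ovs-vsctl set Open_vSwitch . other_config:hw-offload=true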

Anyway, thank you again, Roni.
I will try to measure my whole setup with DPDK and pktgen instead of
iperf - then I am in control of everything :)
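
Roughly what I have in mind (the core list, memory channels and the
core-to-port mapping are just placeholders for my setup):

    # launch pktgen-dpdk: core 0 for the CLI, core 1 RX / core 2 TX on port 0
    pktgen -l 0-2 -n 4 -- -P -m "[1:2].0"
    # then, from pktgen's prompt, small frames at full rate, e.g.:
    #   set 0 size 128
    #   set 0 rate 100
    #   start 0

That should get me into the high-PPS regime you mentioned (25 Gbit/s
is roughly 2 Mpps at 1500-byte frames but ~24 Mpps at 128 bytes),
where the SW datapath with 4 cores is expected to struggle while the
HW should not care.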

On Mon, 2021-05-03 at 17:55 +0300, scimdi king wrote:
> Hi Levi,
> Very impressive work. A few points I can think of:
> 
> I think iperf3 is single threaded, which means that running it in
> parallel mode will not increase the overall throughput on the sending
> side; it will only add to the OVS workload. Maybe iperf3 is the
> bottleneck here.
> 
> High throughput is achieved with big packet sizes, and this reduces
> the PPS. OVS in SW (like in HW) does work per packet, so smaller
> packet sizes might show the difference between HW and SW here. The
> packet rate at 25G with an MTU of 1500 is about 2 Mpps; if, for
> example, you use 128-byte packets, it will be about 24 Mpps.
> In this simple scenario the HW can cope with no problem, while the
> SW, I guess, won't (with 4 cores as you described).
> 
> Another point is concurrency: OVS-DPDK has the EMC, which has a very
> nice impact at low concurrency. High concurrency - even thousands of
> flows, I think - will drop the performance.
> 
> The last point is the complexity of the processing: for example, if
> you have to decap and modify the MAC, it will have a higher impact on
> SW performance, but I guess in this blog you wanted to focus on the
> most simplified flow.
> 
> On Mon, May 3, 2021 at 1:47 PM Levente Csikor
> <levente.csi...@gmail.com> wrote:
> > 
> > Hi,
> > 
> > I have been playing around with OvS(-DPDK) for a while, and
> > nowadays I am investigating its performance on SmartNICs.
> > More precisely, the recent SoC-based Mellanox / NVIDIA BlueField-2
> > SmartNIC (or DPU, as NVIDIA started to call its product) heavily
> > uses an OVS(-DPDK) running on its ARM cores when it processes
> > packets from and to the host system (in SmartNIC mode).
> > 
> > On this SmartNIC, the OVS kernel datapath can be offloaded to the
> > hardware via TC flower. If OVS-DPDK is running on the SmartNIC, it
> > can also be offloaded to the hardware - essentially, it is done via
> > DPDK rte_flow (according to this OVS Conf talk:
> > https://www.openvswitch.org/support/ovscon2019/day2/0951-hw_offload_ovs_con_19-Oz-Mellanox.pdf
> > )
> > 
> > So, despite the different offloading approaches, when the datapath
> > is offloaded to the hardware and all packets are processed by the
> > hardware exclusively, the performance should be the same, right?
> 
> Basically yes; however, OVS-DPDK and OVS-Kernel are a bit different
> in some use cases. For example, when using VXLAN, the datapath rules
> will look different. In the presented example it should be the same,
> AFAIK.
> 
> > In other words, while OVS-DPDK performs much better than the kernel
> > datapath running on a host system, once offloaded they are
> > essentially the same, since it is the same "hardware block" that
> > implements the corresponding (part of the) datapath. Does this
> > interpretation make sense?
> > 
> 
> Yes, it works the same for the basic stuff.
> 
> > Or, since the megaflow cache algorithm is slightly different in
> > each implementation, should the DPDK-based offloading perform
> > better in some corner cases (like the discrepancy discussed in the
> > megaflow cache presentation -
> > https://www.youtube.com/watch?v=DSC3m-Bww64)?
> > 
> 
> The caches only affect the SW, so if a flow is offloaded and you
> generate an attack, it should not affect the throughput at all,
> because everything is in HW.
> 
> Thanks,
> Roni
> 
> > 
> > More information about my measurements (from which this question
> > was born) can be found in a blog post:
> > https://medium.com/codex/nvidia-mellanox-bluefield-2-smartnic-hands-on-tutorial-rig-for-dive-part-vii-1417e2e625bf
> > 
> > Thank you,
> > Levi
> > 

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
