Hello VPP experts. I have encountered an unexpected pattern in performance results, so I wonder: is there a bug somewhere (in VPP or CSIT), or is there a subtle reason why this pattern should be expected?
Usually, the more processing VPP has to do in a particular test, the lower the forwarding rate it achieves. That is why l2patch tests are usually the fastest. (I am talking about MRR here: maximal offered load, and we measure the rate of packets making it through back to the Traffic Generator.) But I noticed this stops being true in some cases. Specifically, on Cascadelake testbeds, when VPP uses 4 physical cores (HT on, so 8 VPP workers), l2patch no longer has the best MRR. (Also seen on Skylake.)

After some examination, I selected two tests to compare. Both use bidirectional traffic, a single Intel xxv710 NIC, two ports (one per direction), 4 receive queues per port (9 transmit queues as automatically selected by VPP), the AVF driver, and l2 cross-connect. The difference is that one test handles dot1q on top of that.

Even though the dot1q test has larger vectors/call [1] than the other [2] (expected, as the loop with dot1q has more work to do), and the dot1q test shows a small number of rx discards [3] while the other shows none [4], with neither showing anything bad in "show error", the dot1q test still forwards above 35 Mpps (see the Message here [5] for the 10 trial results), compared to the other test's under 31 Mpps [6].

The same pattern can be seen in other tests, although there usually are details differing between the dot1q and plain variants. For example, with the DPDK driver there is a not-that-small amount of "rx missed", with dot1q already showing a smaller number [7] than the plain test [8]. (I have even seen tx errors in l2patch tests, but that could be a separate issue.)

So, have you seen this behavior? Can you explain it? The only guess I have is that the faster test polls the rx queues more frequently, and that somehow slows down the NIC's ability to actually receive packets fast enough. I give that less than 1% probability of explaining the difference. The workers are loaded in a fairly uniform way, so that is not the issue.
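For concreteness, the data paths being compared can be sketched in VPP CLI roughly as follows. This is a sketch only: the PCI addresses, VLAN IDs, and resulting interface names are placeholders, not the values used on the testbed.

```
comment { two AVF interfaces, 4 rx queues each (placeholder PCI addresses) }
create interface avf 0000:3b:02.0 num-rx-queues 4
create interface avf 0000:3b:06.0 num-rx-queues 4
set interface state avf-0/3b/2/0 up
set interface state avf-0/3b/6/0 up

comment { plain l2xc test: one cross-connect per direction }
set interface l2 xconnect avf-0/3b/2/0 avf-0/3b/6/0
set interface l2 xconnect avf-0/3b/6/0 avf-0/3b/2/0

comment { dot1q variant: cross-connect 802.1q subinterfaces instead }
create sub-interfaces avf-0/3b/2/0 100 dot1q 100
create sub-interfaces avf-0/3b/6/0 200 dot1q 200
set interface state avf-0/3b/2/0.100 up
set interface state avf-0/3b/6/0.200 up
set interface l2 xconnect avf-0/3b/2/0.100 avf-0/3b/6/0.200
set interface l2 xconnect avf-0/3b/6/0.200 avf-0/3b/2/0.100

comment { l2patch tests use the minimal patch path instead of l2 xconnect }
test l2patch rx avf-0/3b/2/0 tx avf-0/3b/6/0
test l2patch rx avf-0/3b/6/0 tx avf-0/3b/2/0
```

The point of the comparison is that the dot1q variant adds VLAN tag classification and rewrite on top of the plain l2xc path, and l2patch does even less work than l2xc, yet the ordering of the measured rates is the opposite of the ordering of the per-packet work.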
Perhaps something in dot1q handling makes it inherently cheaper for the NIC to process? Even then, I do not think that explains why dot1q-l2xc becomes faster than l2patch.

Vratko.

[1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k10-k1-k1-k4-k1
[2] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1-k2-k9-k1-k1-k4-k1
[3] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k10-k1-k8-k1
[4] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1-k2-k9-k1-k8-k1
[5] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1
[6] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1
[7] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k10-k1-k8-k1
[8] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k9-k1-k8-k1
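P.S. To clarify what I mean by an MRR result above: as I understand the CSIT methodology, each reported number is the average forwarding rate over the trials (10 here) run at maximum offered load. A small illustrative Python sketch; the trial values are made-up placeholders, not the logged results:

```python
# Illustrative only: how an MRR result is aggregated from trial measurements.
# The trial values below are made-up placeholders, not the logged results.

def mrr_mpps(trial_rates_pps):
    """Average forwarding rate over all trials, converted to Mpps."""
    return sum(trial_rates_pps) / len(trial_rates_pps) / 1e6

# Ten trials at maximum offered load; receive rate measured back at the TG.
trials = [35.1e6, 35.3e6, 35.2e6, 35.0e6, 35.4e6,
          35.2e6, 35.1e6, 35.3e6, 35.2e6, 35.2e6]
print(round(mrr_mpps(trials), 2))  # average over the 10 trials, in Mpps
```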
View/Reply Online (#16041): https://lists.fd.io/g/vpp-dev/message/16041