This thread topology is one of the highest-priority requests our product team has placed on our test efforts. We would like to be able to rely on the numbers that CSIT generates, but this topology in particular is a sticking point. We want to know the best per-core throughput, and our testing so far has shown that HT is a significant piece of achieving that.
I agree that the results are surprising. My working theory, which I have not
yet investigated, comprises two factors: 1) there is a lot of hardware/memory
induced instruction latency that allows the siblings to share the core very
efficiently, and 2) processing the hardware and virtual queues hits different
and/or replicated functional units, allowing more efficient sharing.

Karl

On 02/27/2017 11:58 AM, Alec Hothan (ahothan) wrote:
> It is actually pretty surprising and amazing that you see an almost
> linear scale when using hyper threads the way you do (pairing the 2
> sibling threads on 1 phys interface queue and 1 vhost queue). It is more
> common to see a factor of 1.2 to 1.4 when combining 2 sibling threads
> (as compared to using 1 core without hyper-threading). I think even
> Intel could not hope to see such a good efficiency ;-)
>
> It will be difficult to replicate this in a real openstack node given
> that the number of vhost interfaces will likely be larger than the
> number of cores assigned to the vswitch.
>
> I'm not sure how the CSIT test configures the cores and I suspect it is
> as you describe.
>
> Thanks
> Alec
>
> *From: *Karl Rister <kris...@redhat.com>
> *Organization: *Red Hat
> *Reply-To: *"kris...@redhat.com" <kris...@redhat.com>
> *Date: *Monday, February 20, 2017 at 11:29 AM
> *To: *Thomas F Herbert <therb...@redhat.com>, "Alec Hothan (ahothan)"
> <ahot...@cisco.com>, "Maciek Konstantynowicz (mkonstan)"
> <mkons...@cisco.com>
> *Cc: *Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober
> <dsh...@redhat.com>, "csit-...@lists.fd.io" <csit-...@lists.fd.io>,
> vpp-dev <vpp-dev@lists.fd.io>, "Michael Pedersen -X (michaped - Intel
> at Cisco)" <micha...@cisco.com>
> *Subject: *Re: [vpp-dev] Interesting perf test results from Red Hat's
> test team
>
> On 02/20/2017 08:43 AM, Thomas F Herbert wrote:
> > On 02/17/2017 06:18 PM, Alec Hothan (ahothan) wrote:
> > >
> > > Hi Karl
> > >
> > > Can you also tell which version of DPDK you were using for OVS and
> > > for VPP (for VPP is it the one bundled with 17.01?).
> >
> > DPDK 16.11 and VPP 17.01.
> >
> > > "The pps is the bi-directional sum of the packets received back at
> > > the traffic generator."
> > >
> > > Just to make sure…
> > >
> > > If your traffic gen sends 1 Mpps to each of the 2 interfaces and you
> > > get no drop (meaning you receive 1 Mpps from each interface), what
> > > do you report? 2 Mpps or 4 Mpps?
> >
> > 2 Mpps
> >
> > > You seem to say 2 Mpps (sum of all RX).
> > >
> > > The CSIT perf numbers report the sum(TX) = in the above example CSIT
> > > reports 2 Mpps.
> > >
> > > The CSIT numbers for 1 vhost/1 VM (practically similar to yours) are
> > > at about half of what you report:
> > >
> > > https://docs.fd.io/csit/rls1701/report/vpp_performance_results_hw/performance_results_hw.html#ge2p1x520-dot1q-l2xcbase-eth-2vhost-1vm-ndrpdrdisc
> > >
> > > Scroll down the table to tc13 and tc14: 4t4c (4 threads), L2XC, 64B,
> > > NDR 5.95 Mpps (aggregated TX of the 2 interfaces), PDR 7.47 Mpps,
> > > while the results in your slides put it at around 11 Mpps.
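To keep the comparison above apples-to-apples: the number in the slides is the
bi-directional sum of packets received back at the traffic generator, while
CSIT (per Alec's note) reports the aggregate TX of the two interfaces at
NDR/PDR; at zero loss the two conventions give the same figure. Below is a
minimal sketch of the three counting conventions discussed later in this
thread. It is illustrative only, not our actual test tooling, and the counter
names are made up:

# Illustrative only: three ways of summarizing a symmetric bi-directional,
# zero-drop PVP run at 1 Mpps offered per traffic-generator port.
tx_pps = {"port0": 1_000_000, "port1": 1_000_000}  # offered by the traffic gen
rx_pps = {"port0": 1_000_000, "port1": 1_000_000}  # received back (no drops)

# Slides report: bi-directional sum of packets received back at the generator.
slides_reported = sum(rx_pps.values())              # 2 Mpps

# CSIT reports: aggregate TX of the two interfaces at NDR/PDR.
csit_reported = sum(tx_pps.values())                # also 2 Mpps at zero loss

# A "vswitch centric" count sees each packet forwarded twice in PVP
# (phys->vhost and vhost->phys), i.e. 4 Mpps of forwarding work.
vswitch_centric = 2 * sum(rx_pps.values())          # 4 Mpps

print(slides_reported, csit_reported, vswitch_centric)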
> > > So either your testbed really switches 2 times more packets than the
> > > CSIT one, or you're actually reporting double the amount compared to
> > > how CSIT reports it…
> >
> > tc13 and tc14 both say "4 threads, 4 phy cores, 2 receive queues per
> > NIC port".
> >
> > In our configuration, when doing 2 queues we are actually using 8 CPU
> > threads on 4 cores -- a dpdk thread on one core thread and a vhost-user
> > thread on the other core thread. Our comparison of 1 thread per core
> > versus 2 threads per core (slide 3) showed that very little performance
> > was lost when packing the threads onto the cores in this way.
> >
> > For tc13 and tc14 I assume that each thread is polling on both the dpdk
> > and vhost-user interfaces at the same time, is that accurate? If so,
> > that is a lot different than our test, where each thread is only
> > polling a single interface.
> >
> > Attached is a dump of some vppctl command output that hopefully shows
> > exactly how our setup is configured.
> >
> > > Thanks
> > >
> > > Alec
> > >
> > > *From: *Karl Rister <kris...@redhat.com>
> > > *Organization: *Red Hat
> > > *Reply-To: *"kris...@redhat.com" <kris...@redhat.com>
> > > *Date: *Thursday, February 16, 2017 at 11:09 AM
> > > *To: *"Alec Hothan (ahothan)" <ahot...@cisco.com>, "Maciek
> > > Konstantynowicz (mkonstan)" <mkons...@cisco.com>, Thomas F Herbert
> > > <therb...@redhat.com>
> > > *Cc: *Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober
> > > <dsh...@redhat.com>, "csit-...@lists.fd.io" <csit-...@lists.fd.io>,
> > > vpp-dev <vpp-dev@lists.fd.io>
> > > *Subject: *Re: [vpp-dev] Interesting perf test results from Red
> > > Hat's test team
> > >
> > > On 02/15/2017 08:58 PM, Alec Hothan (ahothan) wrote:
> > > > Great summary slides Karl, I have a few more questions on the
> > > > slides.
> > > >
> > > > · Did you use OSP10/OSPD/ML2 to deploy your testpmd VM/configure
> > > > the vswitch, or is it direct launch using libvirt and direct config
> > > > of the vswitches? (this is a bit related to Maciek's question on
> > > > the exact interface configs in the vswitch)
> > >
> > > There was no use of OSP in these tests, the guest is launched via
> > > libvirt and the vswitches are manually launched and configured with
> > > shell scripts.
> > >
> > > > · Unclear if all the chart results were measured using 4 phys
> > > > cores (no HT) or 2 phys cores (4 threads with HT)
> > >
> > > Only slide 3 has any 4 core (no HT) data; all other data is captured
> > > using HT on the appropriate number of cores: 2 for single queue, 4
> > > for two queue, and 6 for three queue.
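As a side note on reproducing the placement described above (one thread
polling the physical queue and one polling the vhost-user queue on the two
hardware threads of the same physical core): the sibling pairs can be read
from sysfs before pinning. This is only an illustrative sketch, not the actual
test scripts; it just prints the hyper-thread sibling pairs to feed into
whatever pinning mechanism is in use:

# Illustrative sketch: list hyper-thread sibling pairs so that the phys-NIC
# polling thread and the vhost-user polling thread for a given queue can be
# pinned to the two hardware threads of the same physical core.
import glob

def sibling_pairs():
    pairs = set()
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        with open(path) as f:
            text = f.read().strip()      # e.g. "2,26" or "2-3" depending on kernel
        cpus = sorted(int(p) for p in text.replace("-", ",").split(","))
        pairs.add(tuple(cpus))
    return sorted(pairs)

if __name__ == "__main__":
    for pair in sibling_pairs():
        # e.g. (2, 26): pin the dpdk thread to one, the vhost thread to the other
        print(pair)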
> > > > · How do you report your pps? ;-) Are those:
> > > > o vswitch centric (how many packets the vswitch forwards per
> > > >   second, coming from the traffic gen and from the VMs)
> > > > o or traffic gen centric aggregated TX (how many pps are sent by
> > > >   the traffic gen on both interfaces)
> > > > o or traffic gen centric aggregated TX+RX (how many pps are sent
> > > >   and received by the traffic gen on both interfaces)
> > >
> > > The pps is the bi-directional sum of the packets received back at
> > > the traffic generator.
> > >
> > > > · From the numbers shown, it looks like it is the first or the last.
> > > >
> > > > · Unidirectional or symmetric bi-directional traffic?
> > >
> > > Symmetric bi-directional.
> > >
> > > > · BIOS Turbo boost enabled or disabled?
> > >
> > > Disabled.
> > >
> > > > · How many vcpus running the testpmd VM?
> > >
> > > 3, 5, or 7: 1 VCPU for housekeeping and then 2 VCPUs for each queue
> > > configuration. Only the required VCPUs are active for any
> > > configuration, so the VCPU count varies depending on the
> > > configuration being tested.
> > >
> > > > · How do you range the combinations in your 1M flows src/dest MAC?
> > > > I'm not aware of any real NFV cloud deployment/VNF that handles
> > > > that type of flow pattern, do you?
> > >
> > > We increment all the fields being modified by one for each packet
> > > until we hit a million, and then we restart at the base value and
> > > repeat. So all IPs and/or MACs get modified in unison.
> > >
> > > We actually arrived at the srcMac,dstMac configuration in a backwards
> > > manner. On one of our systems where we develop the traffic generator
> > > we were getting an error when doing srcMac,dstMac,srcIp,dstIp that we
> > > couldn't figure out in the time needed for this work, so we were
> > > going to just go with srcMac,dstMac due to time constraints. However,
> > > on the system where we actually did the testing both worked, so I
> > > just collected both out of curiosity.
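To make the flow ranging described above concrete, here is an illustrative
sketch of the increment pattern (not the actual traffic-generator code; the
base MAC values are made up). Every modified field steps by one per packet,
in unison, and wraps back to its base value after 1M flows:

# Illustrative sketch of the flow ranging described above: every modified
# field is incremented by one per packet, in unison, and wraps back to its
# base value after FLOW_COUNT packets (1M unique flows).
FLOW_COUNT = 1_000_000

BASE_SRC_MAC = 0x001000000000   # made-up base values, for illustration only
BASE_DST_MAC = 0x002000000000

def mac_str(value):
    return ":".join(f"{(value >> shift) & 0xff:02x}" for shift in range(40, -8, -8))

def flow_fields(packet_index):
    offset = packet_index % FLOW_COUNT       # restart at the base value after 1M
    return {
        "src_mac": mac_str(BASE_SRC_MAC + offset),
        "dst_mac": mac_str(BASE_DST_MAC + offset),
    }

# First two packets and the wrap-around packet:
for i in (0, 1, FLOW_COUNT):
    print(i, flow_fields(i))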
> > > > Thanks
> > > >
> > > > Alec
> > > >
> > > > *From: *<vpp-dev-boun...@lists.fd.io> on behalf of "Maciek
> > > > Konstantynowicz (mkonstan)" <mkons...@cisco.com>
> > > > *Date: *Wednesday, February 15, 2017 at 1:28 PM
> > > > *To: *Thomas F Herbert <therb...@redhat.com>
> > > > *Cc: *Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober
> > > > <dsh...@redhat.com>, "csit-...@lists.fd.io" <csit-...@lists.fd.io>,
> > > > vpp-dev <vpp-dev@lists.fd.io>, Karl Rister <kris...@redhat.com>
> > > > *Subject: *Re: [vpp-dev] Interesting perf test results from Red
> > > > Hat's test team
> > > >
> > > > Thomas, many thanks for sending this.
> > > >
> > > > A few comments and questions after reading the slides:
> > > >
> > > > 1. s3 clarification - host and data plane thread setup - vswitch
> > > >    pmd (data plane) thread placement
> > > >    a. “1PMD/core (4 core)” - HT (SMT) disabled, 4 phy cores used
> > > >       for vswitch, each with a data plane thread.
> > > >    b. “2PMD/core (2 core)” - HT (SMT) enabled, 2 phy cores, 4
> > > >       logical cores used for vswitch, each with a data plane thread.
> > > >    c. in both cases each data plane thread handling a single
> > > >       interface - 2* physical, 2* vhost => 4 threads, all busy.
> > > >    d. in both cases frames are dropped by vswitch or in vring due
> > > >       to vswitch not keeping up - IOW testpmd in kvm guest is not
> > > >       DUT.
> > > > 2. s3 question - vswitch setup - it is unclear what the forwarding
> > > >    mode of each vswitch is, as only srcIp changed in flows
> > > >    a. flow or MAC learning mode?
> > > >    b. port to port crossconnect?
> > > > 3. s3 comment - host and data plane thread setup
> > > >    a. “2PMD/core (2 core)” case - thread placement may yield
> > > >       different results:
> > > >       - physical interface threads as siblings vs.
> > > >       - physical and virtual interface threads as siblings.
> > > >    b. “1PMD/core (4 core)” - one would expect these to be much
> > > >       higher than “2PMD/core (2 core)”
> > > >       - speculation: possibly due to "instruction load" imbalance
> > > >         between threads.
> > > >       - two types of thread with different "instruction load":
> > > >         phy->vhost vs. vhost->phy
> > > >       - "instruction load" = instr/pkt, instr/cycle (IPC
> > > >         efficiency).
> > > > 4. s4 comment - results look as expected for vpp
> > > > 5. s5 question - unclear why throughput doubled
> > > >    a. e.g. for vpp from "11.16 Mpps" to "22.03 Mpps"
> > > >    b. if only queues increased, and cpu resources did not, or have
> > > >       they?
> > > > 6. s6 question - similar to point 5 - unclear cpu and thread
> > > >    resources.
> > > > 7. s7 comment - anomaly for 3q (virtio multi-queue) for
> > > >    (srcMAC,dstMAC)
> > > >    a. could be due to flow hashing inefficiency.
> > > >
> > > > -Maciek
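On point 7a above, here is a toy illustration of how (srcMAC,dstMAC) pairs
that increment in unison can hash very unevenly across multiple queues. The
XOR-of-low-bytes hash below is a deliberately simple stand-in, not the hash
used by any real NIC or virtio implementation; it only shows why the 3-queue
case could be sensitive to this flow pattern:

# Toy illustration of flow-hashing inefficiency: when srcMAC and dstMAC are
# incremented in unison, a simple hash of the pair can collapse onto very few
# values, leaving some queues nearly idle. The XOR-of-low-bytes hash is a
# stand-in, not the hash used by any real NIC or virtio implementation.
from collections import Counter

QUEUES = 3
FLOWS = 1_000_000
BASE_SRC = 0x001000000000   # made-up base MACs, matching the sketch above
BASE_DST = 0x002000000000

counts = Counter()
for i in range(FLOWS):
    src = BASE_SRC + i
    dst = BASE_DST + i
    h = (src & 0xff) ^ (dst & 0xff)   # toy hash: XOR of the low MAC bytes
    counts[h % QUEUES] += 1

# With this toy hash the per-queue load is far from FLOWS / QUEUES each.
print(counts)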
> > > > On 15 Feb 2017, at 17:34, Thomas F Herbert <therb...@redhat.com>
> > > > wrote:
> > > >
> > > > > Here are test results on VPP 17.01 compared with OVS/DPDK
> > > > > 2.6/16.11 performed by Karl Rister of Red Hat.
> > > > > This is PVP testing with 1, 2 and 3 queues. It is an interesting
> > > > > comparison with the CSIT results. Of particular interest is the
> > > > > drop off on the 3 queue results.
> > > > > --TFH
> > > > >
> > > > > --
> > > > > *Thomas F Herbert*
> > > > > SDN Group
> > > > > Office of Technology
> > > > > *Red Hat*
> > > > >
> > > > > <vpp-17.01_vs_ovs-2.6.pdf>
> > >
> > > --
> > > Karl Rister <kris...@redhat.com>
> >
> > --
> > *Thomas F Herbert*
> > SDN Group
> > Office of Technology
> > *Red Hat*
> >
> > --
> > Karl Rister <kris...@redhat.com>

--
Karl Rister <kris...@redhat.com>

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev