It is actually pretty surprising and impressive that you see almost linear
scaling when using hyper-threads the way you do (pairing the 2 sibling threads
on 1 physical interface queue and 1 vhost queue). It is more common to see a
factor of 1.2 to 1.4 when combining 2 sibling threads (compared to using 1
core without hyper-threading). I think even Intel could not hope for such good
efficiency ;-)
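
To make the claim concrete, here is a minimal back-of-the-envelope sketch in
Python (the 3.0 Mpps per-core baseline is purely hypothetical, not a number
from the slides):

    # Hyper-thread scaling: typical sibling-pair gain vs. "almost linear".
    baseline_pps_per_core = 3.0e6        # assumed no-HT throughput of one core

    typical_ht_factors = (1.2, 1.4)      # common gain from 2 sibling threads
    linear_ht_factor = 2.0               # what almost-linear scaling implies

    for factor in (*typical_ht_factors, linear_ht_factor):
        pps = baseline_pps_per_core * factor
        print(f"HT factor {factor:.1f} -> {pps / 1e6:.1f} Mpps per core pair")
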
It will be difficult to replicate this in a real OpenStack node, given that the
number of vhost interfaces will likely be larger than the number of cores
assigned to the vswitch.

I’m not sure how the CSIT test configures the cores and I suspect it is as you 
describe.

Thanks
   Alec



From: Karl Rister <kris...@redhat.com>
Organization: Red Hat
Reply-To: "kris...@redhat.com" <kris...@redhat.com>
Date: Monday, February 20, 2017 at 11:29 AM
To: Thomas F Herbert <therb...@redhat.com>, "Alec Hothan (ahothan)" 
<ahot...@cisco.com>, "Maciek Konstantynowicz (mkonstan)" <mkons...@cisco.com>
Cc: Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober 
<dsh...@redhat.com>, "csit-...@lists.fd.io" <csit-...@lists.fd.io>, vpp-dev 
<vpp-dev@lists.fd.io>, "Michael Pedersen -X (michaped - Intel at Cisco)" 
<micha...@cisco.com>
Subject: Re: [vpp-dev] Interesting perf test results from Red Hat's test team

On 02/20/2017 08:43 AM, Thomas F Herbert wrote:
On 02/17/2017 06:18 PM, Alec Hothan (ahothan) wrote:

Hi Karl



Can you also tell which version of DPDK you were using for OVS and for
VPP (for VPP is it the one bundled with 17.01?).

DPDK 16.11 and VPP 17.01.



“The pps is the bi-directional sum of the packets received back at the
traffic generator.”

Just to make sure….



If your traffic gen sends 1 Mpps to each of the 2 interfaces and you get no
drop (meaning you receive 1 Mpps back from each interface), what do you
report? 2 Mpps or 4 Mpps?

2 Mpps


You seem to say 2 Mpps (sum of all RX).



The CSIT perf numbers report sum(TX): in the above example, CSIT would report
2 Mpps.
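
For clarity, a minimal sketch of the reporting conventions under discussion,
using the 1 Mpps per interface example above (all numbers are illustrative):

    # Two interfaces, symmetric bi-directional traffic, no drops.
    tx_per_interface = 1.0e6   # pps sent by the traffic gen into each interface
    rx_per_interface = 1.0e6   # pps received back per interface (no drops)
    interfaces = 2

    aggregated_tx = tx_per_interface * interfaces        # 2 Mpps (CSIT: sum of TX)
    aggregated_rx = rx_per_interface * interfaces        # 2 Mpps (sum of all RX)
    aggregated_tx_rx = aggregated_tx + aggregated_rx     # 4 Mpps (TX+RX convention)

    print(aggregated_tx / 1e6, aggregated_rx / 1e6, aggregated_tx_rx / 1e6)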

The CSIT numbers for 1 vhost/1 VM (practically similar to yours) are about
half of what you report.



https://docs.fd.io/csit/rls1701/report/vpp_performance_results_hw/performance_results_hw.html#ge2p1x520-dot1q-l2xcbase-eth-2vhost-1vm-ndrpdrdisc





Scroll down the table to tc13 and tc14: 4t4c (4 threads), L2XC, 64B, NDR
5.95 Mpps (aggregated TX of the 2 interfaces), PDR 7.47 Mpps,

while the results in your slides put it at around 11 Mpps.



So either your testbed really switches 2 times more packets than the
CSIT one, or you’re actually reporting double the amount compared to
how CSIT reports it…

tc13 and tc14 both say "4 threads, 4 phy cores, 2 receive queues per NIC
port".

In our configuration, when doing 2 queues we are actually using 8 CPU threads
on 4 cores -- a DPDK thread on one hyper-thread of the core and a vhost-user
thread on the other.  Our comparison of 1 thread per core versus 2 threads
per core (slide 3) showed that very little performance was lost when packing
the threads onto the cores in this way.
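
A minimal sketch of that pinning scheme as I read it (the core and thread IDs
below are made up for illustration; the real IDs are in the attached vppctl
output):

    # 2-queue case: 4 physical cores, each with 2 hyper-thread siblings.
    # One sibling polls a physical (DPDK) queue, the other a vhost-user queue.
    cores = [0, 1, 2, 3]                       # hypothetical physical core IDs
    siblings = {c: (c, c + 4) for c in cores}  # assumed sibling numbering (c, c+4)

    pinning = {}
    for i, core in enumerate(cores):
        dpdk_thread, vhost_thread = siblings[core]
        pinning[dpdk_thread] = f"DPDK phys queue {i}"
        pinning[vhost_thread] = f"vhost-user queue {i}"

    for cpu, role in sorted(pinning.items()):
        print(f"cpu {cpu}: {role}")
    # => 8 CPU threads on 4 cores, each thread polling exactly one interface queue.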

For tc13 and tc14 I assume that each thread is polling both the DPDK and
vhost-user interfaces at the same time -- is that accurate?  If so, that is
quite different from our test, where each thread is only polling a single
interface.

Attached is a dump of some vppctl command output that hopefully shows
exactly how our setup is configured.




Thanks



  Alec







     *From: *Karl Rister <kris...@redhat.com>
     *Organization: *Red Hat
     *Reply-To: *"kris...@redhat.com" <kris...@redhat.com>
     *Date: *Thursday, February 16, 2017 at 11:09 AM
     *To: *"Alec Hothan (ahothan)" <ahot...@cisco.com>, "Maciek Konstantynowicz
     (mkonstan)" <mkons...@cisco.com>, Thomas F Herbert <therb...@redhat.com>
     *Cc: *Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober
     <dsh...@redhat.com>, "csit-...@lists.fd.io" <csit-...@lists.fd.io>,
     vpp-dev <vpp-dev@lists.fd.io>
     *Subject: *Re: [vpp-dev] Interesting perf test results from Red Hat's
     test team



     On 02/15/2017 08:58 PM, Alec Hothan (ahothan) wrote:



         Great summary slides Karl, I have a few more questions on the
         slides.



         ·         Did you use OSP10/OSPD/ML2 to deploy your testpmd VM and
         configure the vswitch, or is it a direct launch using libvirt and
         direct config of the vswitches? (This is a bit related to Maciek's
         question on the exact interface configs in the vswitch.)



     There was no use of OSP in these tests; the guest is launched via
     libvirt and the vswitches are manually launched and configured with
     shell scripts.



         ·         Unclear if all the chart results were measured using 4 phys
         cores (no HT) or 2 phys cores (4 threads with HT)



     Only slide 3 has any 4-core (no HT) data; all other data is captured
     using HT on the appropriate number of cores: 2 for single queue, 4 for
     two queues, and 6 for three queues.



         ·         How do you report your pps? ;-) Are those
         o   vswitch centric (how many packets the vswitch forwards per second,
             coming from the traffic gen and from the VMs)
         o   or traffic gen centric aggregated TX (how many pps are sent by
             the traffic gen on both interfaces)
         o   or traffic gen centric aggregated TX+RX (how many pps are sent
             and received by the traffic gen on both interfaces)



     The pps is the bi-directional sum of the packets received back at the
     traffic generator.



         ·         From the numbers shown, it looks like it is the first or
         the last.
         ·         Unidirectional or symmetric bi-directional traffic?



     symmetric bi-directional



         ·         BIOS Turbo boost enabled or disabled?



     disabled



         ·         How many vCPUs are running the testpmd VM?



     3, 5, or 7: 1 vCPU for housekeeping and then 2 vCPUs for each queue.  Only
     the required vCPUs are active for any configuration, so the vCPU count
     varies depending on the configuration being tested.
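
     In other words (a trivial sketch of the vCPU count rule just described;
     the helper name is mine, not from the test scripts):

         def guest_vcpus(queues: int) -> int:
             """1 housekeeping vCPU plus 2 vCPUs per queue, per the answer above."""
             return 1 + 2 * queues

         assert [guest_vcpus(q) for q in (1, 2, 3)] == [3, 5, 7]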



         ·         How do you range the combinations in your 1M flows src/dest
         MAC? I'm not aware of any real NFV cloud deployment/VNF that handles
         that type of flow pattern, do you?



     We increment all the fields being modified by one for each packet until
     we hit a million and then we restart at the base value and repeat.  So
     all IPs and/or MACs get modified in unison.
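
     A minimal sketch of that ranging scheme as described (field names and
     base values are hypothetical; this is not the actual traffic generator
     code):

         from itertools import count

         FLOWS = 1_000_000  # wrap after one million combinations

         def flow_offsets():
             """Yield the per-packet offset applied to every modified field in unison."""
             for pkt in count():
                 yield pkt % FLOWS  # increment by one per packet, restart at the base

         # Example: the same offset drives srcMac, dstMac (and IPs when enabled).
         gen = flow_offsets()
         for _ in range(3):
             offset = next(gen)
             print(f"srcMac = base + {offset}, dstMac = base + {offset}")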



     We actually arrived at the srcMac,dstMac configuration in a backwards
     manner.  On one of the systems where we develop the traffic generator we
     were getting an error when doing srcMac,dstMac,srcIp,dstIp that we
     couldn't figure out in the time available for this work, so we were going
     to just go with srcMac,dstMac due to time constraints.  However, on the
     system where we actually did the testing both worked, so I just collected
     both out of curiosity.





         Thanks



            Alec





              *From: *<vpp-dev-boun...@lists.fd.io> on behalf of "Maciek
              Konstantynowicz (mkonstan)" <mkons...@cisco.com>
              *Date: *Wednesday, February 15, 2017 at 1:28 PM
              *To: *Thomas F Herbert <therb...@redhat.com>
              *Cc: *Andrew Theurer <atheu...@redhat.com>, Douglas Shakshober
              <dsh...@redhat.com>, "csit-...@lists.fd.io"
              <csit-...@lists.fd.io>, vpp-dev <vpp-dev@lists.fd.io>, Karl
              Rister <kris...@redhat.com>
              *Subject: *Re: [vpp-dev] Interesting perf test results from Red
              Hat's test team



              Thomas, many thanks for sending this.



              A few comments and questions after reading the slides:



              1. s3 clarification - host and data plane thread setup - vswitch
              pmd (data plane) thread placement
                  a. "1PMD/core (4 core)" - HT (SMT) disabled, 4 phy cores used
                  for vswitch, each with a data plane thread.
                  b. "2PMD/core (2 core)" - HT (SMT) enabled, 2 phy cores, 4
                  logical cores used for vswitch, each with a data plane thread.
                  c. in both cases each data plane thread handles a single
                  interface - 2* physical, 2* vhost => 4 threads, all busy.
                  d. in both cases frames are dropped by the vswitch or in the
                  vring due to the vswitch not keeping up - IOW testpmd in the
                  kvm guest is not the DUT.
              2. s3 question - vswitch setup - it is unclear what the
              forwarding mode of each vswitch is, as only srcIp changed in flows
                  a. flow or MAC learning mode?
                  b. port to port crossconnect?
              3. s3 comment - host and data plane thread setup
                  a. "2PMD/core (2 core)" case - thread placement may yield
                  different results
                      - physical interface threads as siblings vs.
                      - physical and virtual interface threads as siblings.
                  b. "1PMD/core (4 core)" - one would expect these to be much
                  higher than "2PMD/core (2 core)"
                      - speculation: possibly due to "instruction load"
                      imbalance between threads.
                      - two types of thread with different "instruction load":
                      phy->vhost vs. vhost->phy
                      - "instruction load" = instr/pkt, instr/cycle (IPC
                      efficiency).
              4. s4 comment - results look as expected for vpp
              5. s5 question - unclear why throughput doubled
                  a. e.g. for vpp from "11.16 Mpps" to "22.03 Mpps"
                  b. if only queues increased and cpu resources did not, or
                  have they?
              6. s6 question - similar to point 5 - unclear cpu and thread
              resources.
              7. s7 comment - anomaly for 3q (virtio multi-queue) for
              (srcMac,dstMac)
                  a. could be due to flow hashing inefficiency.

              -Maciek



                  On 15 Feb 2017, at 17:34, Thomas F Herbert
                  <therb...@redhat.com> wrote:



                  Here are test results on VPP 17.01 compared with OVS/DPDK 2.6
                  (DPDK 16.11), performed by Karl Rister of Red Hat.

                  This is PVP testing with 1, 2 and 3 queues. It is an
                  interesting comparison with the CSIT results. Of particular
                  interest is the drop-off in the 3 queue results.

                  --TFH



                  --
                  *Thomas F Herbert*
                  SDN Group
                  Office of Technology
                  *Red Hat*

                  <vpp-17.01_vs_ovs-2.6.pdf>








     --
     Karl Rister <kris...@redhat.com>



--
*Thomas F Herbert*
SDN Group
Office of Technology
*Red Hat*


--
Karl Rister <kris...@redhat.com>

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
