Hi Alec,
Thanks for your response.

Here are my answers to your questions/remarks.
I didn't insert them inline but extracted the Q&A to keep the reading clear.

Best regards,
Pierrick

[alec]
The reference NIC that we use is Intel X710, XL710 and XXV710 (10G, 25G and 
40G) and we have not seen any particular issues with rerunning benchmarks over 
and over again using the same TRex instance.
Intel X520 is clearly not a preferred NIC because it lacks many features only available in newer-generation NICs (offload).
I may be able to get hold of a Mellanox 25G NIC soon, and this report is definitely a good step towards optimizing on that NIC.

[pierrick]
OK, the X520 NIC is not our reference card either... but I had no X710 card installed at the time.
I ran some reference tests to look for possible common-mode software issues (i.e. possibly NIC-independent).

As regards the Mellanox ConnectX-5 card, our present challenge was rather to build a high-rate generator using only one dual-port NIC, if possible.
Actually we couldn't reach the same performance with an XXV710, even using the T-Rex server alone.
(With 64-byte packets, bidirectional: ConnectX-5 <=> 2 x 25 Gbit/s | XXV710 <=> ~2 x 12 Gbit/s with T-Rex, ~2 x 9.5 Gbit/s with NFVbench 3.5.0.)
   *
*   *
[alec]
Reboot of the server that runs nfvbench?

[pierrick]
Yes! This is the only way I found to return to the "bad" initial condition, i.e. when NFVbench shows poorer performance.
   *
*   *
[alec]
Are you saying that once you launch and restart TRex, all future NFVbench runs 
(with the second instance of Trex) work better? Please explain this in detail 
(eg describe the exact steps in sequence that you performed).
Eg.
Reboot server
Start nfvbench for first time (will launch TRex for the first time) = poor 
performance (quantify)
Restart TRex
Run same benchmark = good performance (quantify)
[alec]
Can you provide some numbers with/without cache?
I have not tested the cache size option for STLScVmRaw but this is definitely 
something worth trying on Intel NIC X710 family.

[pierrick]
Here are reported our test cases and results using a dual-port NIC Mellanox 
Connect 5 - with or without a prior unique T-Rex server run/stop sequence


1) - Fresh reboot
   - Perform four series of NFVbench tests

2) - Fresh reboot
   - Start a T-Rex server: standalone launch, same release as the one embedded in the container (v2.59), but run on the host
       /bin/bash ./t-rex-64 --no-scapy-server --iom 0 --cfg /etc/LLL.yaml -i -c 6
       ./_t-rex-64 --no-scapy-server --iom 0 --cfg /etc/LLL.yaml -i -c 6 --mlx4-so --mlx5-so
   - Stop the T-Rex server
   - Perform four series of NFVbench tests

The NFVbench tests are the following:
     x.1 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --extra-stats --cache-size=0
     x.2 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --extra-stats --cache-size=10000
     x.3 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --cache-size=0
     x.4 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --cache-size=10000

Cases 1.1 and 2.1 correspond to the default NFVbench v3.5.0 behaviour.

+------+-------+------------+----------------+------------------+-----------+---------------+-----------------+------------+------------------+
| case | cache | flow stats | Req. TX (Gbps) | Actual TX (Gbps) | RX (Gbps) | Req. TX (pps) | Actual TX (pps) | RX (pps)   | variability      |
+======+=======+============+================+==================+===========+===============+=================+============+==================+
| 1.1  |     0 | all        |        50.0000 |          11.4679 |   11.4679 |    74,404,760 |      17,065,302 | 17,065,302 | +/- 0.5 Gbits/s  |
| 1.2  | 10000 | all        |        50.0000 |          10.3619 |   10.3619 |    74,404,760 |      15,419,544 | 15,419,544 | +/- 0.5 Gbits/s  |
| 1.3  |     0 | latency    |        50.0000 |          24.5055 |   24.5055 |    74,404,760 |      36,466,471 | 36,466,471 | +/- 0.01 Gbits/s |
| 1.4  | 10000 | latency    |        50.0000 |          50.0000 |   50.0000 |    74,404,760 |      74,404,762 | 74,404,762 | +/- 0 Gbits/s    |
+======+=======+============+================+==================+===========+===============+=================+============+==================+
| 2.1  |     0 | all        |        50.0000 |          22.5319 |   13.9766 |    74,404,760 |      33,529,639 | 20,798,442 | +/- 0.3 Gbits/s  |
| 2.2  | 10000 | all        |        50.0000 |          50.0000 |    8.4314 |    74,404,760 |      74,404,762 | 12,546,747 | +/- 0.2 Gbits/s  |
| 2.3  |     0 | latency    |        50.0000 |          24.3776 |   24.3776 |    74,404,760 |      36,276,261 | 36,276,261 | +/- 0.2 Gbits/s  |
| 2.4  | 10000 | latency    |        50.0000 |          50.0000 |   50.0000 |    74,404,760 |      74,404,762 | 74,404,762 | +/- 0 Gbits/s    |
+------+-------+------------+----------------+------------------+-----------+---------------+-----------------+------------+------------------+
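
For context, the cache values above correspond to the cache_size argument passed to STLScVmRaw() in trex_gen.py. Below is a minimal, self-contained sketch of that call pattern using the TRex STL Python API (the packet layout, addresses and flow-variable names are illustrative, not the actual NFVbench code):

    # Minimal sketch (illustrative, not the NFVbench code itself): how a
    # cache_size value is handed to the T-Rex Field Engine via STLScVmRaw(),
    # and how flow stats are attached to a stream only when requested.
    from trex.stl.api import (STLStream, STLPktBuilder, STLTXCont, STLFlowStats,
                              STLScVmRaw, STLVmFlowVar, STLVmWrFlowVar, STLVmFixIpv4)
    from scapy.all import Ether, IP, UDP

    def build_stream(cache_size=10000, extra_stats=False, pg_id=1):
        base_pkt = Ether() / IP(src="10.0.0.1", dst="20.0.0.1") / UDP(sport=1025, dport=53)
        # Field Engine program: rotate the source IP to emulate 10000 flows.
        vm = STLScVmRaw([
            STLVmFlowVar(name="ip_src", min_value="10.0.0.1",
                         max_value="10.0.39.16", size=4, op="inc"),
            STLVmWrFlowVar(fv_name="ip_src", pkt_offset="IP.src"),
            STLVmFixIpv4(offset="IP"),
        ], cache_size=cache_size)   # 0 disables the FE cache, 10000 enables it
        return STLStream(packet=STLPktBuilder(pkt=base_pkt, vm=vm),
                         # per-chain flow stats only when requested (--extra-stats)
                         flow_stats=STLFlowStats(pg_id=pg_id) if extra_stats else None,
                         mode=STLTXCont())

As far as we understand, with a non-zero cache_size the Field Engine pre-builds that many packet variants once instead of rewriting the headers of every transmitted packet, which would explain the TX-side gap between the x.1/x.3 and x.2/x.4 cases.
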
   *
*   *
[alec]
If you provide me the diff I can tell you what it does.

[pierrick]
In trex_gen.py:

    streams.append(STLStream(packet=pkt,
                             flow_stats=STLFlowStats(pg_id=pg_id),
                             mode=STLTXCont()))

was changed to use a conditional expression:

    streams.append(STLStream(packet=pkt,
                             flow_stats=STLFlowStats(pg_id=pg_id)
                                 if self.config.extra_stats else None,
                             mode=STLTXCont()))
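
For completeness, here is a minimal sketch of the same conditional applied to the low-rate latency streams; this is only the idea behind the --no-latency-stats switch, written against the TRex STL API, and the function/parameter names are mine, not the actual NFVbench code:

    # Sketch only: apply the same pattern to the low-rate latency streams.
    # 'want_latency_stats' stands in for the --no-latency-stats switch; the real
    # NFVbench code is organised differently.
    from trex.stl.api import STLStream, STLTXCont, STLFlowLatencyStats

    def build_latency_stream(pkt, pg_id, want_latency_stats=True, pps=1000):
        return STLStream(packet=pkt,
                         # without STLFlowLatencyStats the stream is still sent,
                         # but T-Rex no longer computes per-flow latency for it
                         flow_stats=STLFlowLatencyStats(pg_id=pg_id)
                             if want_latency_stats else None,
                         mode=STLTXCont(pps=pps))  # latency streams run at a low rate
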
   *
*   *
[alec]
Is that with/without cache, with/without restart?
As mentioned above, we have not seen such issue with Intel NIC (25G/40G).

[pierrick]
See the results reported above.
   *
*   *
[alec]
we actually had a mode to disable latency completely at one point but decided 
to always leave it on as we did not see any negative side effect. We can 
certainly reinstate the option to disable latency for runs that do not care 
about latency and prioritize highest throughput.

[pierrick]
I did it for testing too. Like you, I guess that the latency measurement in itself should not impair throughput performance (the latency packets are sent at a low rate).
   *
*   *
[alec]
I would not use X520 to make any judgment because this NIC has shown to be hard 
to work with (to get consistent good numbers for all use cases).
But this observation seems to indicate that flow stats for latency streams are 
costly.
Flow stats are important in nfvbench because they allow us to measure exact 
packet accounting per chain. In case of drops we know exactly which chain(s) 
are dropping and in what direction.
So maybe something to work with TRex team to optimize.

[pierrick]
Cf. my previous answer. Actually we don't really care about the X520 NIC either.
Our study was mainly based on the Mellanox ConnectX-5 device, since we had temporarily set aside the Intel XXV710 because of its poor hardware performance.
See https://fast.dpdk.org/doc/perf/DPDK_19_05_Intel_NIC_performance_report.pdf <=> one dual-port NIC cannot deal with the required 74.4 Mpps rate (the max is only 36.7 Mpps).
As soon as we have deployed a new 25 Gbit/s based infrastructure we'll come back to the recommended XXV710, but we don't expect much of it on the traffic generator side.

We understand that flow stats can be valuable, provided they don't introduce a severe bottleneck.
Anyway, we were puzzled by some observations showing good performance when high-rate flow stats were active on their own.
So you're right, there might be room for software optimization.
   *
*   *
[alec]
This is great investigative work for a NIC that I have not used.
What I would suggest is upstream the cache size option. Then I’ll be able to 
test it on the Intel NIC family.
Worth considering upstreaming the no latency option as well.

[pierrick]
I am about to push the modifications to the code that I mentioned above.
I didn't make any structural changes, just added some facilities for investigation.
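
For reference, the switches listed in my original mail below (--cores, --cache-size, --extra-stats, --no-latency-stats, --no-latency-streams, ...) are plain command-line flags. A minimal, illustrative sketch of how such flags could be declared (the argparse destinations and defaults are assumptions, not the actual NFVbench CLI code):

    # Illustrative only: how the investigation switches could be exposed on a CLI.
    # The destinations and defaults are assumptions, not the actual NFVbench code.
    import argparse

    parser = argparse.ArgumentParser(description='NFVbench investigation switches (sketch)')
    parser.add_argument('--cores', dest='cores', type=int, default=None,
                        help="override the T-Rex 'cores' parameter")
    parser.add_argument('--cache-size', dest='cache_size', type=int, default=0,
                        help='FE cache size (0 = disabled, <0 = use the flow count)')
    parser.add_argument('--extra-stats', dest='extra_stats', action='store_true',
                        help='enable flow stats on the high-rate traffic streams')
    parser.add_argument('--no-latency-stats', dest='no_latency_stats', action='store_true',
                        help='disable flow stats for the latency streams')
    parser.add_argument('--no-latency-streams', dest='no_latency_streams', action='store_true',
                        help='disable latency measurements entirely (no latency streams)')

    opts = parser.parse_args(['--cache-size', '10000', '--extra-stats'])
    print(opts.cache_size, opts.extra_stats)   # 10000 True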

--------------------------------------------------------------------------------


orange labs

ORANGE/IMT/OLN/CISS/IIE/AMI


Pierrick LOUIN


4, rue du Clos Courtel  -  BP 91226

35512  CESSON SEVIGNÉ  cédex
tel    : +33 2 99 12 48 23
mobile : +33 6 43 33 01 04
e-mail : pierrick.lo...@orange.com<mailto:pierrick.lo...@orange.com>
--------------------------------------------------------------------------------

From: Alec Hothan (ahothan) [mailto:ahot...@cisco.com]
Sent: Thursday, 1 August 2019 18:06
To: LOUIN Pierrick TGI/OLN
Cc: opnfv-tech-discuss@lists.opnfv.org; Pedersen Michael S
Subject: Re: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances using NFVbench - UPDATED

Hi Pierrick,

Thanks for sending this detailed report.
I think the anomalies you are seeing may be due to the choice of NIC. Not that the Mellanox NIC is bad, but it is a NIC that not many people have tested NFVbench with yet.
I am copying Michael as he may have used this same Mellanox NIC with NFVbench 
in a CNCF benchmarking project (VM and container benchmarking with k8s).


More inline…


From: <opnfv-tech-discuss@lists.opnfv.org> on behalf of "Pierrick Louin via 
Lists.Opnfv.Org" <pierrick.louin=orange....@lists.opnfv.org>
Reply-To: "pierrick.lo...@orange.com" <pierrick.lo...@orange.com>
Date: Wednesday, July 31, 2019 at 9:03 AM
To: "Alec Hothan (ahothan)" <ahot...@cisco.com>
Cc: "opnfv-tech-discuss@lists.opnfv.org" <opnfv-tech-discuss@lists.opnfv.org>
Subject: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances 
using NFVbench - UPDATED

Hi Alec,

I'm Pierrick Louin, working in the team of François-Régis Menguy at Orange Labs.

In our latest experiments using the NFVbench application, we came across some issues, and a workaround, while trying to reach high performance from the bench at a high line rate.
Maybe you can help us understand what happens in the following observations.

The raw traces are attached as text files.

(Sorry for resending; I hope my subscription works at last.)

--------------------------------------------------------------------------------------------

CONFIGURATION & SOFTWARE
************************

These reference tests are performed over a simple loopback link between the two ports of a NIC (either wired directly or through a hardware switch).
We studied the two following cases:

NIC Intel X520 (10 Gbit/s)
NIC Mellanox ConnectX-5 (25 Gbit/s)

Note that the list of CPU threads reserved for the generator is not optimized; there is room for tuning...
However, this was not the point of these tests since, as we will see, the performance issue appears to be related to the RX process.

[alec]
The reference NIC that we use is Intel X710, XL710 and XXV710 (10G, 25G and 
40G) and we have not seen any particular issues with rerunning benchmarks over 
and over again using the same TRex instance.
Intel X520 is clearly not a preferred NIC because it lacks many features only available in newer-generation NICs (offload).
I may be able to get hold of a Mellanox 25G NIC soon, and this report is definitely a good step towards optimizing on that NIC.


The T-Rex generator runs with the same settings whether it is used standalone or wrapped in NFVbench.

TRex version: v2.59
NFVbench version: 3.5.0 (3.5.1.dev1)

Warning: I found that NFVbench performance is improved - in the Mellanox tests - provided that a T-Rex server has been launched and stopped once since the last reboot.

Otherwise there is no way to obtain a 2 x 25 Gbit/s TX throughput, but only ~2 x 5.2 Gbit/s with Mellanox.
We still have to investigate this T-Rex issue, which is not addressed hereafter (some difference in initial module loading?).

[alec]
Reboot of the server that runs nfvbench?
Are you saying that once you launch and restart TRex, all future NFVbench runs 
(with the second instance of Trex) work better? Please explain this in detail 
(eg describe the exact steps in sequence that you performed).
Eg.
Reboot server
Start nfvbench for first time (will launch TRex for the first time) = poor 
performance (quantify)
Restart TRex
Run same benchmark = good performance (quantify)



We have slightly patched the NFVbench code in order to make some processing optional and/or configurable from the command line (some of it for debugging purposes).

--cores CMD_CORES       Override the T-Rex 'cores' parameter
--cache-size CACHE_SIZE Specify the FE cache size (default: 0, flow-count if < 0)
--service-mode          Enable T-Rex service mode
--extra-stats           Enable extra flow stats (on high-load traffic)
--no-latency-stats      Disable flow stats for latency traffic
--no-latency-streams    Disable latency measurements (no streams)
--ipdb-mask IPDB_MASK   Allow specific breakpoints for the ipdb debugger

TESTS
*****

We consider the smallest packet size (64 bytes L2) in order to assess the maximum achievable throughput.

Four tests are performed for each of the NICs tested - at 100% rate and at NDR.

1) Preliminary tests, performed with a basic scenario: 'pik.py' launched from a T-Rex console (derived from the 'bench.py' script shipped with the T-Rex application).

2) Tests performed using NFVbench (v3.5.0) with its native code:
     High-rate generic streams (for BW measurement) and low-rate streams (for latency assessment) are configured in the T-Rex generator so that stats can be computed on the transmitted/received streams.

However, we had actually left a hard-coded cache_size of 10000 in the STLScVmRaw() call in 'trex_gen.py' in all cases, even this one.



=>  This caching mode allows far better performance.

(In our code release, this value is now controlled by a command-line parameter.)

[alec]
Can you provide some numbers with/without cache?
I have not tested the cache size option for STLScVmRaw but this is definitely 
something worth trying on Intel NIC X710 family.


3) Tests performed using NFVbench where we disabled, in the 'trex_gen.py' script, the instructions that tag the generated traffic for the purpose of further statistics (as far as we understand it) - the change is made in the calls to STLStream().

[alec]
If you provide me the diff I can tell you what it does,


4) Tests performed using NFVbench where we keep the flow stats property for the 
latency streams only.

FIRST ANALYSIS
************

The T-Rex test allows us to check that there is no bottleneck on the generator/analyzer + SUT side.

The NFVbench results show acceptable performance only when dealing with the 10 Gbit/s line.

Using NFVbench with its unmodified behaviour (case 2), the line rate is far from being reached on a 50 Gbps line:


=>  8.56 Gbit/s instead of 50 Gbit/s (L1)

[alec]
Is that with/without cache, with/without restart?
As mentioned above, we have not seen such issue with Intel NIC (25G/40G).


Unless there are special reasons for activating heavy flow-stats RX processing, we suggest working in case (4).

We keep the latency assessment, while the traffic counters seem to suffice for measuring the BW performance.

Of course it may depend on the NIC's capabilities for offloading traffic measurements.
This is why we also make the flow-stats activation optional.

[alec]
we actually had a mode to disable latency completely at one point but decided 
to always leave it on as we did not see any negative side effect. We can 
certainly reinstate the option to disable latency for runs that do not care 
about latency and prioritize highest throughput.


FURTHER ANALYSIS
***************

However, looking closer at the performance obtained in case (3),
we can see a significantly reduced rate on the TX (and therefore RX) side:

19.09 Gbit/s instead of 20 Gbit/s (L1) - the value is stable between launches
48.42 Gbit/s instead of 50 Gbit/s (L1) - actually variable between launches

Note that the NDR measurement does not show any warning in that case.

This is not our target case but it made me think...

Thus, I tried a case (5) where we keep the flow stats for the BW streams only:

  - the TX packet rate is reduced, as in case (3), for the Intel X520
  - the throughput is only limited by the line rate for the Mellanox ConnectX-5

<=> this is unexpected with regard to our hypotheses.

[alec]
I would not use X520 to make any judgment because this NIC has shown to be hard 
to work with (to get consistent good numbers for all use cases).
But this observation seems to indicate that flow stats for latency streams are 
costly.
Flow stats are important in nfvbench because they allow us to measure exact 
packet accounting per chain. In case of drops we know exactly which chain(s) 
are dropping and in what direction.
So maybe something to work with TRex team to optimize.



CONCLUSION
***********

It looks like we are missing something in our understanding.
We are not sure that our workaround does not hide some side effects.
For now, we can use it for our present needs.


=>  At least we succeeded in proving that 2x10 and 2x25 Gbit/s line-rate performance can be achieved using NFVbench.

Looking forward to hearing from you.

[alec]
This is great investigative work for a NIC that I have not used.
What I would suggest is upstream the cache size option. Then I’ll be able to 
test it on the Intel NIC family.
Worth considering upstreaming the no latency option as well.

Thanks

  Alec



