Hi Alec,

Thanks for your response. Here are my answers to your questions and remarks. I didn't insert them inline but extracted the Q & A to keep the reading clear.
Goodbye,
Pierrick

[alec] The reference NICs that we use are Intel X710, XL710 and XXV710 (10G, 25G and 40G) and we have not seen any particular issues with rerunning benchmarks over and over again using the same TRex instance. Intel X520 is clearly not a preferred NIC because it misses lots of features only available in newer-generation NICs (offload). I may be able to get hold of a Mellanox 25G NIC soon and this report is definitely a good step towards optimizing on that NIC.

[pierrick] OK, the X520 NIC is not our reference card either... but I had no X710 card installed at the moment. I made some reference tests to look at possible common-mode software issues (i.e. issues that may be NIC-independent). As regards the Mellanox ConnectX-5 card, our present challenge was rather to get a high-rate generator using only one dual-port NIC, if possible. Actually we could not reach the same performance with an XXV710, even using the T-Rex server alone. With 64-byte packets, bi-directional: ConnectX-5 <=> 2 x 25 Gbits/s; XXV710 <=> ~2 x 12 Gbits/s with T-Rex (~2 x 9.5 Gbits/s with NFVbench 3.5.0).

* * *

[alec] Reboot of the server that runs nfvbench?

[pierrick] Yes! This is the only way I found to come back to the "bad" initial condition, i.e. the one where NFVbench shows poorer performance.

* * *

[alec] Are you saying that once you launch and restart TRex, all future NFVbench runs (with the second instance of TRex) work better? Please explain this in detail (e.g. describe the exact steps in sequence that you performed), e.g.:
- Reboot server
- Start nfvbench for the first time (will launch TRex for the first time) = poor performance (quantify)
- Restart TRex
- Run same benchmark = good performance (quantify)

[alec] Can you provide some numbers with/without cache? I have not tested the cache size option for STLScVmRaw but this is definitely something worth trying on the Intel X710 NIC family.
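For reference, here is a minimal sketch of where the cache_size value enters a TRex stream definition, assuming the TRex STL Python API (trex_stl_lib) and Scapy; the helper name, addresses and flow range below are illustrative only, not the nfvbench code:

    from scapy.all import Ether, IP, UDP
    from trex_stl_lib.api import (STLStream, STLPktBuilder, STLTXCont, STLScVmRaw,
                                  STLVmFlowVar, STLVmWrFlowVar, STLVmFixIpv4)

    def build_udp_stream(cache_size):
        # 64-byte L2 frame whose source IP cycles over 10,000 values
        base_pkt = Ether() / IP(src="16.0.0.1", dst="48.0.0.1") / UDP(sport=1025, dport=1025)
        pad = 'x' * max(0, 60 - len(base_pkt))   # 60B + 4B FCS = 64B on the wire

        vm = STLScVmRaw(
            [STLVmFlowVar(name="ip_src", min_value="16.0.0.1",
                          max_value="16.0.39.16", size=4, op="inc"),   # 10,000 addresses
             STLVmWrFlowVar(fv_name="ip_src", pkt_offset="IP.src"),
             STLVmFixIpv4(offset="IP")],
            # cache_size > 0 asks TRex to pre-build that many packet variations
            # instead of running the Field Engine for every transmitted packet
            cache_size=cache_size if cache_size > 0 else None)

        return STLStream(packet=STLPktBuilder(pkt=base_pkt / pad, vm=vm),
                         mode=STLTXCont())

The idea is that with cache_size equal to the flow count, the whole set of packet variations is built once and then replayed from the cache.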
[pierrick] Here are our test cases and results using a dual-port Mellanox ConnectX-5 NIC, with or without a single prior T-Rex server run/stop sequence:

1) - Fresh reboot
   - Perform four series of NFVbench tests

2) - Fresh reboot
   - Start a T-Rex server: standalone launch in the host, same release as the one embedded in the container (v2.59):
       /bin/bash ./t-rex-64 --no-scapy-server --iom 0 --cfg /etc/LLL.yaml -i -c 6
       ./_t-rex-64 --no-scapy-server --iom 0 --cfg /etc/LLL.yaml -i -c 6 --mlx4-so --mlx5-so
   - Stop the T-Rex server
   - Perform four series of NFVbench tests

The NFVbench tests are the following:

x.1 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --extra-stats --cache-size=0
x.2 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --extra-stats --cache-size=10000
x.3 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --cache-size=0
x.4 Cmdline: nfvbench -c LLL.cfg --rate=100% --duration=30 --interval=1 --flow-count=10000 -fs=64 -scc=1 --cores=6 --cache-size=10000

Cases 1.1 and 2.1 correspond to the default NFVbench v3.5.0 behaviour.

+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| case | cache | flow stats | Requested TX Rate (bps) | Actual TX Rate (bps) | RX Rate (bps) | Requested TX Rate (pps) | Actual TX Rate (pps) | RX Rate (pps)  | variability      |
+======+=======+============+=========================+======================+===============+=========================+======================+================+==================+
| 1.1  | 0     | all        | 50.0000 Gbps            | 11.4679 Gbps         | 11.4679 Gbps  | 74,404,760 pps          | 17,065,302 pps       | 17,065,302 pps | +/- 0.5 Gbits/s  |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 1.2  | 10000 | all        | 50.0000 Gbps            | 10.3619 Gbps         | 10.3619 Gbps  | 74,404,760 pps          | 15,419,544 pps       | 15,419,544 pps | +/- 0.5 Gbits/s  |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 1.3  | 0     | latency    | 50.0000 Gbps            | 24.5055 Gbps         | 24.5055 Gbps  | 74,404,760 pps          | 36,466,471 pps       | 36,466,471 pps | +/- 0.01 Gbits/s |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 1.4  | 10000 | latency    | 50.0000 Gbps            | 50.0000 Gbps         | 50.0000 Gbps  | 74,404,760 pps          | 74,404,762 pps       | 74,404,762 pps | +/- 0 Gbits/s    |
+======+=======+============+=========================+======================+===============+=========================+======================+================+==================+
| 2.1  | 0     | all        | 50.0000 Gbps            | 22.5319 Gbps         | 13.9766 Gbps  | 74,404,760 pps          | 33,529,639 pps       | 20,798,442 pps | +/- 0.3 Gbits/s  |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 2.2  | 10000 | all        | 50.0000 Gbps            | 50.0000 Gbps         | 8.4314 Gbps   | 74,404,760 pps          | 74,404,762 pps       | 12,546,747 pps | +/- 0.2 Gbits/s  |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 2.3  | 0     | latency    | 50.0000 Gbps            | 24.3776 Gbps         | 24.3776 Gbps  | 74,404,760 pps          | 36,276,261 pps       | 36,276,261 pps | +/- 0.2 Gbits/s  |
+------+-------+------------+-------------------------+----------------------+---------------+-------------------------+----------------------+----------------+------------------+
| 2.4  | 10000 | latency    | 50.0000 Gbps            | 50.0000 Gbps         | 50.0000 Gbps  | 74,404,760 pps          | 74,404,762 pps       | 74,404,762 pps | +/- 0 Gbits/s    |
+======+=======+============+=========================+======================+===============+=========================+======================+================+==================+

* * *

[alec] If you provide me the diff I can tell you what it does.

[pierrick] In trex_gen.py:

    streams.append(STLStream(packet=pkt,
                             flow_stats=STLFlowStats(pg_id=pg_id),
                             mode=STLTXCont()))

is changed to use a conditional expression:

    streams.append(STLStream(packet=pkt,
                             flow_stats=STLFlowStats(pg_id=pg_id) if self.config.extra_stats else None,
                             mode=STLTXCont()))

* * *

[alec] Is that with/without cache, with/without restart? As mentioned above, we have not seen such an issue with Intel NICs (25G/40G).

[pierrick] See the results reported above.

* * *

[alec] We actually had a mode to disable latency completely at one point but decided to always leave it on as we did not see any negative side effect. We can certainly reinstate the option to disable latency for runs that do not care about latency and prioritize highest throughput.

[pierrick] I did it, for testing purposes too. I assume, as you do, that the latency measurement in itself should not impair throughput performance (latency packets are sent at a low rate).

* * *

[alec] I would not use X520 to make any judgment because this NIC has proven hard to work with (to get consistently good numbers for all use cases). But this observation seems to indicate that flow stats for latency streams are costly. Flow stats are important in nfvbench because they allow us to measure exact packet accounting per chain. In case of drops we know exactly which chain(s) are dropping and in what direction. So maybe something to work on with the TRex team to optimize.

[pierrick] Cf. my previous answer. Actually we don't really care about the X520 NIC either. Our study was mainly based on the Mellanox ConnectX-5 device, since we had temporarily set aside the Intel XXV710 because of its limited hardware performance (see https://fast.dpdk.org/doc/perf/DPDK_19_05_Intel_NIC_performance_report.pdf): one dual-port NIC cannot handle the required 74.4 Mpkts/s rate (the maximum is only 36.7 Mpkts/s). As soon as we have deployed a new 25 Gbits/s based infrastructure, we will come back to trying the recommended XXV710, but we don't expect much from it on the traffic generator side. We understand that flow stats can be valuable, provided they don't introduce a bottleneck. Anyway, we were puzzled by some observations showing good performance even though the high-rate flow stats were active, as long as they were the only ones enabled. So you're right, there might be room for software optimization.

* * *

[alec] This is great investigative work for a NIC that I have not used. What I would suggest is to upstream the cache size option. Then I'll be able to test it on the Intel NIC family. Worth considering upstreaming the no-latency option as well.

[pierrick] I am about to push the modifications I made to the code, as mentioned above (see the sketch below).
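For illustration only, here is a minimal sketch of how options such as extra_stats, no_latency_stats and no_latency_streams can gate the flow-stats objects attached to the TRex streams. This is not the exact nfvbench trex_gen.py code; the helper name, the latency pg_id offset and the latency stream rate are assumptions:

    from trex_stl_lib.api import (STLStream, STLTXCont, STLFlowStats,
                                  STLFlowLatencyStats)

    def create_chain_streams(pkt, lat_pkt, pg_id, config):
        # High-rate stream: per-chain packet accounting only if extra stats are requested
        streams = [
            STLStream(packet=pkt,
                      flow_stats=STLFlowStats(pg_id=pg_id) if config.extra_stats else None,
                      mode=STLTXCont())
        ]
        # Low-rate latency stream, optionally without flow stats, or not created at all
        if not config.no_latency_streams:
            lat_stats = (None if config.no_latency_stats
                         else STLFlowLatencyStats(pg_id=pg_id + 1000))  # offset is illustrative
            streams.append(STLStream(packet=lat_pkt,
                                     flow_stats=lat_stats,
                                     mode=STLTXCont(pps=1000)))         # low fixed rate, illustrative
        return streams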
I didn't make any structural changes but added some facilities for investigation.

--------------------------------------------------------------------------------
orange labs <http://www.orange.com/>
ORANGE/IMT/OLN/CISS/IIE/AMI
Pierrick LOUIN
4, rue du Clos Courtel - BP 91226
35512 CESSON SEVIGNÉ cédex
tél : +33 2 99 12 48 23
mob : +33 6 43 33 01 04
e-mail : pierrick.lo...@orange.com
--------------------------------------------------------------------------------

From: Alec Hothan (ahothan) [mailto:ahot...@cisco.com]
Sent: Thursday, 1 August 2019 18:06
To: LOUIN Pierrick TGI/OLN
Cc: opnfv-tech-discuss@lists.opnfv.org; Pedersen Michael S
Subject: Re: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances using NFVbench - UPDATED

Hi Pierrick,

Thanks for sending this detailed report. I think the anomalies you are seeing may be due to the choice of NIC. Not that the Mellanox NIC is bad, but it is a NIC that not many people have tested NFVbench with yet. I am copying Michael as he may have used this same Mellanox NIC with NFVbench in a CNCF benchmarking project (VM and container benchmarking with k8s).

More inline...

From: <opnfv-tech-discuss@lists.opnfv.org> on behalf of "Pierrick Louin via Lists.Opnfv.Org" <pierrick.louin=orange....@lists.opnfv.org>
Reply-To: "pierrick.lo...@orange.com" <pierrick.lo...@orange.com>
Date: Wednesday, July 31, 2019 at 9:03 AM
To: "Alec Hothan (ahothan)" <ahot...@cisco.com>
Cc: "opnfv-tech-discuss@lists.opnfv.org" <opnfv-tech-discuss@lists.opnfv.org>
Subject: [opnfv-tech-discuss] #nfvbench - Looking for line rate performances using NFVbench - UPDATED

Hi Alec,

I'm Pierrick Louin, working in the team of François-Régis Menguy at Orange Labs. In our latest experiments using the NFVbench application, we came across some issues, and a workaround, while trying to reach high performance from the bench at a high line rate. Maybe you can help us understand what happens in the following observations. The raw traces are attached as text files.
(Sorry for resending; I hope my subscription works at last.)

--------------------------------------------------------------------------------------------
CONFIGURATION & SOFTWARE
************************
These reference tests are performed over a simple loopback link between the 2 ports of a NIC (either wired or through a hardware switch). We studied the two following cases:
- NIC Intel X520 (10 Gbits/s)
- NIC Mellanox ConnectX-5 (25 Gbits/s)
Note that the list of CPU threads reserved for the generator is not optimized. There is room for tuning... However this was not the point of these tests since, as we will see, the performance issue appears to be related to the RX process.

[alec] The reference NICs that we use are Intel X710, XL710 and XXV710 (10G, 25G and 40G) and we have not seen any particular issues with rerunning benchmarks over and over again using the same TRex instance. Intel X520 is clearly not a preferred NIC because it misses lots of features only available in newer-generation NICs (offload). I may be able to get hold of a Mellanox 25G NIC soon and this report is definitely a good step towards optimizing on that NIC.

The T-Rex generator runs with the same settings whether it is used standalone or wrapped in NFVbench.
TRex version: v2.59
NFVbench version: 3.5.0 (3.5.1.dev1)

Warning: I found that the NFVbench performance is improved - in the Mellanox tests - provided that a T-Rex server has been launched and stopped once since the last reboot.
Otherwise there is no way to obtain a 2 x 25 Gbits/s TX throughput, but only ~2 x 5.2 Gbits/s with the Mellanox NIC. We still have to investigate this T-Rex issue, which is not addressed hereafter (some difference in initial module loading?).

[alec] Reboot of the server that runs nfvbench? Are you saying that once you launch and restart TRex, all future NFVbench runs (with the second instance of TRex) work better? Please explain this in detail (e.g. describe the exact steps in sequence that you performed), e.g.:
- Reboot server
- Start nfvbench for the first time (will launch TRex for the first time) = poor performance (quantify)
- Restart TRex
- Run same benchmark = good performance (quantify)

We have slightly patched the NFVbench code in order to make some processing optional and/or configurable from the command line (some of it for debugging purposes):

  --cores CMD_CORES        Override the T-Rex 'cores' parameter
  --cache-size CACHE_SIZE  Specify the FE cache size (default: 0, flow-count if < 0)
  --service-mode           Enable T-Rex service mode
  --extra-stats            Enable extra flow stats (on high load traffic)
  --no-latency-stats       Disable flow stats for latency traffic
  --no-latency-streams     Disable latency measurements (no streams)
  --ipdb-mask IPDB_MASK    Allow specific breakpoints for the ipdb debugger

TESTS
*****
We consider the smallest packet size (64 bytes L2) in order to assess the maximum achievable throughput. Four tests are performed for each of the NICs tested, at 100% rate and at the NDR rate.

1) Preliminary tests, performed with a basic scenario: 'pik.py' launched from a T-Rex console (derived from the 'bench.py' script shipped with the T-Rex application).

2) Tests performed using NFVbench (v3.5.0) with its native code: high-rate generic streams (for BW measurement) and low-rate streams (for latency assessment) are configured in the T-Rex generator in order to allow stats computation on the transmitted/received streams. Actually, we had left a hard-coded cache_size of 10000 in the call to STLScVmRaw() in 'trex_gen.py' in all cases, including this one. => This caching mode allows far better performance. (In our code release, we now control this value from a command-line parameter.)

[alec] Can you provide some numbers with/without cache? I have not tested the cache size option for STLScVmRaw but this is definitely something worth trying on the Intel X710 NIC family.

3) Tests performed using NFVbench where we have disabled, in the 'trex_gen.py' script, the instructions that tag the generated traffic for the purpose of further statistics (as far as we understand it) - this change is made in the calls to STLStream().

[alec] If you provide me the diff I can tell you what it does.

4) Tests performed using NFVbench where we keep the flow stats property for the latency streams only.

FIRST ANALYSIS
**************
The T-Rex test allows us to check that we have no bottleneck on the generator/analyzer + SUT side. The NFVbench results show acceptable performance only when dealing with the 10 Gbits/s line. Using NFVbench with its unmodified behaviour (case 2), the line rate is far from being reached on a 50 Gbps line:
=> 8.56 Gbits/s instead of 50 Gbits/s (L1)

[alec] Is that with/without cache, with/without restart? As mentioned above, we have not seen such an issue with Intel NICs (25G/40G).

Unless there are special reasons for activating heavy flow-stats RX processing, we suggest working in case (4): we keep the latency assessment, while the traffic counters seem sufficient to measure the BW performance (a sketch of this measurement setup is given below).
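A minimal sketch, assuming the TRex STL client API (STLClient), of how case (4) can be measured: throughput is taken from the plain per-port counters returned by get_stats(), while only the low-rate latency streams carry STLFlowLatencyStats. The pg_id value and the exact stats field names are illustrative and should be checked against the TRex version in use:

    from trex_stl_lib.api import STLClient

    def run_case4(duration=30, lat_pg_id=1000):
        c = STLClient()   # assumes the case (4) streams are already attached to ports 0/1
        c.connect()
        try:
            c.start(ports=[0, 1], mult="100%", duration=duration)
            c.wait_on_traffic(ports=[0, 1])
            stats = c.get_stats()

            # Throughput from port counters: no per-flow RX processing involved
            tx_pkts = stats[0]['opackets'] + stats[1]['opackets']
            rx_pkts = stats[0]['ipackets'] + stats[1]['ipackets']
            print("TX: %.0f pps   RX: %.0f pps" % (tx_pkts / duration, rx_pkts / duration))

            # Latency is still reported, from the latency streams only
            lat = stats['latency'].get(lat_pg_id, {}).get('latency', {})
            print("average latency (usec):", lat.get('average'))
        finally:
            c.disconnect()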
Of course it may depend on the NIC capabilities for offloading traffic measurements. This is why we also make the flow-stats activation optional.

[alec] We actually had a mode to disable latency completely at one point but decided to always leave it on as we did not see any negative side effect. We can certainly reinstate the option to disable latency for runs that do not care about latency and prioritize highest throughput.

FURTHER ANALYSIS
****************
However, looking closer at the performance obtained in case (3), we can see a significantly reduced rate on the TX (and therefore RX) side:
- 19.09 Gbits/s instead of 20 Gbits/s (L1) - value is stable between launches
- 48.42 Gbits/s instead of 50 Gbits/s (L1) - actually variable between launches
Note that the NDR measurement does not show any warning in that case. This is not our target case but it made me think... Thus, I tried a case (5) where we keep the flow stats for the BW streams only:
- the TX packet rate is reduced as in case (3) for the Intel X520
- the throughput is only limited by the line rate for the Mellanox ConnectX-5 <=> this is unexpected with regard to our hypotheses.

[alec] I would not use X520 to make any judgment because this NIC has proven hard to work with (to get consistently good numbers for all use cases). But this observation seems to indicate that flow stats for latency streams are costly. Flow stats are important in nfvbench because they allow us to measure exact packet accounting per chain. In case of drops we know exactly which chain(s) are dropping and in what direction. So maybe something to work on with the TRex team to optimize.

CONCLUSION
**********
It looks like we are missing something in our understanding. We are not sure that our workaround does not hide some side effects. So far, we can use it for our present needs.
=> At least we succeeded in proving that 2 x 10 and 2 x 25 Gbits/s line-rate performance can be achieved using NFVbench.
Looking forward to hearing from you.

[alec] This is great investigative work for a NIC that I have not used. What I would suggest is to upstream the cache size option. Then I'll be able to test it on the Intel NIC family. Worth considering upstreaming the no-latency option as well.

Thanks

Alec