Dear Rubina,

I have adjusted my trex config to match yours (increased 2m to 4m), but that
still didn't change much.
The only thing in your config that could be adapted is to use "deny" instead of
"drop" in the ACL configuration - right now one of your ACLs does not have any
rules, and it is best to avoid that configuration.

*However*, I have spotted another difference between our setups, which turns
out to be important: in your "show acl-plugin sessions" the session
creation/deletion is concentrated on a single worker, whereas in my case it is
handled pretty evenly by two of them.

This is my distribution of the workers to the interfaces during the test run
(short timeouts, the total session count hovering just under 300K):

vpp# show int rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TenGigabitEthernet81/0/0 queue 0 (polling)
Thread 2 (vpp_wk_1):
  node dpdk-input:
    TenGigabitEthernet81/0/1 queue 0 (polling)
vpp#

If I make this change:

vpp# set interface rx-placement TenGigabitEthernet81/0/1 worker 0
vpp# show int rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TenGigabitEthernet81/0/0 queue 0 (polling)
    TenGigabitEthernet81/0/1 queue 0 (polling)
vpp#

then the session count climbs to 1m relatively quickly, and when it reaches 1m
we stop creating new sessions and stop forwarding traffic for the new sessions.
The connections still do get cleaned up periodically, but the cleanup rate is
too slow. In my setup I run both the t-rex and VPP on the same machine, on
different cores - so I would expect that if you have two different machines,
this overload effect is even more pronounced.

When I stop the traffic, the session count slowly goes down to 0, and if I
change the rx-placement of the interfaces back to what it was before, then I
can again run the test successfully for a long time. So it seems the cleaner
node does not cope when there is higher load.

The way I intended the cleanup mechanism to work is to have up to a fixed
number of connections cleaned up by the workers during the (variable) interrupt
interval, and if we need to clean up more, the interrupt interval is reduced by
half - a TCP-style feedback loop, basically, which should self-balance at
around the current connection cleanup rate (see the illustrative sketch below).

To investigate further I tried this change:

diff --git a/src/plugins/acl/acl.h b/src/plugins/acl/acl.h
index 07ed868..98291c5 100644
--- a/src/plugins/acl/acl.h
+++ b/src/plugins/acl/acl.h
@@ -242,7 +242,7 @@ typedef struct {
    * of connections, it halves the sleep time.
    */

-#define ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL 100
+#define ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL 10000
   u64 fa_max_deleted_sessions_per_interval;

  /*

And the previously "stuck" tests worked fine - the sessions were being cleaned
up, and the session count was again hovering at about 300K, even with both
interfaces serviced by the same worker. So it seems that either my
interrupt-sending code is misbehaving, or the worker threads under load do not
get enough interrupts.

Assuming it is the same issue you are seeing, could you please do the
following:

0) run the multicore test (short timeout), which will fail, then stop the
traffic and wait for the session count to go to 0

1) rebalance the interfaces onto different workers as above (this merely lowers
the per-worker load, hence the multicore test)

2) retry the multicore test. It may or may not run successfully - it would be
interesting to know the result.

3) apply the diff above, rebuild vpp and repeat steps 0-2.

This will help us ensure that I have indeed reproduced the same issue as seen
in your setup.

Thanks a lot!
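For illustration, here is a minimal sketch of the self-balancing interval
adjustment described above. This is NOT the actual acl-plugin code: the type
and function names, the default/minimum interval values and the "grow back"
policy are assumptions made purely for illustration; only the per-interval cap
corresponds to the ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL knob from
the diff above.

/* Illustrative sketch only - not the acl-plugin implementation. */
#include <stdint.h>

#define DEFAULT_INTERVAL_USEC 500000	/* nominal half-second wakeup (assumed) */
#define MIN_INTERVAL_USEC     1000	/* floor so the cleaner does not spin (assumed) */

typedef struct
{
  uint64_t interval_usec;		/* current sleep between cleanup passes */
  uint64_t max_deleted_per_interval;	/* cf. fa_max_deleted_sessions_per_interval */
} cleaner_state_t;

/* Called after each cleanup pass with the number of sessions it deleted. */
static void
cleaner_adjust_interval (cleaner_state_t * st, uint64_t deleted_this_pass)
{
  if (deleted_this_pass >= st->max_deleted_per_interval)
    {
      /* Hit the per-pass cap: halve the sleep time so the cleaner runs
         more often - TCP-style multiplicative decrease under load. */
      st->interval_usec /= 2;
      if (st->interval_usec < MIN_INTERVAL_USEC)
	st->interval_usec = MIN_INTERVAL_USEC;
    }
  else
    {
      /* Load is manageable: drift back toward the default interval. */
      st->interval_usec += (DEFAULT_INTERVAL_USEC - st->interval_usec) / 4;
      if (st->interval_usec > DEFAULT_INTERVAL_USEC)
	st->interval_usec = DEFAULT_INTERVAL_USEC;
    }
}

The multiplicative decrease on overload is the part the diff above tunes
indirectly: with a larger per-interval cap, the cleaner has to halve its
interval far less often to keep up with the same deletion rate.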
--a

On 3/13/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> Dear Andrew
>
> My Trex config is uploaded; I also tested the scenario with your Trex
> config.
> The stability of vpp in your run is strange. When I run this scenario, vpp
> crashes on my DUT machine after about 200 seconds of running Trex.
> In this period I see #del sessions is 0 until the session pool becomes
> full; after that, session deletion starts, but its rate is lower than the
> one I see when I run vpp on a single core.
>
> Could you please check my configs once again for any misconfiguration?
> Is vpp or dpdk compatible or incompatible with any specified device?
>
> Thanks,
> Sincerely
>
> Sent from Outlook<http://aka.ms/weboutlook>
> ________________________________
> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> Sent: Monday, March 12, 2018 1:50 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Freezing Session Deletion Operation
>
> Dear Rubina,
>
> I've tried the test locally using the data that you sent; here is the
> output from my trex after 10 minutes of running:
>
> -Per port stats table
>       ports |               0 |               1
>  -----------------------------------------------------------------------------------------
>    opackets |       312605970 |       312191927
>      obytes |    100919855857 |    174147108346
>    ipackets |       311329098 |       277120788
>      ibytes |    173666531289 |     76492053900
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       1.17 Gbps |       2.01 Gbps
>
> -Global stats enabled
>  Cpu Utilization : 21.2 %  30.0 Gb/core
>  Platform_factor : 1.0
>  Total-Tx        :       3.18 Gbps
>  Total-Rx        :       2.89 Gbps
>  Total-PPS       :     901.93 Kpps
>  Total-CPS       :      13.52 Kcps
>
>  Expected-PPS    :     901.92 Kpps
>  Expected-CPS    :      13.53 Kcps
>  Expected-BPS    :       3.18 Gbps
>
>  Active-flows    :     8883  Clients :   255  Socket-util : 0.0553 %
>  Open-flows      :  9425526  Servers : 65535  Socket : 8883  Socket/Clients : 34.8
>  drop-rate       :       0.00 bps
>  current time    : 702.8 sec
>  test duration   : 2897.2 sec
>
> So, in my setup it worked - I could not see the behavior you describe...
>
> But there is at least one more thing that may be different between our
> setups - the trex config.
>
> Here is what mine looks like:
>
> - version: 2
>   interfaces: ['03:00.0', '03:00.1']
>   port_limit: 2
>   memory:
>       dp_flows: 2000000
>   port_info:
>       - ip: 1.1.1.1
>         default_gw: 1.1.1.2
>       - ip: 1.1.1.2
>         default_gw: 1.1.1.1
>
> Could you send me your trex config, so that I could try it locally and see
> if that might be the difference between our setups?
>
> Thanks!
>
> --a
>
> On 3/12/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>> Hi Dear Andrew
>>
>> I repeated my scenarios once again with short timeouts and uploaded all
>> configs and outputs for your consideration.
>> It is clear to me that the session cleaner process doesn't work properly
>> and my Trex throughput is stuck at 0.
>> Please repeat this scenario to verify it (unfortunately vpp is only
>> stable for 200 seconds, and after that vpp will be down).
>>
>> Thanks,
>> Sincerely
>>
>> Sent from Outlook<http://aka.ms/weboutlook>
>> ________________________________
>> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
>> Sent: Sunday, March 11, 2018 3:48 PM
>> To: Rubina Bianchi
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev] Freezing Session Deletion Operation
>>
>> Hi Rubina,
>>
>> I am assuming you are observing this both in the single core and the
>> multicore scenario?
>>
>> Based on the outputs, this is what I think might be going on:
>>
>> I am seeing that the total # of sessions is 1000000, and there are no
>> TCP transient sessions - thus the packets that require a session are
>> dropped.
>>
>> What is a bit peculiar is that the per-worker session delete counts are
>> non-zero, yet the delete counters are zero. To me this indicates there
>> was a fair number of transient sessions, which then got recycled by the
>> properly established TCP sessions before the idle timeout expired.
>>
>> And at the moment of taking the show command output the connection
>> cleaner activity had not yet kicked in - I do not see any session
>> deleted by idle timeout, nor its timer restarted. Which makes me think
>> that the time interval in which you are testing must be relatively
>> short...
>>
>> So, assuming the time between the start of the traffic and the time you
>> have 1m sessions is quite short, this is simply using up all of the
>> connection pool - a classic inherent resource management issue with any
>> stateful scenario.
>>
>> You can verify that the sessions get deleted and start building up
>> again if you issue "clear acl-plugin sessions".
>>
>> Also, changing the session timeouts to more aggressive values (say, 10
>> seconds) should kick off the aggressive connection cleaning, and thus
>> should unlock this condition. Of course, a shorter idle time means
>> potentially useful connections get removed. (The commands are "set
>> acl-plugin session timeout <udp|tcp> idle <X>".)
>>
>> *If* neither of the above adequately describes what you are seeing, the
>> cleaner node may for whatever reason cease to kick in every half a
>> second.
>>
>> To see the dynamics of the conn cleaner node, you can use the debug
>> command "set acl-plugin session event-trace 1" before the start of the
>> test. This will produce the event trace, which you can view with "show
>> event-logger all" - this should give a reasonable idea of what the
>> cleaner node is up to.
>>
>> Please let me know.
>>
>> --a
>>
>> On 3/11/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>>> Hi,
>>>
>>> I am testing vpp_18.01.1-124~g302ef0b5 (commit:
>>> 696e0da1cde7e2bcda5a781f9ad286672de8dcbf) and
>>> vpp_18.04-rc0~322-g30787378
>>> (commit: 30787378f49714319e75437b347b7027a949700d) using Trex with the
>>> sfr scenario in single-core and multicore configurations.
>>> After a while I saw the session deletion rate decrease and vpp
>>> throughput become 0 bps.
>>> All configuration files and outputs are attached.
>>>
>>> Thanks,
>>> Sincerely
>>>
>>> Sent from Outlook<http://aka.ms/weboutlook>
>>>
>>
>