RES: RES: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

Marcos - Mgiga Fri, 13 Nov 2020 09:22:27 -0800

Understood. How did you analyse and monitor the vector rates? Is there a 
specific command or log for that?


Thanks

Marcos

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of ksekera via []
Sent: Friday, 13 November 2020 14:02
To: Marcos - Mgiga <mar...@mgiga.com.br>
Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev@lists.fd.io
Subject: Re: RES: [vpp-dev] Increasing NAT worker handoff frame queue size 
NAT_FQ_NELTS to avoid congestion drops?

Not completely idle, more like medium load. The vector rates at which I saw 
congestion drops were roughly 40 for the thread doing no work (just handoffs; 
I hardcoded it this way for test purposes), and roughly 100 for the thread 
picking up the packets and doing NAT.

What got me into the infra investigation was the fact that once I was hitting 
vector rates around 255, I did see packet drops, but no congestion drops.

HTH,
Klement

> On 13 Nov 2020, at 17:51, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
> 
> So you mean that this situation (congestion drops) is more likely to occur 
> when the system is mostly idle than when it is processing a large amount 
> of traffic?
> 
> Best Regards
> 
> Marcos
> 
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement 
> Sekera via lists.fd.io
> Sent: Friday, 13 November 2020 12:15
> To: Elias Rudberg <elias.rudb...@bahnhof.net>
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size 
> NAT_FQ_NELTS to avoid congestion drops?
> 
> Hi Elias,
> 
> I’ve already debugged this and came to the conclusion that it’s the infra 
> which is the weak link. I was seeing congestion drops at mild load, but not 
> at full load. The issue is that with handoff, the workload is uneven. For 
> simplicity’s sake, just consider thread 1 handing off all the traffic to 
> thread 2. For thread 1 the job is much easier: it just does some ip4 
> parsing and then hands the packet to thread 2, which actually does the 
> heavy lifting of hash inserts/lookups/translation etc.
> 
> A 64-element queue can hold 64 frames. One extreme is 64 1-packet frames, 
> totalling 64 packets; the other extreme is 64 255-packet frames, totalling 
> ~16k packets. What happens is this: thread 1 is mostly idle and just picks 
> a few packets from the NIC, and every one of these small frames creates an 
> entry in the handoff queue. Thread 2 then picks one element from the 
> handoff queue and deals with it before picking another one. If the queue 
> holds only 3-packet or 10-packet elements, thread 2 can never really get 
> into what VPP excels at: bulk processing.
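The arithmetic and the batching effect described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the logic of the VPP handoff code or of the Gerrit patch: the function names and the greedy merge strategy are hypothetical, and the only figures taken from the message are the 64-element queue and the 255-packet frame.

```python
QUEUE_ELEMENTS = 64      # NAT_FQ_NELTS value discussed in the thread
MAX_FRAME_PACKETS = 255  # largest frame size quoted in the message


def queue_capacity_packets(elements, packets_per_frame):
    """Packets held by a full queue when every frame has the same size."""
    return elements * packets_per_frame


def batches_one_frame_at_a_time(frame_sizes):
    """Consumer handles each queue element as its own batch."""
    return list(frame_sizes)


def batches_coalesced(frame_sizes, max_batch=MAX_FRAME_PACKETS):
    """Hypothetical consumer that greedily merges consecutive frames
    as long as the merged batch stays within max_batch packets."""
    batches = []
    current = 0
    for size in frame_sizes:
        # Close the current batch if this frame would overflow it.
        if current and current + size > max_batch:
            batches.append(current)
            current = 0
        current += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(queue_capacity_packets(QUEUE_ELEMENTS, 1))
    print(queue_capacity_packets(QUEUE_ELEMENTS, MAX_FRAME_PACKETS))
    small_frames = [10] * QUEUE_ELEMENTS
    print(len(batches_one_frame_at_a_time(small_frames)))
    print(batches_coalesced(small_frames))
```

With a queue full of 10-packet frames, the one-at-a-time consumer runs 64 small batches, while the merging consumer in this model runs three large ones over the same 640 packets.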
> 
> Q: Why doesn’t it pick as many packets as possible from the handoff queue? 
> A: It’s not implemented.
> 
> I already wrote a patch for it, which made all the congestion drops I saw 
> (in the synthetic test case above) disappear. The patch, 
> https://gerrit.fd.io/r/c/vpp/+/28980, is sitting in Gerrit.
> 
> Would you like to give it a try and see if it helps your issue? We 
> shouldn’t need big queues under mild loads anyway …
> 
> Regards,
> Klement
> 
>> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>> 
>> Hello VPP experts,
>> 
>> We are using VPP for NAT44 and we get some "congestion drops", in a 
>> situation where we think VPP is far from overloaded in general. Then 
>> we started to investigate if it would help to use a larger handoff 
>> frame queue size. In theory at least, allowing a longer queue could 
>> help avoid drops in case of short spikes of traffic, or if it 
>> happens that some worker thread is temporarily busy for whatever 
>> reason.
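The reasoning that a longer queue can absorb a short spike can be illustrated with a toy bounded-FIFO model. This is a sketch under simplifying assumptions (fixed service rate per tick, arrivals counted in frames, tail drop when the queue is full); the function name and the traffic numbers are made up for illustration and are not measurements from the tests in this thread.

```python
def count_drops(arrivals, service_per_tick, queue_limit):
    """Count frames dropped by a bounded FIFO.

    arrivals: frames arriving in each tick
    service_per_tick: frames the consumer removes per tick
    queue_limit: maximum number of queued frames (queue size)
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        # Accept only what fits; the rest is a congestion drop.
        accepted = min(arriving, queue_limit - queued)
        dropped += arriving - accepted
        queued += accepted
        # Consumer drains up to service_per_tick frames.
        queued -= min(queued, service_per_tick)
    return dropped


if __name__ == "__main__":
    # Steady light load with a single short spike in the middle.
    traffic = [10] * 5 + [200] + [10] * 5
    for limit in (64, 256):
        print(limit, count_drops(traffic, service_per_tick=20, queue_limit=limit))
```

In this model the average load is well below the service rate, yet the 64-frame queue drops 136 frames during the spike, while the 256-frame queue drops none.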
>> 
>> The NAT worker handoff frame queue size is hard-coded in the 
>> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value 
>> is 64. The idea is that putting a larger value there could help.
>> 
>> We have run some tests where we changed the NAT_FQ_NELTS value from 
>> 64 to a range of other values, each time rebuilding VPP and running 
>> an identical test case that to some extent tries to mimic our real 
>> traffic, although of course it is simplified. The test runs many 
>> iperf3 tests simultaneously using TCP, combined with some UDP traffic 
>> chosen to trigger VPP to create more new sessions (to make the NAT 
>> "slowpath" happen more).
>> 
>> The following NAT_FQ_NELTS values were tested:
>> 16
>> 32
>> 64  <-- current value
>> 128
>> 256
>> 512
>> 1024
>> 2048  <-- best performance in our tests
>> 4096
>> 8192
>> 16384
>> 32768
>> 65536
>> 131072
>> 
>> In those tests, performance was very bad for the smallest 
>> NAT_FQ_NELTS values of 16 and 32, while values larger than 64 gave 
>> improved performance. The best results in terms of throughput were 
>> seen for NAT_FQ_NELTS=2048. For even larger values than that, we got 
>> reduced performance compared to the 2048 case.
>> 
>> The tests were done for VPP 20.05 running on an Ubuntu 18.04 server 
>> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards. 
>> The number of NAT threads was 8 in some of the tests and 4 in some of 
>> the tests.
>> 
>> According to these tests, the effect of changing NAT_FQ_NELTS can be 
>> quite large. For example, for one test case chosen such that 
>> congestion drops were a significant problem, the throughput increased 
>> from about 43 to 90 Gbit/second with the number of congestion drops 
>> per second reduced to about one third. In another kind of test, 
>> throughput increased by about 20% with congestion drops reduced to 
>> zero. Of course such results depend a lot on how the tests are 
>> constructed. But anyway, it seems clear that the choice of 
>> NAT_FQ_NELTS value can be important and that increasing it would be 
>> good, at least for the kind of usage we have tested now.
>> 
>> Based on the above, we are considering changing NAT_FQ_NELTS from 64 
>> to a larger value and start trying that in our production environment 
>> (so far we have only tried it in a test environment).
>> 
>> Were there specific reasons for setting NAT_FQ_NELTS to 64?
>> 
>> Are there some potential drawbacks or dangers of changing it to a 
>> larger value?
>> 
>> Would you consider changing to a larger value in the official VPP 
>> code?
>> 
>> Best regards,
>> Elias
>> 
>> 
>> 
>> 
> 
>
