If you can handle the traffic with a single worker thread, then all multi-worker issues go away. But the congestion drops show up easily with as few as two workers, due to the infra limitation described below.

Regards,
Klement
> On 13 Nov 2020, at 18:41, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>
> Thanks, so you see reducing the number of VPP threads as an option to work around this issue, since you would probably increase the vector rate per thread?
>
> Best Regards
>
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement Sekera via lists.fd.io
> Sent: Friday, 13 November 2020 14:26
> To: Marcos - Mgiga <mar...@mgiga.com.br>
> Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev <vpp-dev@lists.fd.io>
> Subject: Re: RES: RES: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?
>
> I used the usual
>
> 1. start traffic
> 2. clear run
> 3. wait n seconds (e.g. n == 10)
> 4. show run
>
> Klement
>
>> On 13 Nov 2020, at 18:21, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>>
>> Understood. And what path did you take in order to analyse and monitor vector rates? Is there some specific command or log?
>>
>> Thanks
>>
>> Marcos
>>
>> -----Original Message-----
>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of ksekera via lists.fd.io
>> Sent: Friday, 13 November 2020 14:02
>> To: Marcos - Mgiga <mar...@mgiga.com.br>
>> Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev@lists.fd.io
>> Subject: Re: RES: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?
>>
>> Not completely idle, more like medium load. Vector rates at which I saw congestion drops were roughly 40 for the thread doing no work (just handoffs - I hardcoded it this way for test purposes), and roughly 100 for the thread picking up the packets and doing NAT.
>>
>> What got me into the infra investigation was the fact that once I was hitting vector rates around 255, I did see packet drops, but no congestion drops.
>>
>> HTH,
>> Klement
>>
>>> On 13 Nov 2020, at 17:51, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>>>
>>> So you mean that this situation (congestion drops) is more likely to occur when the system is mostly idle than when it is processing a large amount of traffic?
>>>
>>> Best Regards
>>>
>>> Marcos
>>>
>>> -----Original Message-----
>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement Sekera via lists.fd.io
>>> Sent: Friday, 13 November 2020 12:15
>>> To: Elias Rudberg <elias.rudb...@bahnhof.net>
>>> Cc: vpp-dev@lists.fd.io
>>> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?
>>>
>>> Hi Elias,
>>>
>>> I’ve already debugged this and came to the conclusion that it’s the infra which is the weak link. I was seeing congestion drops at mild load, but not at full load. The issue is that with handoff, there is an uneven workload. For simplicity’s sake, just consider thread 1 handing off all the traffic to thread 2. What happens is that for thread 1 the job is much easier: it just does some ip4 parsing and then hands the packet to thread 2, which actually does the heavy lifting of hash inserts/lookups/translation etc. A 64-element queue can hold 64 frames; one extreme is 64 1-packet frames, totalling 64 packets, the other extreme is 64 255-packet frames, totalling ~16k packets. What happens is this: thread 1 is mostly idle, just picking a few packets from the NIC, and every one of these small frames creates an entry in the handoff queue. Now thread 2 picks one element from the handoff queue and deals with it before picking another one. If the queue has only 3-packet or 10-packet elements, then thread 2 can never really get into what VPP excels at - bulk processing.
>>>
>>> Q: Why doesn’t it pick as many packets as possible from the handoff queue?
>>> A: It’s not implemented.
>>>
>>> I already wrote a patch for it, which made all the congestion drops which I saw (in the above synthetic test case) disappear. The mentioned patch https://gerrit.fd.io/r/c/vpp/+/28980 is sitting in gerrit.
>>>
>>> Would you like to give it a try and see if it helps your issue? We shouldn’t need big queues under mild loads anyway …
>>>
>>> Regards,
>>> Klement
>>>
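To make the arithmetic above concrete, here is a minimal standalone C sketch of the situation Klement describes: a fixed-size handoff queue whose elements are frames of 1-255 packets, a producer that registers a congestion drop when the queue is full, and a consumer that takes one frame at a time. It is purely illustrative and is not the actual VPP frame-queue code; the names (handoff_queue_t, enqueue_frame, dequeue_frame) are invented for this sketch, and only the constants 64 and 255 come from the discussion above.

  /* Illustrative only - NOT the VPP frame-queue implementation. */
  #include <stdio.h>

  #define FQ_NELTS       64   /* queue depth, as with NAT_FQ_NELTS */
  #define MAX_FRAME_SIZE 255  /* maximum packets carried by one frame */

  typedef struct { int n_packets; } frame_t;

  typedef struct
  {
    frame_t elts[FQ_NELTS];
    int head, tail, count;
    long congestion_drops;    /* packets dropped because the queue was full */
  } handoff_queue_t;

  /* Producer side (thread 1): hand one frame to thread 2. */
  static void enqueue_frame (handoff_queue_t * q, int n_packets)
  {
    if (q->count == FQ_NELTS)
      {
        /* Queue full - this is what shows up as a "congestion drop". */
        q->congestion_drops += n_packets;
        return;
      }
    q->elts[q->tail].n_packets = n_packets;
    q->tail = (q->tail + 1) % FQ_NELTS;
    q->count++;
  }

  /* Consumer side (thread 2): one frame is picked and fully processed
     before the queue is looked at again, so with tiny frames the
     consumer never gets a big vector to work on. */
  static int dequeue_frame (handoff_queue_t * q)
  {
    if (q->count == 0)
      return 0;
    int n = q->elts[q->head].n_packets;
    q->head = (q->head + 1) % FQ_NELTS;
    q->count--;
    return n;                 /* vector size thread 2 gets to process */
  }

  int main (void)
  {
    /* The two capacity extremes mentioned above. */
    printf ("64 x 1-packet frames:   %d packets\n", FQ_NELTS * 1);
    printf ("64 x 255-packet frames: %d packets (~16k)\n",
            FQ_NELTS * MAX_FRAME_SIZE);

    /* Mild load: thread 1 keeps handing off tiny 3-packet frames while
       thread 2 is busy, so the 64 slots fill up long before the queue
       holds many packets, and further handoffs are counted as drops. */
    handoff_queue_t q = { 0 };
    for (int i = 0; i < 100; i++)
      enqueue_frame (&q, 3);
    printf ("queued frames: %d, queued packets: %d, congestion drops: %ld\n",
            q.count, q.count * 3, q.congestion_drops);

    while (dequeue_frame (&q) > 0)
      ;                       /* thread 2 drains the queue frame by frame */
    return 0;
  }

With 100 3-packet frames offered, 64 are queued (192 packets) and the remaining 36 frames are counted as congestion drops, even though the queue never holds more than a small fraction of the ~16k packets it could carry with full frames.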
>>>> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>>>>
>>>> Hello VPP experts,
>>>>
>>>> We are using VPP for NAT44 and we get some "congestion drops", in a situation where we think VPP is far from overloaded in general. So we started to investigate whether it would help to use a larger handoff frame queue size. In theory at least, allowing a longer queue could help avoid drops in case of short spikes of traffic, or if some worker thread happens to be temporarily busy for whatever reason.
>>>>
>>>> The NAT worker handoff frame queue size is hard-coded in the NAT_FQ_NELTS macro in src/plugins/nat/nat.h, where the current value is 64. The idea is that putting a larger value there could help.
>>>>
>>>> We have run some tests where we changed the NAT_FQ_NELTS value from 64 to a range of other values, each time rebuilding VPP and running an identical test - a test case that to some extent tries to mimic our real traffic, although of course it is simplified. The test runs many iperf3 tests simultaneously using TCP, combined with some UDP traffic chosen to trigger VPP to create more new sessions (to make the NAT "slowpath" happen more).
>>>>
>>>> The following NAT_FQ_NELTS values were tested:
>>>> 16
>>>> 32
>>>> 64 <-- current value
>>>> 128
>>>> 256
>>>> 512
>>>> 1024
>>>> 2048 <-- best performance in our tests
>>>> 4096
>>>> 8192
>>>> 16384
>>>> 32768
>>>> 65536
>>>> 131072
>>>>
>>>> In those tests, performance was very bad for the smallest NAT_FQ_NELTS values of 16 and 32, while values larger than 64 gave improved performance. The best results in terms of throughput were seen for NAT_FQ_NELTS=2048. For even larger values than that, we got reduced performance compared to the 2048 case.
>>>>
>>>> The tests were done for VPP 20.05 running on an Ubuntu 18.04 server with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards. The number of NAT threads was 8 in some of the tests and 4 in others.
>>>>
>>>> According to these tests, the effect of changing NAT_FQ_NELTS can be quite large. For example, for one test case chosen such that congestion drops were a significant problem, the throughput increased from about 43 to 90 Gbit/s, with the number of congestion drops per second reduced to about one third. In another kind of test, throughput increased by about 20% with congestion drops reduced to zero. Of course such results depend a lot on how the tests are constructed, but in any case it seems clear that the choice of NAT_FQ_NELTS value can be important and that increasing it would be good, at least for the kind of usage we have tested now.
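For reference, the change being tested boils down to a one-line edit of the macro mentioned above. The sketch below shows roughly what that edit looks like; the surrounding contents of src/plugins/nat/nat.h in VPP 20.05 are not reproduced here.

  /* src/plugins/nat/nat.h (VPP 20.05) - handoff frame queue depth.
     The shipped default is 64; 2048 gave the best throughput in the
     tests described above. */
  #define NAT_FQ_NELTS 2048   /* default: 64 */

Because the depth is a compile-time constant, every tested value requires rebuilding VPP.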
>>>> Based on the above, we are considering changing NAT_FQ_NELTS from 64 to a larger value and starting to try that in our production environment (so far we have only tried it in a test environment).
>>>>
>>>> Were there specific reasons for setting NAT_FQ_NELTS to 64?
>>>>
>>>> Are there some potential drawbacks or dangers of changing it to a larger value?
>>>>
>>>> Would you consider changing to a larger value in the official VPP code?
>>>>
>>>> Best regards,
>>>> Elias
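As a closing illustration, here is a small standalone C sketch of the idea behind the batched-dequeue patch Klement mentions (https://gerrit.fd.io/r/c/vpp/+/28980): instead of processing one handoff-queue element per pass, the consumer keeps pulling frames until it has roughly a full vector's worth of packets or the queue is empty. This is only a sketch of the concept under assumed names (fq_t, collect_batch), not the code in that gerrit change.

  #include <stdio.h>

  #define FQ_NELTS        64  /* handoff queue depth (NAT_FQ_NELTS default) */
  #define VLIB_FRAME_SIZE 256 /* mirrors VPP's default vector/frame size */

  typedef struct { int n_packets; } frame_t;

  typedef struct
  {
    frame_t elts[FQ_NELTS];
    int head, tail, count;
  } fq_t;

  static int fq_dequeue (fq_t * q)
  {
    if (q->count == 0)
      return 0;
    int n = q->elts[q->head].n_packets;
    q->head = (q->head + 1) % FQ_NELTS;
    q->count--;
    return n;
  }

  /* Keep pulling frames off the handoff queue until we have roughly a
     full vector's worth of packets (or the queue is empty), then process
     them as one batch - the bulk processing VPP is good at. A real
     implementation would have to respect the exact frame-size limit;
     this sketch ignores that detail. */
  static int collect_batch (fq_t * q)
  {
    int total = 0, n;
    while (total < VLIB_FRAME_SIZE && (n = fq_dequeue (q)) > 0)
      total += n;
    return total;             /* packets processed in one go */
  }

  int main (void)
  {
    /* Fill the queue with tiny 3-packet frames, as in the mild-load case. */
    fq_t q = { 0 };
    for (int i = 0; i < FQ_NELTS; i++)
      {
        q.elts[q.tail].n_packets = 3;
        q.tail = (q.tail + 1) % FQ_NELTS;
        q.count++;
      }
    printf ("one batched pass processes %d packets from %d tiny frames\n",
            collect_batch (&q), FQ_NELTS);
    return 0;
  }

With the one-frame-at-a-time behaviour described earlier in the thread, the same queue contents would instead be processed as 64 separate 3-packet vectors.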