Understood. And what path did you take to analyse and monitor vector rates? Is there a specific command or log?

Thanks
Marcos

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of ksekera via []
Sent: Friday, 13 November 2020 14:02
To: Marcos - Mgiga <mar...@mgiga.com.br>
Cc: Elias Rudberg <elias.rudb...@bahnhof.net>; vpp-dev@lists.fd.io
Subject: Re: RES: [vpp-dev] Increasing NAT worker handoff frame queue size NAT_FQ_NELTS to avoid congestion drops?

Not completely idle, more like medium load. Vector rates at which I saw congestion drops were roughly 40 for thread doing no work (just handoffs - I hardcoded it this way for test purpose), and roughly 100 for thread picking the packets doing NAT.

What got me into infra investigation was the fact that once I was hitting vector rates around 255, I did see packet drops, but no congestion drops.

HTH,
Klement

> On 13 Nov 2020, at 17:51, Marcos - Mgiga <mar...@mgiga.com.br> wrote:
>
> So you mean that this situation (congestion drops) is more likely to occur
> when the system in general is idle than when it is processing a large amount
> of traffic?
>
> Best Regards
>
> Marcos
>
> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On behalf of Klement
> Sekera via lists.fd.io
> Sent: Friday, 13 November 2020 12:15
> To: Elias Rudberg <elias.rudb...@bahnhof.net>
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Increasing NAT worker handoff frame queue size
> NAT_FQ_NELTS to avoid congestion drops?
>
> Hi Elias,
>
> I’ve already debugged this and came to the conclusion that it’s the infra
> which is the weak link. I was seeing congestion drops at mild load, but not
> at full load. Issue is that with handoff, there is uneven workload. For
> simplicity’s sake, just consider thread 1 handing off all the traffic to
> thread 2. What happens is that for thread 1, the job is much easier, it just
> does some ip4 parsing and then hands packet to thread 2, which actually does
> the heavy lifting of hash inserts/lookups/translation etc.
> 64 element queue can hold 64 frames, one extreme is 64 1-packet frames,
> totalling 64 packets, other extreme is 64 255-packet frames, totalling ~16k
> packets. What happens is this: thread 1 is mostly idle and just picking a few
> packets from NIC and every one of these small frames creates an entry in the
> handoff queue. Now thread 2 picks one element from the handoff queue and
> deals with it before picking another one. If the queue has only 3-packet or
> 10-packet elements, then thread 2 can never really get into what VPP excels
> in - bulk processing.
>
> Q: Why doesn’t it pick as many packets as possible from the handoff queue?
> A: It’s not implemented.
>
> I already wrote a patch for it, which made all congestion drops which I saw
> (in above synthetic test case) disappear. Mentioned patch
> https://gerrit.fd.io/r/c/vpp/+/28980 is sitting in gerrit.
>
> Would you like to give it a try and see if it helps your issue? We
> shouldn’t need big queues under mild loads anyway …
>
> Regards,
> Klement
>
>> On 13 Nov 2020, at 16:03, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>>
>> Hello VPP experts,
>>
>> We are using VPP for NAT44 and we get some "congestion drops", in a
>> situation where we think VPP is far from overloaded in general. Then
>> we started to investigate if it would help to use a larger handoff
>> frame queue size. In theory at least, allowing a longer queue could
>> help avoiding drops in case of short spikes of traffic, or if it
>> happens that some worker thread is temporarily busy for whatever
>> reason.
>>
>> The NAT worker handoff frame queue size is hard-coded in the
>> NAT_FQ_NELTS macro in src/plugins/nat/nat.h where the current value
>> is 64. The idea is that putting a larger value there could help.
>>
>> We have run some tests where we changed the NAT_FQ_NELTS value from
>> 64 to a range of other values, each time rebuilding VPP and running
>> an identical test, a test case that is to some extent trying to mimic
>> our real traffic, although of course it is simplified. The test runs
>> many iperf3 tests simultaneously using TCP, combined with some UDP
>> traffic chosen to trigger VPP to create more new sessions (to make
>> the NAT "slowpath" happen more).
>>
>> The following NAT_FQ_NELTS values were tested:
>> 16
>> 32
>> 64 <-- current value
>> 128
>> 256
>> 512
>> 1024
>> 2048 <-- best performance in our tests
>> 4096
>> 8192
>> 16384
>> 32768
>> 65536
>> 131072
>>
>> In those tests, performance was very bad for the smallest
>> NAT_FQ_NELTS values of 16 and 32, while values larger than 64 gave
>> improved performance. The best results in terms of throughput were
>> seen for NAT_FQ_NELTS=2048. For even larger values than that, we got
>> reduced performance compared to the 2048 case.
>>
>> The tests were done for VPP 20.05 running on a Ubuntu 18.04 server
>> with a 12-core Intel Xeon CPU and two Mellanox mlx5 network cards.
>> The number of NAT threads was 8 in some of the tests and 4 in some of
>> the tests.
>>
>> According to these tests, the effect of changing NAT_FQ_NELTS can be
>> quite large. For example, for one test case chosen such that
>> congestion drops were a significant problem, the throughput increased
>> from about 43 to 90 Gbit/second with the amount of congestion drops
>> per second reduced to about one third. In another kind of test,
>> throughput increased by about 20% with congestion drops reduced to
>> zero. Of course such results depend a lot on how the tests are
>> constructed. But anyway, it seems clear that the choice of
>> NAT_FQ_NELTS value can be important and that increasing it would be
>> good, at least for the kind of usage we have tested now.
>>
>> Based on the above, we are considering changing NAT_FQ_NELTS from 64
>> to a larger value and start trying that in our production environment
>> (so far we have only tried it in a test environment).
>>
>> Were there specific reasons for setting NAT_FQ_NELTS to 64?
>>
>> Are there some potential drawbacks or dangers of changing it to a
>> larger value?
>>
>> Would you consider changing to a larger value in the official VPP
>> code?
>>
>> Best regards,
>> Elias
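
For anyone following the thread, the queue behaviour Klement describes in the quoted message can be illustrated with a toy model. This is only a sketch and not VPP code: the function names, the tick-based timing and the cost figures (a fixed per-dispatch cost plus a per-packet cost) are assumptions made up for illustration. Only the queue size of 64 and the 255-packet frame limit come from the thread. The model compares a consumer that handles one queue element per dispatch with one that merges small frames before processing them.

```python
from collections import deque

QUEUE_NELTS = 64   # NAT_FQ_NELTS value discussed in the thread
FRAME_MAX = 255    # largest frame size mentioned in the thread


def queue_capacity_packets(nelts, packets_per_frame):
    """Packets held by a full queue if every frame has packets_per_frame."""
    return nelts * packets_per_frame


def simulate(batch_dequeue, nelts=QUEUE_NELTS, frame_size=3,
             frames_per_tick=4, ticks=1000,
             fixed_cost=30, per_packet_cost=1, budget_per_tick=100):
    """Return the number of frames dropped because the queue was full.

    All cost and timing parameters are hypothetical; they only encode the
    idea that each dispatch has a fixed overhead regardless of frame size.
    """
    queue = deque()
    drops = 0
    budget = 0
    for _ in range(ticks):
        # Producer (thread 1): enqueue small frames, drop when queue is full.
        for _ in range(frames_per_tick):
            if len(queue) >= nelts:
                drops += 1
            else:
                queue.append(frame_size)
        # Consumer (thread 2): unused time is not banked, overruns carry over.
        budget = min(budget + budget_per_tick, budget_per_tick)
        while queue and budget > 0:
            packets = queue.popleft()
            if batch_dequeue:
                # Merge further queue elements into one dispatch.
                while queue and packets + queue[0] <= FRAME_MAX:
                    packets += queue.popleft()
            budget -= fixed_cost + per_packet_cost * packets
    return drops


if __name__ == "__main__":
    print("capacity, 1-packet frames:  ", queue_capacity_packets(QUEUE_NELTS, 1))
    print("capacity, 255-packet frames:", queue_capacity_packets(QUEUE_NELTS, FRAME_MAX))
    print("drops, one element per dispatch:", simulate(batch_dequeue=False))
    print("drops, batched dequeue:         ", simulate(batch_dequeue=True))
```

With these made-up costs, the one-element-per-dispatch consumer falls behind on 3-packet frames and the queue eventually overflows, while the batching consumer keeps up at the same offered load. The real numbers depend entirely on actual per-dispatch and per-packet costs, which this model does not measure.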