Hi Luca,

That is most probably the reason. We don’t support raw sockets.

Florin
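Illustrative aside, not part of the original exchange: since raw sockets are not supported, one raw-socket-free way to get a TCP latency sample against the host stack is to time the three-way handshake of an ordinary stream socket, which VPP can serve (e.g. through VCL). The sketch below uses only standard POSIX calls; the address and port are placeholders and it assumes something (such as an iperf3 server) is listening there.

/* Time connect() (SYN / SYN-ACK) with a regular stream socket.
 * hping instead needs socket(AF_INET, SOCK_RAW, ...), which the VPP
 * host stack does not support. Address and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int
main (void)
{
  struct sockaddr_in sa;
  memset (&sa, 0, sizeof (sa));
  sa.sin_family = AF_INET;
  sa.sin_port = htons (5201);                    /* placeholder: iperf3 port */
  inet_pton (AF_INET, "10.0.0.2", &sa.sin_addr); /* placeholder address */

  int fd = socket (AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return 1;

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  int rv = connect (fd, (struct sockaddr *) &sa, sizeof (sa));
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
  printf ("connect %s in %.1f us\n", rv == 0 ? "succeeded" : "failed", us);
  close (fd);
  return 0;
}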
> On May 14, 2018, at 1:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>
> Hi Florin,
>
> Session enable does not help.
> hping is using raw sockets so this must be the reason.
>
> Luca
>
>
> From: Florin Coras <fcoras.li...@gmail.com>
> Date: Friday 11 May 2018 at 23:02
> To: Luca Muscariello <lumuscar+f...@cisco.com>
> Cc: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>
> Hi Luca,
>
> Not really sure why the kernel is slow to reply to ping. Maybe it has to do with scheduling, but that’s just guesswork.
>
> I’ve never tried hping. Let me see if I understand your scenario: while running iperf you tried to hping the stack and you got no RST back? Anything interesting in the “sh error” counters? If iperf wasn’t running, did you first enable the stack with “session enable”?
>
> Florin
>
>
>> On May 11, 2018, at 3:19 AM, Luca Muscariello <lumuscar+f...@cisco.com> wrote:
>>
>> Florin,
>>
>> A few more comments about latency. Some numbers, in ms, in the table below.
>> This is ping and iperf3 running concurrently; in the VPP case it is vppctl ping.
>>
>>          Kernel w/ load   Kernel w/o load   VPP w/ load   VPP w/o load
>> Min.         0.1920           0.0610           0.0573        0.03480
>> 1st Qu.      0.2330           0.1050           0.2058        0.04640
>> Median       0.2450           0.1090           0.2289        0.04880
>> Mean         0.2458           0.1153           0.2568        0.05096
>> 3rd Qu.      0.2720           0.1290           0.2601        0.05270
>> Max.         0.2800           0.1740           0.6926        0.09420
>>
>> In short: ICMP packets have a lower latency under load. I could interpret this as being due to vectorization, maybe. Also, the Linux kernel is slower to reply to ping by a 2x factor (system call latency?): 115 us vs 50 us in VPP; with load there is no difference. In this test Linux TCP is using TSO.
>>
>> While trying to use hping to get a latency sample with TCP instead of ICMP, we noticed that the VPP TCP stack does not reply with a RST, so we don’t get any sample. Is that expected behavior?
>>
>> Thanks
>>
>> Luca
>>
>>
>> From: Luca Muscariello <lumus...@cisco.com>
>> Date: Thursday 10 May 2018 at 13:52
>> To: Florin Coras <fcoras.li...@gmail.com>
>> Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>
>> MTU had no effect, just statistical fluctuations in the test reports. Sorry for misreporting the info.
>>
>> We are exploiting vectorization, as we have a single memif channel per transport socket, so we can control the size of the batches dynamically.
>>
>> In theory, the amount of outstanding data from the transport should be controlled in bytes for batching to be useful and not harmful, since frame sizes can vary a lot. But I’m not aware of a queue abstraction from DPDK to control that from VPP.
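Illustrative aside, not part of the original exchange: the byte budget Luca describes could look roughly like the sketch below, where the number of frames emitted per dispatch is bounded by bytes rather than by a packet count. The descriptor type and function are hypothetical placeholders, not a VPP, memif, or DPDK API.

/* Hypothetical sketch: pick how many queued frames fit in a byte budget
 * (always at least one), so a burst of jumbo frames does not turn into a
 * disproportionately large batch. The descriptor type and the driver loop
 * around it are made up for illustration. */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
  uint32_t len; /* bytes in this queued frame */
} tx_desc_t;

/* Return the number of descriptors to emit in this dispatch. */
uint32_t
batch_by_bytes (const tx_desc_t * q, uint32_t n_queued, uint32_t byte_budget)
{
  uint32_t i, bytes = 0;
  for (i = 0; i < n_queued; i++)
    {
      if (i > 0 && bytes + q[i].len > byte_budget)
        break;
      bytes += q[i].len;
    }
  return i;
}

int
main (void)
{
  /* Mixed frame sizes: two jumbo frames, two MTU-sized, one tiny ACK. */
  tx_desc_t q[] = { { 9000 }, { 1460 }, { 1460 }, { 9000 }, { 64 } };
  uint32_t n = batch_by_bytes (q, 5, 16 * 1024);
  printf ("emit %u of 5 queued frames in this dispatch\n", n);
  return 0;
}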
>>
>> From: Florin Coras <fcoras.li...@gmail.com>
>> Date: Wednesday 9 May 2018 at 18:23
>> To: Luca Muscariello <lumus...@cisco.com>
>> Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>
>> Hi Luca,
>>
>> We don’t yet support pmtu in the stack, so tcp uses a fixed 1460 mtu; unless you changed that, we shouldn’t generate jumbo packets. If we do, I’ll have to take a look at it :)
>>
>> If you already had your transport protocol, using memif is the natural way to go. Using the session layer makes sense only if you can implement your transport within vpp in a way that leverages vectorization, or if it can leverage the existing transports (see for instance the TLS implementation).
>>
>> Until today [1] the stack did allow for excessive batching (generation of multiple frames in one dispatch loop), but we’re now restricting that to one. This is still far from proper pacing, which is on our todo list.
>>
>> Florin
>>
>> [1] https://gerrit.fd.io/r/#/c/12439/
>>
>>
>>> On May 9, 2018, at 4:21 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>>>
>>> Florin,
>>>
>>> Thanks for the slide deck, I’ll check it soon.
>>>
>>> BTW, the VPP/DPDK test was using jumbo frames by default, so the TCP stack had a little advantage over the Linux TCP stack, which was using 1500B by default.
>>>
>>> By manually setting the DPDK MTU to 1500B the goodput goes down to 8.5Gbps, which compares to 4.5Gbps for Linux w/o TSO. Also, congestion window adaptation is not the same.
>>>
>>> BTW, for what we’re doing it is difficult to reuse the VPP session layer as it is. Our transport stack uses a different kind of namespace, and mux/demux is also different.
>>>
>>> We are using memif as the underlying driver, which does not seem to be a bottleneck, since we can also control batching there. Also, we have our own shared memory downstream of memif inside VPP, through a plugin.
>>>
>>> What we observed is that delay-based congestion control does not like VPP batching much (batching in general), and we are using DBCG.
>>>
>>> Linux TSO has the same problem, but it has TCP pacing to limit the bad effects of bursts on RTT/losses and flow control laws.
>>>
>>> I guess you’re aware of these issues already.
>>>
>>> Luca
>>>
>>>
>>> From: Florin Coras <fcoras.li...@gmail.com>
>>> Date: Monday 7 May 2018 at 22:23
>>> To: Luca Muscariello <lumus...@cisco.com>
>>> Cc: Luca Muscariello <lumuscar+f...@cisco.com>, "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
>>> Subject: Re: [vpp-dev] TCP performance - TSO - HW offloading in general.
>>>
>>> Yes, the whole host stack uses shared memory segments and fifos that the session layer manages. For a brief description of the session layer see [1, 2]. Apart from that, unfortunately, we don’t have any other dev documentation. src/vnet/session/segment_manager.[ch] has some good examples of how to allocate segments and fifos. Under application_interface.h check app_[send|recv]_[stream|dgram]_raw for examples of how to read/write to the fifos.
>>>
>>> Now, regarding writing to the fifos: they are lock free, but size increments are atomic, since the assumption is that we’ll always have one reader and one writer. Still, batching helps. VCL doesn’t do it, but iperf probably does.
>>>
>>> Hope this helps,
>>> Florin
>>>
>>> [1] https://wiki.fd.io/view/VPP/HostStack/SessionLayerArchitecture
>>> [2] https://wiki.fd.io/images/1/15/Vpp-hoststack-kc-eu-18.pdf
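Illustrative aside, not part of the original exchange: the one-reader/one-writer discipline described above can be sketched as a minimal single-producer/single-consumer byte fifo in which only the cursors are updated atomically. This is a simplified stand-in, not the actual svm_fifo_t code.

/* Minimal SPSC byte fifo: no locks, one writer and one reader, and only
 * the head/tail cursors are updated atomically. Cursors are free-running
 * 32-bit counters; unsigned wraparound keeps the arithmetic correct. */
#include <stdatomic.h>
#include <stdint.h>

#define FIFO_SIZE 4096 /* must be a power of two */

typedef struct
{
  uint8_t data[FIFO_SIZE];
  _Atomic uint32_t tail; /* written only by the producer (enqueue) */
  _Atomic uint32_t head; /* written only by the consumer (dequeue) */
} spsc_fifo_t;

/* Producer: copy up to len bytes in, return the bytes actually enqueued. */
uint32_t
fifo_enqueue (spsc_fifo_t * f, const uint8_t * src, uint32_t len)
{
  uint32_t tail = atomic_load_explicit (&f->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit (&f->head, memory_order_acquire);
  uint32_t free_space = FIFO_SIZE - (tail - head);
  if (len > free_space)
    len = free_space;
  for (uint32_t i = 0; i < len; i++)
    f->data[(tail + i) & (FIFO_SIZE - 1)] = src[i];
  /* Publish the new tail only after the data is in place. */
  atomic_store_explicit (&f->tail, tail + len, memory_order_release);
  return len;
}

/* Consumer: copy up to len bytes out, return the bytes actually dequeued. */
uint32_t
fifo_dequeue (spsc_fifo_t * f, uint8_t * dst, uint32_t len)
{
  uint32_t head = atomic_load_explicit (&f->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit (&f->tail, memory_order_acquire);
  uint32_t used = tail - head;
  if (len > used)
    len = used;
  for (uint32_t i = 0; i < len; i++)
    dst[i] = f->data[(head + i) & (FIFO_SIZE - 1)];
  /* Free the space only after the data has been copied out. */
  atomic_store_explicit (&f->head, head + len, memory_order_release);
  return len;
}

With a single producer and a single consumer, acquire/release ordering on the two cursors is all that is needed: each side only ever races against the other side's cursor, never against another writer of the same field.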
>>>
>>>
>>>> On May 7, 2018, at 11:35 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>>>>
>>>> Florin,
>>>>
>>>> So the TCP stack does not connect to VPP using memif. I’ll check the shared memory you mentioned.
>>>>
>>>> For our transport stack we’re using memif. Nothing to do with TCP though.
>>>>
>>>> From iperf3 to VPP there must be copies anyway. There must be some batching with timing, though, while doing these copies.
>>>>
>>>> Is there any doc of svm_fifo usage?
>>>>
>>>> Thanks
>>>> Luca
>>>>
>>>> On 7 May 2018, at 20:00, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi Luca,
>>>>>
>>>>> I guess, as you did, that it’s vectorization. VPP is really good at pushing packets, whereas Linux is good at using all the hw optimizations.
>>>>>
>>>>> The stack uses its own shared memory mechanisms (check svm_fifo_t), but given that you did the testing with iperf3, I suspect the edge is not there. That is, I guess they’re not abusing syscalls with lots of small writes. Moreover, the fifos are not zero-copy: apps do have to write to the fifo, and vpp has to packetize that data.
>>>>>
>>>>> Florin
>>>>>
>>>>>
>>>>>> On May 7, 2018, at 10:29 AM, Luca Muscariello (lumuscar) <lumus...@cisco.com> wrote:
>>>>>>
>>>>>> Hi Florin
>>>>>>
>>>>>> Thanks for the info.
>>>>>>
>>>>>> So, how do you explain that the VPP TCP stack beats the Linux implementation by doubling the goodput? Does it come from vectorization? Any special memif optimization underneath?
>>>>>>
>>>>>> Luca
>>>>>>
>>>>>> On 7 May 2018, at 18:17, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Luca,
>>>>>>>
>>>>>>> We don’t yet support TSO because it requires support within all of vpp (think tunnels). Still, it’s on our list.
>>>>>>>
>>>>>>> As for crypto offload, we do have support for IPSec offload with QAT cards, and we’re now working with Ping and Ray from Intel on accelerating the TLS OpenSSL engine, also with QAT cards.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Florin
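Illustrative aside, not part of the original exchange and not VPP code: to show the kind of plumbing TSO implies on the driver side, the sketch below uses the 2018-era DPDK offload API (DEV_TX_OFFLOAD_TCP_TSO, PKT_TX_TCP_SEG). Queue counts, header lengths, and the MSS are placeholder assumptions.

/* Rough DPDK-level sketch of per-port and per-packet TSO plumbing. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Request TSO in the port's TX offloads when configuring it. */
int
configure_port_with_tso (uint16_t port_id)
{
  struct rte_eth_dev_info dev_info;
  struct rte_eth_conf conf = { 0 };

  rte_eth_dev_info_get (port_id, &dev_info);
  if (!(dev_info.tx_offload_capa & DEV_TX_OFFLOAD_TCP_TSO))
    return -1; /* the NIC cannot segment for us */

  conf.txmode.offloads = DEV_TX_OFFLOAD_TCP_TSO;
  return rte_eth_dev_configure (port_id, 1 /* rx queues */,
                                1 /* tx queues */, &conf);
}

/* Mark one TCP/IPv4 packet so the NIC segments it into MSS-sized frames. */
void
mark_mbuf_for_tso (struct rte_mbuf *m, uint16_t mss)
{
  m->l2_len = 14; /* Ethernet header, no VLAN (placeholder) */
  m->l3_len = 20; /* IPv4 header, no options (placeholder) */
  m->l4_len = 20; /* TCP header, no options (placeholder) */
  m->tso_segsz = mss;
  m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_IPV4 | PKT_TX_IP_CKSUM;
}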
>>>>>>>
>>>>>>>
>>>>>>>> On May 7, 2018, at 7:53 AM, Luca Muscariello <lumuscar+f...@cisco.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> A few questions about the TCP stack and HW offloading.
>>>>>>>> Below is the experiment under test.
>>>>>>>>
>>>>>>>> +------------+   DPDK-10GE   +------------+   DPDK-10GE   +------------+
>>>>>>>> |   Iperf3   |               |            |               |   Iperf3   |
>>>>>>>> |    TCP     +---------------+Nexus Switch+---------------+    TCP     |
>>>>>>>> | LXC | VPP  |               |            |               | VPP | LXC  |
>>>>>>>> +------------+               +------------+               +------------+
>>>>>>>>
>>>>>>>> Using the Linux kernel w/ or w/o TSO I get an iperf3 goodput of 9.5Gbps or 4.5Gbps, respectively.
>>>>>>>> Using the VPP TCP stack I get 9.2Gbps, i.e. roughly the same max goodput as Linux w/ TSO.
>>>>>>>>
>>>>>>>> Is there any TSO implementation already in VPP one can take advantage of?
>>>>>>>>
>>>>>>>> Side question: is there any crypto offloading service available in VPP? Essentially for the computation of RSA-1024/2048 and ECDSA-192/256 signatures.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Luca