Re: [PATCH v2] kni: fix possible alloc_q starvation when mbufs are exhausted
On Thu, Nov 10, 2022 at 12:39 AM Stephen Hemminger <step...@networkplumber.org> wrote: > On Wed, 9 Nov 2022 14:04:34 +0800 > Yangchao Zhou wrote: > > > In some scenarios, mbufs returned by rte_kni_rx_burst are not freed > > immediately. So kni_allocate_mbufs may fail, but we don't know. > > > > Even worse, when alloc_q is completely exhausted, kni_net_tx in > > rte_kni.ko will drop all tx packets. kni_allocate_mbufs is never > > called again, even if the mbufs are eventually freed. > > > > In this patch, we always try to allocate mbufs for alloc_q. > > > > Don't worry about alloc_q being filled with too many mbufs; in fact, > > the old logic will gradually fill up alloc_q. > > Also, the cost of more calls to kni_allocate_mbufs should be acceptable. > > > > Fixes: 3e12a98fe397 ("kni: optimize Rx burst") > > Cc: hem...@freescale.com > > Cc: sta...@dpdk.org > > > > Signed-off-by: Yangchao Zhou > > Since fifo_get returning 0 (no buffers) is very common, would this > change impact performance? > It does add a little cost, but there is no extra mbuf allocation and deallocation. > > If the problem is pool draining, it might be better to make the pool > bigger. > Yes, using a larger pool can avoid this problem. But this may lead to resource wastage, and sizing the pool correctly is a challenge for developers as it involves the mempool caching mechanism, IP fragment cache, ARP cache, NIC txq, other transit queues, etc. Mbuf allocation failures may also occur in many NIC drivers, but there the mbuf is not taken out of the ring on failure, so it can be recovered by a later retry. KNI currently does not have such a take-out and recovery mechanism. It is also possible to consider implementing something similar to the NIC drivers, but with more changes and other overheads.
Re: [PATCH v3] kni: fix possible alloc_q starvation when mbufs are exhausted
Hi Ferruh, In my case, the traffic is not large, so I can't see the impact. I also tested under high load (>2 Mpps with 2 DPDK cores and 2 kernel threads) and found no significant difference in performance either. I think the reason should be that it will rarely hit 'kni_fifo_count(kni->alloc_q) == 0' under high load. On Tue, Jan 3, 2023 at 8:47 PM Ferruh Yigit wrote: > On 12/30/2022 4:23 AM, Yangchao Zhou wrote: > > In some scenarios, mbufs returned by rte_kni_rx_burst are not freed > > immediately. So kni_allocate_mbufs may fail, but we don't know. > > > > Even worse, when alloc_q is completely exhausted, kni_net_tx in > > rte_kni.ko will drop all tx packets. kni_allocate_mbufs is never > > called again, even if the mbufs are eventually freed. > > > > In this patch, we try to allocate mbufs for alloc_q when it is empty. > > > > According to historical experience, the performance bottleneck of KNI > > is often the usleep_range of the kni thread in rte_kni.ko. > > The check of kni_fifo_count is trivial and the cost should be acceptable. > > > > Hi Yangchao, > > Are you observing any performance impact with this change in your use case?
> > Fixes: 3e12a98fe397 ("kni: optimize Rx burst")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Yangchao Zhou
> > ---
> >  lib/kni/rte_kni.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/kni/rte_kni.c b/lib/kni/rte_kni.c
> > index 8ab6c47153..bfa6a001ff 100644
> > --- a/lib/kni/rte_kni.c
> > +++ b/lib/kni/rte_kni.c
> > @@ -634,8 +634,8 @@ rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
> >  {
> >  	unsigned int ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);
> >
> > -	/* If buffers removed, allocate mbufs and then put them into alloc_q */
> > -	if (ret)
> > +	/* If buffers removed or alloc_q is empty, allocate mbufs and then put them into alloc_q */
> > +	if (ret || (kni_fifo_count(kni->alloc_q) == 0))
> >  		kni_allocate_mbufs(kni);
> >
> >  	return ret;
[dpdk-dev] segmented recv ixgbevf
Hey Folks, I ran into the same issue that Alex is describing here, and I wanted to expand just a little bit on his comments, as the documentation isn't very clear. Per the documentation, the two arguments to rte_pktmbuf_pool_init() are a pointer to the memory pool that contains the newly-allocated mbufs and an opaque pointer. The docs are pretty vague about what the opaque pointer should point to or what its contents mean; all of the examples I looked at just pass a NULL pointer. The docs for this function describe the opaque pointer this way: "A pointer that can be used by the user to retrieve useful information for mbuf initialization. This pointer comes from the init_arg parameter of rte_mempool_create() <http://www.dpdk.org/doc/api/rte__mempool_8h.html#a7dc1d01a45144e3203c36d1800cb8f17> ." This is a little bit misleading. Under the covers, rte_pktmbuf_pool_init() doesn't treat the opaque pointer as a pointer at all. Rather, it just converts it to a uint16_t which contains the desired mbuf size. If it receives 0 (in other words, if you passed in a NULL pointer), it will use 2048 bytes + RTE_PKTMBUF_HEADROOM. Hence, incoming jumbo frames will be segmented into 2K chunks. Any chance we could get an improvement to the documentation for this parameter? It seems as though the opaque pointer isn't a pointer and probably shouldn't be opaque. Hope this helps the next person who comes across this behavior. -- Matt Laswell infinite io, inc. On Thu, Oct 30, 2014 at 7:48 AM, Alex Markuze wrote: > For posterity. > > 1. When using an MTU larger than 2K, it's advised to provide the value > to rte_pktmbuf_pool_init. > 2. ixgbevf rounds down the ("MBUF size" - RTE_PKTMBUF_HEADROOM) to the > nearest 1K multiple when deciding on the receiving capabilities [buffer > size] of the buffers in the pool. > The SRRCTL register is what's consulted here, for some reason. >
[dpdk-dev] A question about hugepage initialization time
Hey Folks, Our DPDK application deals with very large in-memory data structures, and can potentially use tens or even hundreds of gigabytes of hugepage memory. During the course of development, we've noticed that as the number of huge pages increases, the memory initialization time during EAL init gets to be quite long, lasting several minutes at present. The growth in init time doesn't appear to be linear, which is concerning. This is a minor inconvenience for us and our customers, as memory initialization makes our boot times a lot longer than they would otherwise be. Also, my experience has been that really long operations often are hiding errors - what you think is merely a slow operation is actually a timeout of some sort, often due to misconfiguration. This leads to two questions: 1. Does the long initialization time suggest that there's an error happening under the covers? 2. If not, is there any simple way that we can shorten memory initialization time? Thanks in advance for your insights. -- Matt Laswell laswell at infiniteio.com infinite io, inc.
[dpdk-dev] A question about hugepage initialization time
Hey Everybody, Thanks for the feedback. Yeah, we're pretty sure that the amount of memory we work with is atypical, and we're hitting something that isn't an issue for most DPDK users. To clarify, yes, we're using 1GB hugepages, and we set them up via hugepagesz and hugepages= in our kernel's grub line. We find that when we use four 1GB huge pages, eal memory init takes a couple of seconds, which is no big deal. When we use 128 1GB pages, though, memory init can take several minutes. The concern is that we will very likely use even more memory in the future. Our boot time is mostly just a nuisance now; nonlinear growth in memory init time may transform it into a larger problem. We've had to disable transparent hugepages due to latency issues with in-memory databases. I'll have to look at the possibility of alternative memset implementations. Perhaps some profiler time is in my future. Again, thanks to everybody for the useful information. -- Matt Laswell laswell at infiniteio.com infinite io, inc. On Tue, Dec 9, 2014 at 1:06 PM, Matthew Hall wrote: > On Tue, Dec 09, 2014 at 10:33:59AM -0600, Matt Laswell wrote: > > Our DPDK application deals with very large in memory data structures, and > > can potentially use tens or even hundreds of gigabytes of hugepage > memory. > > What you're doing is an unusual use case and this is open source code where > nobody might have tested and QA'ed this yet. > > So my recommendation would be adding some rte_log statements to measure the > various steps in the process to see what's going on. Also using the Linux > Perf > framework to do low-overhead sampling-based profiling, and making sure > you've > got everything compiled with debug symbols so you can see what's consuming > the > execution time. 
> > You might find that it makes sense to use some custom allocators like > jemalloc > alongside of the DPDK allocators, including perhaps "transparent hugepage > mode" in your process, and some larger page sizes to reduce the number of > pages. > > You can also use these handy kernel options: hugepagesz= hugepages=N . > This creates guaranteed-contiguous known-good hugepages during boot which > initialize much more quickly with less trouble and fewer glitches in my > experience. > > https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt > https://www.kernel.org/doc/Documentation/vm/transhuge.txt > > There is no one-size-fits-all solution but these are some possibilities. > > Good Luck, > Matthew. >
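[Editor's note] The boot-time reservation both posts refer to goes on the kernel command line; for 1GB pages it looks like the line below (the page count is illustrative, matching the 128 pages mentioned earlier in the thread; the parameter names are from the kernel's hugetlbpage documentation):

```
default_hugepagesz=1G hugepagesz=1G hugepages=128
```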
[dpdk-dev] Ability to/impact of running with smaller page sizes
Hey Folks, In a project I'm working on, I'm seeing some design considerations that push me towards the use of smaller memory page sizes. I'm curious - is it possible in practical terms to run DPDK without hugepages? If so, does anybody have any practical experience (or a back-of-the-envelope estimate) of how badly such a configuration would hurt performance? For the sake of argument, assume that virtually all of the memory being used is in pre-allocated mempools (e.g. lots of rte_mempool_create(), very little rte_malloc()). Thanks in advance for your help. -- Matt Laswell
[dpdk-dev] Ability to/impact of running with smaller page sizes
Thanks everybody, It sounds as though what I'm looking for may be possible, especially with 1.7, but will require some tweaking and there will most definitely be a performance hit. That's great information. This is still just an experiment for us, and it's not at all guaranteed that I'm going to move towards smaller pages, but I very much appreciate the insights. -- Matt Laswell On Tue, Jul 1, 2014 at 6:51 AM, Burakov, Anatoly wrote: > Hi Matt, > > > I'm curious - is it possible in practical terms to run DPDK without > hugepages? > > Starting with release 1.7.0, support for VFIO was added, which allows > using DPDK without hugepages at all (including RX/TX rings) via the > --no-huge command-line parameter. Bear in mind though that you'll have to > have IOMMU/VT-d enabled (i.e. no VM support, only host-based) and also have > a supported kernel version (3.6+) as well to use VFIO, the memory size will > be limited to 1G, and it won't work with multiprocess. I don't have any > performance figures on that unfortunately. > > Best regards, > Anatoly Burakov > DPDK SW Engineer >
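[Editor's note] The flag Anatoly mentions is an ordinary EAL option; a hugepage-free invocation would look something like the line below (the application name is hypothetical; -c/-n/-m/--no-huge are real EAL parameters of that era, and -m 1024 reflects the 1G limit he notes):

```
./my_dpdk_app -c 0x3 -n 4 --no-huge -m 1024 -- <application args>
```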
[dpdk-dev] DPDK with Ubuntu 14.04?
Hey Folks, I know that official support hasn't moved past Ubuntu 12.04 LTS yet, but does anybody have any practical experience running with 14.04 LTS? My team has run into one compilation error so far with 1.7, but other than that things look OK at first blush. I'd like to move my product to 14.04 for a variety of reasons, but would hate to spend time chasing down subtle incompatibilities. I'm guessing we're not the first ones to try this... Thanks. -- Matt Laswell infinite io
[dpdk-dev] DPDK with Ubuntu 14.04?
Thanks Roger, We saw similar issues with regard to kcompat.h. Can I ask if you've done anything beyond the example applications under 14.04? -- Matt Laswell infinite io On Thu, Jul 10, 2014 at 7:07 PM, Wiles, Roger Keith <keith.wiles at windriver.com> wrote: > The one problem I had with 14.04 was the kcompat.h file. It looks like a > hash routine has changed its arguments. I edited the kcompat.h file and was > able to change the code to allow DPDK to build. It is not a fix, but it > worked for me. > > lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h > > /* Changed the next line to use (3,13,8) instead of (3,14,0) KeithW */ > #if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,13,8) ) > #if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >= > RHEL_RELEASE_VERSION(7,0))) > #ifdef NETIF_F_RXHASH > #define PKT_HASH_TYPE_L3 0 > > *Hope that works.* > > *Keith **Wiles*, Principal Technologist with CTO office, *Wind River* > mobile 972-213-5533 > > On Jul 10, 2014, at 5:56 PM, Matt Laswell wrote: > > Hey Folks, > > I know that official support hasn't moved past Ubuntu 12.04 LTS yet, but > does anybody have any practical experience running with 14.04 LTS? My team > has run into one compilation error so far with 1.7, but other than that > things look OK at first blush. I'd like to move my product to 14.04 for a > variety of reasons, but would hate to spend time chasing down subtle > incompatibilities. I'm guessing we're not the first ones to try this... > > Thanks. > > -- > Matt Laswell > infinite io > > >
[dpdk-dev] Question about ASLR
Hey Folks, A colleague noticed warnings in section 23.3 of the programmer's guide about the use of address space layout randomization with multiprocess DPDK applications. And, upon inspection, it appears that ASLR is enabled on our target systems. We've never seen a problem that we could trace back to ASLR, and we've never seen a warning during EAL memory initialization, either, which is strange. Given the choice, we would prefer to keep ASLR for security reasons. Given that in our problem domain:

- We are running a multiprocess DPDK application
- We run only one DPDK application, which is a single compiled binary
- We have exactly one process running per logical core
- We're OK with interrupts coming just to the primary
- We handle interaction from our control plane via a separate shared memory space

Is it OK in this circumstance to leave ASLR enabled? I think it probably is, but would love to hear reasons why not and/or pitfalls that we need to avoid. Thanks in advance. -- Matt Laswell *infinite io*
[dpdk-dev] Question about ASLR
Bruce, That's tremendously helpful. Thanks for the information. -- Matt Laswell *infinite io* On Sun, Sep 7, 2014 at 2:52 PM, Richardson, Bruce <bruce.richardson at intel.com> wrote: > > -Original Message- > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell > > Sent: Friday, September 05, 2014 7:57 PM > > To: dev at dpdk.org > > Subject: [dpdk-dev] Question about ASLR > > > > Hey Folks, > > > > A colleague noticed warnings in section 23.3 of the programmer's guide > > about the use of address space layout randomization with multiprocess > DPDK > > applications. And, upon inspection, it appears that ASLR is enabled on > our > > target systems. We've never seen a problem that we could trace back to > > ASLR, and we've never seen a warning during EAL memory initialization, > > either, which is strange. > > > > Given the choice, we would prefer to keep ASLR for security reasons. > Given > > that in our problem domain: > >- We are running a multiprocess DPDK application > >- We run only one DPDK application, which is a single compiled binary > >- We have exactly one process running per logical core > >- We're OK with interrupts coming just to the primary > >- We handle interaction from our control plane via a separate shared > > memory space > > > > Is it OK in this circumstance to leave ASLR enabled? I think it probably > > is, but would love to hear reasons why not and/or pitfalls that we need > to > > avoid. > > > > Thanks in advance. > > > > -- > > Matt Laswell > > *infinite io* > > Having ASLR enabled will just introduce a small element of uncertainty in > the application startup process, as the memory mappings used by your app > will move about from run to run. In certain cases we've seen some of the > secondary multi-process application examples fail to start at random once > every few hundred times (IIRC - this was some time back).
> Presumably the chances of the secondary failing to start will vary > depending on how ASLR has adjusted the memory mappings in the primary. > So, with ASLR on, we've found occasionally that mappings will fail, in > which case the solution is really just to retry the app again and ASLR will > re-randomise it differently and it will likely start. Disabling ASLR gives > repeatability in this regard - your app will always start successfully - or > if there is something blocking the memory maps from being replicated - > always fail to start (in which case you try passing EAL parameters to hint > the primary process to use different mapping addresses). > > In your case, you are not seeing any problems thus far, so likely if > secondary process startup failures do occur, they should hopefully work > fine by just trying again! Whether this element of uncertainty is > acceptable or not is your choice :-). One thing you could try, to find out > what the issues might be with your app, is to just try running it > repeatedly in a script, killing it after a couple of seconds. This should > tell you how often, if ever, initialization failures are to be expected > when using ASLR. > > Hope this helps, > Regards, > /Bruce >
[dpdk-dev] Beyond DPDK 2.0
On Fri, Apr 24, 2015 at 12:39 PM, Jay Rolette wrote: > > I can tell you that if DPDK were GPL-based, my company wouldn't be using > it. I suspect we wouldn't be the only ones... > I want to emphasize this point. It's unsurprising that Jay and I agree, since we work together. But I can say with quite a bit of confidence that my last employer also would stop using DPDK if it were GPL licensed. Or, if they didn't jettison it entirely, they would never move beyond the last BSD-licensed version. If you want to incentivize companies to support DPDK, the first step is to ensure they're using it. For that reason, GPL seems like a step in the wrong direction to me. - Matt
Re: [dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC
Hi Qiming, That's fantastic news. Thank you very much for taking the time to figure the issue out. Would it be possible to backport the fix to the 16.11 LTS release? This kind of problem seems tailor-made for LTS. -- Matt Laswell lasw...@infinite.io On Tue, Jul 18, 2017 at 3:58 AM, Yang, Qiming wrote: > Hi Matt, > > We can reproduce this RSS issue on 16.04 but can't on 17.02, so this issue > was fixed on 17.02. > We suggest using the new version. > > Qiming > > -Original Message- > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Matt Laswell > > Sent: Friday, May 5, 2017 9:05 PM > > To: dev@dpdk.org > > Subject: Re: [dpdk-dev] Occasional instability in RSS Hashes/Queues from > X540 > > NIC > > > > On Thu, May 4, 2017 at 1:15 PM, Matt Laswell > wrote: > > > > > Hey Keith, > > > > > > Here is a hexdump of a subset of one of my packet captures. In this > > > capture, all of the packets are part of the same TCP connection, which > > > happens to be NFSv3 traffic. All of them except packet number 6 get > > > the correct RSS hash and go to the right queue. Packet number 6 (an > NFS > > rename > > > reply with an NFS error) gets RSS hash 0 and goes to queue 0. > Whenever I > > > repeat this test, the reply to this particular rename attempt always > > > goes to the wrong core, though it seemingly differs from the rest of > > > the flow only in layers 4-7. > > > > > > I'll also attach a pcap to this email, in case that's a more > > > convenient way to interact with the packets. 
> > > > > > -- > > > Matt Laswell > > > lasw...@infinite.io > > > > > > > > > 16:08:37.093306 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags > > > [P.], seq 3173509264:3173509380, ack 3244259549, win 580, options > > > [nop,nop,TS val > > > 23060466 ecr 490971270], length 116: NFS request xid 2690728524 112 > > > access fh > > > > > Unknown/8B6BFEBB0400CFABD1030100DABC0502 > > 01 > > > 00 NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ > > > ACCESS_EXTEND|NFS_ACCESS_DELETE > > > 0x: 4500 00a8 6d0f 4000 4006 b121 0a97 0351 E...m.@.@..!...Q > > > 0x0010: 0a97 03a1 029b 0801 bd27 e890 c15f 78dd .'..._x. > > > 0x0020: 8018 0244 1cba 0101 080a 015f dff2 ...D._.. > > > 0x0030: 1d43 a086 8000 0070 a061 424c .C.p.aBL > > > 0x0040: 0002 0001 86a3 0003 0004 > > > 0x0050: 0001 0020 0107 8d2f 0007 .../ > > > 0x0060: 6573 7869 3275 3100 esxi2u1. > > > 0x0070: 0001 > > > 0x0080: 0020 8b6b febb 0400 cfab d103 .k.. > > > 0x0090: 0100 dabc 0502 > > > 0x00a0: 0100 001f > > > 16:08:37.095837 IP 10.151.3.161.nfsd > 10.151.3.81.disclose: Flags > > > [P.], seq 1:125, ack 116, win 28688, options [nop,nop,TS val 490971270 > > > ecr 23060466], length 124: NFS reply xid 2690728524 reply ok 120 > > > access c 001f > > > 0x: 4500 00b0 1b80 4000 4006 02a9 0a97 03a1 E.@.@... > > > 0x0010: 0a97 0351 0801 029b c15f 78dd bd27 e904 ...Q._x..'.. > > > 0x0020: 8018 7010 a61a 0101 080a 1d43 a086 ..p..C.. > > > 0x0030: 015f dff2 8000 0078 a061 424c 0001 ._.x.aBL > > > 0x0040: > > > 0x0050: 0001 0002 01ed > > > 0x0060: 0003 > > > 0x0070: 0029 0800 00ff ...) > > > 0x0080: 00ff bbfe 6b8b 0001 ..k. > > > 0x0090: 03d1 abcf 5908 f554 3272 e4e6 5908 f554 Y..T2r..Y..T > > > 0x00a0: 3272 e4e6 5908 f554 3365 2612 001f 2r..Y..T3e&. 
> > > 16:08:37.096235 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags > > > [P.], seq 256:372, ack 285, win 589, options [nop,nop,TS val 23060467 > > > ecr 490971270], length 116: NFS request xid 2724282956 112 access fh > > > Unknown/ > > > > > 8B6BFEBB0400D0ABD1030100DABC05020100 > > > NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ > > > ACCESS_EXTEND|NFS_ACCESS_DELETE > &g
[dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC
Hey Folks, I'm seeing some strange behavior with regard to the RSS hash values in my application and was hoping somebody might have some pointers on where to look. In my application, I'm using RSS to divide work among multiple cores, each of which services a single RX queue. When dealing with a single long-lived TCP connection, I occasionally see packets going to the wrong core. That is, almost all of the packets in the connection go to core 5 in this case, but every once in a while, one goes to core 0 instead. Upon further investigation, I find two problems are occurring. The first is that problem packets have the RSS hash value in the mbuf incorrectly set to zero. They are therefore put in queue zero, where they are read by core zero. Other packets from the same connection that occur immediately before and after the packet in question have the correct hash value and therefore go to a different core. The second problem is that we sometimes see packets in which the RSS hash in the mbuf appears correct, but the packets are incorrectly put into queue zero. As with the first, this results in the wrong core getting the packet. Either one of these confuses the state tracking we're doing per-core. A few details:

- Using an Intel X540-AT2 NIC and the igb_uio driver
- DPDK 16.04
- A particular packet in our workflow always encounters this problem
- Retransmissions of the packet in question also encounter the problem
- The packet is IPv4, with a header length of 20 (so no options), no fragmentation
- The only differences I can see in the IP header between packets that get the right hash value and those that get the wrong one are in the IP ID, total length, and checksum fields
- Using ETH_RSS_IPV4
- The packet is TCP with about 100 bytes of payload - it's not a jumbo or a runt
- We fill the key in with 0x6d5a to get symmetric hashing of both sides of the connection
- We only configure RSS information at boot; things like the key or header fields are not being changed dynamically
- Traffic load is light when the problem occurs

Is anybody aware of an erratum, either in the NIC or the PMD's configuration of it, that might explain something like this? Failing that, if you ran into this sort of behavior, how would you approach finding the reason for the error? Every failure mode I can think of would tend to affect all of the packets in the connection consistently, even if incorrectly. Thanks in advance for any ideas. -- Matt Laswell lasw...@infinite.io
Re: [dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC
Hey Keith, Here is a hexdump of a subset of one of my packet captures. In this capture, all of the packets are part of the same TCP connection, which happens to be NFSv3 traffic. All of them except packet number 6 get the correct RSS hash and go to the right queue. Packet number 6 (an NFS rename reply with an NFS error) gets RSS hash 0 and goes to queue 0. Whenever I repeat this test, the reply to this particular rename attempt always goes to the wrong core, though it seemingly differs from the rest of the flow only in layers 4-7. I'll also attach a pcap to this email, in case that's a more convenient way to interact with the packets. -- Matt Laswell lasw...@infinite.io 16:08:37.093306 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags [P.], seq 3173509264:3173509380, ack 3244259549, win 580, options [nop,nop,TS val 23060466 ecr 490971270], length 116: NFS request xid 2690728524 112 access fh Unknown/8B6BFEBB0400CFABD1030100DABC05020100 NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ACCESS_EXTEND|NFS_ACCESS_DELETE 0x: 4500 00a8 6d0f 4000 4006 b121 0a97 0351 E...m.@.@..!...Q 0x0010: 0a97 03a1 029b 0801 bd27 e890 c15f 78dd .'..._x. 0x0020: 8018 0244 1cba 0101 080a 015f dff2 ...D._.. 0x0030: 1d43 a086 8000 0070 a061 424c .C.p.aBL 0x0040: 0002 0001 86a3 0003 0004 0x0050: 0001 0020 0107 8d2f 0007 .../ 0x0060: 6573 7869 3275 3100 esxi2u1. 0x0070: 0001 0x0080: 0020 8b6b febb 0400 cfab d103 .k.. 0x0090: 0100 dabc 0502 0x00a0: 0100 001f 16:08:37.095837 IP 10.151.3.161.nfsd > 10.151.3.81.disclose: Flags [P.], seq 1:125, ack 116, win 28688, options [nop,nop,TS val 490971270 ecr 23060466], length 124: NFS reply xid 2690728524 reply ok 120 access c 001f 0x: 4500 00b0 1b80 4000 4006 02a9 0a97 03a1 E.@.@... 0x0010: 0a97 0351 0801 029b c15f 78dd bd27 e904 ...Q._x..'.. 0x0020: 8018 7010 a61a 0101 080a 1d43 a086 ..p..C.. 0x0030: 015f dff2 8000 0078 a061 424c 0001 ._.x.aBL 0x0040: 0x0050: 0001 0002 01ed 0x0060: 0003 0x0070: 0029 0800 00ff ...) 
0x0080: 00ff bbfe 6b8b 0001 ..k. 0x0090: 03d1 abcf 5908 f554 3272 e4e6 5908 f554 Y..T2r..Y..T 0x00a0: 3272 e4e6 5908 f554 3365 2612 001f 2r..Y..T3e&. 16:08:37.096235 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags [P.], seq 256:372, ack 285, win 589, options [nop,nop,TS val 23060467 ecr 490971270], length 116: NFS request xid 2724282956 112 access fh Unknown/8B6BFEBB0400D0ABD1030100DABC05020100 NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ACCESS_EXTEND|NFS_ACCESS_DELETE 0x: 4500 00a8 6d11 4000 4006 b11f 0a97 0351 E...m.@.@..Q 0x0010: 0a97 03a1 029b 0801 bd27 e990 c15f 79f9 .'..._y. 0x0020: 8018 024d 1cba 0101 080a 015f dff3 ...M._.. 0x0030: 1d43 a086 8000 0070 a261 424c .C.p.aBL 0x0040: 0002 0001 86a3 0003 0004 0x0050: 0001 0020 0107 8d2f 0007 .../ 0x0060: 6573 7869 3275 3100 esxi2u1. 0x0070: 0001 0x0080: 0020 8b6b febb 0400 d0ab d103 .k.. 0x0090: 0100 dabc 0502 0x00a0: 0100 001f 16:08:37.098361 IP 10.151.3.161.nfsd > 10.151.3.81.disclose: Flags [P.], seq 285:409, ack 372, win 28688, options [nop,nop,TS val 490971270 ecr 23060467], length 124: NFS reply xid 2724282956 reply ok 120 access c 001f 0x: 4500 00b0 1b81 4000 4006 02a8 0a97 03a1 E.@.@... 0x0010: 0a97 0351 0801 029b c15f 79f9 bd27 ea04 ...Q._y..'.. 0x0020: 8018 7010 ec45 0101 080a 1d43 a086 ..p..E...C.. 0x0030: 015f dff3 8000 0078 a261 424c 0001 ._.x.aBL 0x0040: 0x0050: 0001 0002 01ed 0x0060: 0004 0x0070: 0050 0800 00ff ...P 0x0080: 00ff bbfe 6b8b 0001 ..k. 0x0090: 03d1 abd0 5908 f554 3536 88ea 5908 f554 Y..T56..Y..T 0x00a0: 3536 88ea 5908 f555 01ff bf76 001f 56..Y..U...v 16:08:37.099013 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags [P.], seq 652:856, ack 813, win 605, options [nop,nop,TS val 230
Re: [dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC
On Thu, May 4, 2017 at 1:15 PM, Matt Laswell wrote: > Hey Keith, > > Here is a hexdump of a subset of one of my packet captures. In this > capture, all of the packets are part of the same TCP connection, which > happens to be NFSv3 traffic. All of them except packet number 6 get the > correct RSS hash and go to the right queue. Packet number 6 (an NFS rename > reply with an NFS error) gets RSS hash 0 and goes to queue 0. Whenever I > repeat this test, the reply to this particular rename attempt always goes > to the wrong core, though it seemingly differs from the rest of the flow > only in layers 4-7. > > I'll also attach a pcap to this email, in case that's a more convenient > way to interact with the packets. > > -- > Matt Laswell > lasw...@infinite.io > > > 16:08:37.093306 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags [P.], > seq 3173509264:3173509380, ack 3244259549, win 580, options [nop,nop,TS val > 23060466 ecr 490971270], length 116: NFS request xid 2690728524 112 access > fh Unknown/8B6BFEBB0400CFABD1030100DABC05020100 > NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ > ACCESS_EXTEND|NFS_ACCESS_DELETE > 0x: 4500 00a8 6d0f 4000 4006 b121 0a97 0351 E...m.@.@..!...Q > 0x0010: 0a97 03a1 029b 0801 bd27 e890 c15f 78dd .'..._x. > 0x0020: 8018 0244 1cba 0101 080a 015f dff2 ...D._.. > 0x0030: 1d43 a086 8000 0070 a061 424c .C.p.aBL > 0x0040: 0002 0001 86a3 0003 0004 > 0x0050: 0001 0020 0107 8d2f 0007 .../ > 0x0060: 6573 7869 3275 3100 esxi2u1. > 0x0070: 0001 > 0x0080: 0020 8b6b febb 0400 cfab d103 .k.. > 0x0090: 0100 dabc 0502 > 0x00a0: 0100 001f > 16:08:37.095837 IP 10.151.3.161.nfsd > 10.151.3.81.disclose: Flags [P.], > seq 1:125, ack 116, win 28688, options [nop,nop,TS val 490971270 ecr > 23060466], length 124: NFS reply xid 2690728524 reply ok 120 access c 001f > 0x: 4500 00b0 1b80 4000 4006 02a9 0a97 03a1 E.@.@... > 0x0010: 0a97 0351 0801 029b c15f 78dd bd27 e904 ...Q._x..'.. > 0x0020: 8018 7010 a61a 0101 080a 1d43 a086 ..p..C.. 
> 0x0030: 015f dff2 8000 0078 a061 424c 0001 ._.x.aBL > 0x0040: > 0x0050: 0001 0002 01ed > 0x0060: 0003 > 0x0070: 0029 0800 00ff ...) > 0x0080: 00ff bbfe 6b8b 0001 ..k. > 0x0090: 03d1 abcf 5908 f554 3272 e4e6 5908 f554 Y..T2r..Y..T > 0x00a0: 3272 e4e6 5908 f554 3365 2612 001f 2r..Y..T3e&. > 16:08:37.096235 IP 10.151.3.81.disclose > 10.151.3.161.nfsd: Flags [P.], > seq 256:372, ack 285, win 589, options [nop,nop,TS val 23060467 ecr > 490971270], length 116: NFS request xid 2724282956 112 access fh Unknown/ > 8B6BFEBB0400D0ABD1030100DABC05020100 > NFS_ACCESS_READ|NFS_ACCESS_LOOKUP|NFS_ACCESS_MODIFY|NFS_ > ACCESS_EXTEND|NFS_ACCESS_DELETE > 0x: 4500 00a8 6d11 4000 4006 b11f 0a97 0351 E...m.@.@..Q > 0x0010: 0a97 03a1 029b 0801 bd27 e990 c15f 79f9 .'..._y. > 0x0020: 8018 024d 1cba 0101 080a 015f dff3 ...M._.. > 0x0030: 1d43 a086 8000 0070 a261 424c .C.p.aBL > 0x0040: 0002 0001 86a3 0003 0004 > 0x0050: 0001 0020 0107 8d2f 0007 .../ > 0x0060: 6573 7869 3275 3100 esxi2u1. > 0x0070: 0001 > 0x0080: 0020 8b6b febb 0400 d0ab d103 .k.. > 0x0090: 0100 dabc 0502 > 0x00a0: 0100 001f > 16:08:37.098361 IP 10.151.3.161.nfsd > 10.151.3.81.disclose: Flags [P.], > seq 285:409, ack 372, win 28688, options [nop,nop,TS val 490971270 ecr > 23060467], length 124: NFS reply xid 2724282956 reply ok 120 access c 001f > 0x: 4500 00b0 1b81 4000 4006 02a8 0a97 03a1 E.@.@... > 0x0010: 0a97 0351 0801 029b c15f 79f9 bd27 ea04 ...Q._y..'.. > 0x0020: 8018 7010 ec45 0101 080a 1d43 a086 ..p..E...C.. > 0x0030: 015f dff3 8000 0078 a261 424c 0001 ._.x.aBL > 0x0040: > 0x0050: 0001 0002 01ed > 0x0060: 0004 000
[dpdk-dev] backtracing from within the code
I've done something similar to what's described in the link below. But it's worth pointing out that it's using printf() inside a signal handler, which isn't safe. If your use case is catching SIGSEGV, for example, solutions built on printf() will usually work, but can deadlock. One way around the problem is to call write() directly, passing it stdout's file handle. For example, I have this in my code: #define WRITE_STRING(fd, s) write (fd, s, strlen (s)) In my signal handlers, I use the above like this: WRITE_STRING(STDOUT_FILENO, "Stack trace:\n"); This approach is a little bit more cumbersome to code, but safer. The last time that I looked the DPDK rte_dump_stack() is using vfprintf(), which isn't safe in a signal handler. However, it's been several DPDK releases since I peeked at the details. -- Matt Laswell Principal Software Engineer infinite io, inc. laswell at infinite.io On Sat, Jun 25, 2016 at 9:07 AM, Rosen, Rami wrote: > Hi, > If you are willing to skip static methods and use the GCC backtrace, you > can > try this example (it worked for me, but it was quite a time ago): > http://www.helicontech.co.il/?id=linuxbt > > Regards, > Rami Rosen > Intel Corporation > > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger > Sent: Friday, June 24, 2016 8:46 PM > To: Thomas Monjalon > Cc: Catalin Vasile ; dev at dpdk.org; Dumitrescu, > Cristian > Subject: Re: [dpdk-dev] backtracing from within the code > > On Fri, 24 Jun 2016 12:05:26 +0200 > Thomas Monjalon wrote: > > > 2016-06-24 09:25, Dumitrescu, Cristian: > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Catalin Vasile > > > > I'm trying to add a feature to DPDK and I'm having a hard time > printing a > > > > backtrace. > > > > I tried using this[1] functions for printing, but it does not print > more than one > > > > function. Maybe it lacks the symbols it needs. > > [...] 
> > > It eventually calls rte_dump_stack() in file > lib/lirte_eal/linuxapp/eal/eal_debug.c, which calls backtrace(), which is > probably what you are looking for. > > > > Example: > > 5: [build/app/testpmd(_start+0x29) [0x416f69]] > > 4: [/usr/lib/libc.so.6(__libc_start_main+0xf0) [0x7eff3b757610]] > > 3: [build/app/testpmd(main+0x2ff) [0x416b3f]] > > 2: [build/app/testpmd(init_port_config+0x88) [0x419a78]] > > 1: [build/lib/librte_eal.so.2.1(rte_dump_stack+0x18) [0x7eff3c126488]] > > > > Please tell us if you have some cases where rte_dump_stack() does not > work. > > I do not remember what are the constraints to have it working. > > Your binary is not stripped? > > The GCC backtrace doesn't work well because it can't find static functions. > I ended up using libunwind to get a better back trace. >
[dpdk-dev] Appropriate DPDK data structures for TCP sockets
Hey Matthew, I've mostly worked on stackless systems over the last few years, but I have done a fair bit of work on high performance, highly scalable connection tracking data structures. In that spirit, here are a few counterintuitive insights I've gained over the years. Perhaps they'll be useful to you. Apologies in advance for likely being a bit long-winded. First, you really need to take cache performance into account when you're choosing a data structure. Something like a balanced tree can seem awfully appealing at first blush, either on its own or as a chaining mechanism for a hash table. But the problem with trees is that there really isn't much locality of reference in your memory use - every single step in your descent ends up being a cache miss. This hurts you twice: once that you end up stalled waiting for the next node in the tree to load from main memory, and again when you have to reload whatever you pushed out of cache to get it. It's often better if, instead of a tree, you do linear search across arrays of hash values. It's easy to size the array so that it is exactly one cache line long, and you can generally do linear search of the whole thing in less time than it takes to do a single cache line fill. If you find a match, you can do full verification against the full tuple as needed. Second, rather than synchronizing (perhaps with locks, perhaps with lockless data structures), it's often beneficial to create multiple threads, each of which holds a fraction of your connection tracking data. Every connection belongs to a single one of these threads, selected perhaps by hash or RSS value, and all packets from the connection go through that single thread. This approach has a couple of advantages. First, obviously, no slowdowns for synchronization. But, second, I've found that when you are spreading packets from a single connection across many compute elements, you're inevitably going to start putting packets out of order. 
In many applications, this ultimately leads to some additional processing to put things back in order, which gives away the performance gains you achieved. Of course, this approach brings its own set of complexities, and challenges for your application, and doesn't always spread the work as efficiently across all of your cores. But it might be worth considering. Third, it's very worthwhile to have a cache for the most recently accessed connection. First, because network traffic is bursty, and you'll frequently see multiple packets from the same connection in succession. Second, because it can make life easier for your application code. If you have multiple places that need to access connection data, you don't have to worry so much about the cost of repeated searches. Again, this may or may not matter for your particular application. But for ones I've worked on, it's been a win. Anyway, as predicted, this post has gone far too long for a Monday morning. Regardless, I hope you found it useful. Let me know if you have questions or comments. -- Matt Laswell infinite io, inc. laswell at infiniteio.com On Sun, Feb 22, 2015 at 10:50 PM, Matthew Hall wrote: > > On Feb 22, 2015, at 4:02 PM, Stephen Hemminger > wrote: > > Use userspace RCU? or BSD RB_TREE > > Thanks Stephen, > > I think the RB_TREE stuff is single threaded mostly. > > But user-space RCU looks quite good indeed, I didn't know somebody ported > it out of the kernel. I'll check it out. > > Matthew.
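As a concrete illustration of the "linear search across arrays of hash values" idea above, here is a toy bucket layout sized to exactly one 64-byte cache line. The structure and function names are invented for this sketch; a real connection tracker would pair each slot with a pointer to the full 5-tuple, which the caller verifies after a candidate hit:

```c
#include <stdint.h>

/* One bucket = exactly one 64-byte cache line: sixteen 32-bit hash
 * values scanned linearly.  A hit here is only a candidate; the caller
 * still verifies the full 5-tuple stored elsewhere. */
#define BUCKET_SLOTS 16
#define EMPTY_SLOT   0u   /* assumes 0 is never a valid hash value */

struct conn_bucket {
    uint32_t hash[BUCKET_SLOTS];
} __attribute__((aligned(64)));

/* Linear scan of the whole line; typically cheaper than one extra
 * cache-line fill from chasing a tree pointer. */
static int bucket_find(const struct conn_bucket *b, uint32_t h)
{
    for (int i = 0; i < BUCKET_SLOTS; i++)
        if (b->hash[i] == h)
            return i;          /* candidate slot; verify full tuple */
    return -1;
}

static int bucket_insert(struct conn_bucket *b, uint32_t h)
{
    for (int i = 0; i < BUCKET_SLOTS; i++)
        if (b->hash[i] == EMPTY_SLOT) {
            b->hash[i] = h;
            return i;
        }
    return -1;                 /* bucket full */
}
```

The point of the layout is that the search touches one cache line no matter where in the bucket the match lands, which is exactly the locality argument made in the post.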
[dpdk-dev] Question about link up/down events and transmit queues
Hey Folks, I'm running into an issue that I hope is obvious and simple. We're running DPDK 1.6.2 with an 82599 NIC. We find that if, while running traffic, we disconnect a port and then later reconnect it, we never regain the ability to transmit packets out of that port after it comes back up. Specifically, our calls to rte_eth_tx_burst() get return values that indicate that no packets could be sent. Is there an additional step that we have to do on link down/up operations, perhaps to tell the NIC to flush its descriptor ring? Thanks in advance for your help. -- Matt Laswell *infinite io, inc.* laswell at infiniteio.com
[dpdk-dev] Question about link up/down events and transmit queues
Just a bit more on this. We've found that when a link goes down, the TX descriptor ring appears to fill up with packets fairly quickly, and then calls to rte_eth_tx_burst() start returning zero. Our application handles this case, and frees the mbufs that could not be sent. However, when link is reestablished, the TX descriptor ring appears to stay full. Hence, subsequent calls to rte_eth_tx_burst() continue to return zero, and we continue to free the mbufs without sending them. Frankly, this was surprising, as I had assumed that the TX descriptor ring would be emptied when the link came back up, either by sending the enqueued packets, or by reinitializing. I've tried calling rte_eth_dev_start() and rte_eth_promiscuous_enable() in order to restart everything. That appears to work, at least on the combination of drivers that I tested with. Can somebody please tell me whether this is the preferred way to recover from link down? Thanks, -- Matt Laswell *infinite io, inc.* laswell at infiniteio.com On Tue, Mar 10, 2015 at 10:47 AM, Matt Laswell wrote: > Hey Folks, > > I'm running into an issue that I hope is obvious and simple. We're > running DPDK 1.6.2 with an 82599 NIC. We find that if, while running > traffic, we disconnect a port and then later reconnect it, we never regain > the ability to transmit packets out of that port after it comes back up. > Specifically, our calls to rte_eth_tx_burst() get return values that > indicate that no packets could be sent. > > Is there an additional step that we have to do on link down/up operations, > perhaps to tell the NIC to flush its descriptor ring? > > Thanks in advance for your help. > > -- > Matt Laswell > *infinite io, inc.* > laswell at infiniteio.com >
[dpdk-dev] pktgen rx errors with intel 82599
Hi, I've been using DPDK pktgen 2.8.0 (built against DPDK 1.8.0 libraries) to send traffic on a server using an Intel 82599 (X520-2). Traffic gets sent out port 1 through another server which also has an Intel 82599 installed and is forwarded back into port 0. When I send using a single source and destination IP address, this works fine and packets arrive on port 0 at close to the maximum line rate. If I change port 1 to range mode and send traffic from a range of source IP addresses to a single destination IP address, for a second or two the display indicates that some packets were received on port 0 but then the rate of received packets on the display goes to 0 and all incoming packets on port 0 are registered as rx errors. The server that traffic is being forwarded through is running the ip_pipeline example app. I ruled this out as the source of the problem by sending directly from port 1 to port 0 of the pktgen box. The issue still occurs when the traffic is not being forwarded through the other box. Since ip_pipeline is able to receive the packets and forward them without getting rx errors and it's running with the same model of NIC as pktgen is using, I checked to see if there were any differences in initialization of the rx port between ip_pipeline and pktgen. I noticed that pktgen has a setting that ip_pipeline doesn't: const struct rte_eth_conf port_conf = { .rxmode = { .mq_mode = ETH_MQ_RX_RSS, If I comment out the .mq_mode setting and rebuild pktgen, the problem no longer occurs and I now receive packets on port 0 at near line rate when testing from a range of source addresses. I recall reading in the past that if a receive queue fills up on an 82599, that receiving stalls for all of the other queues and no more packets can be received. Could that be happening with pktgen? Is there any debugging I can do to help track it down? The command line I have been launching pktgen with is: pktgen -c f -n 3 -m 512 -- -p 0x3 -P -m 1.0,2.1 Thanks, -Matt Smith
[dpdk-dev] Symmetric RSS Hashing, Part 2
Hey Folks, I have essentially the same question as Matthew. Has there been progress in this area? -- Matt Laswell infinite io, inc. laswell at infiniteio.com On Sat, Mar 14, 2015 at 3:47 PM, Matthew Hall wrote: > A few months ago we had this thread about symmetric hashing of TCP in RSS: > > http://dpdk.org/ml/archives/dev/2014-December/010148.html > > I was wondering if we ever did figure out how to get the 0x6d5a hash key > mentioned in there to work, or another alternative one. > > Thanks, > Matthew.
[dpdk-dev] pktgen rx errors with intel 82599
> On Mar 14, 2015, at 1:33 PM, Wiles, Keith wrote: > > Hi Matt, > > On 3/14/15, 8:47 AM, "Wiles, Keith" wrote: > >> Hi Matt >> >> On 3/13/15, 3:49 PM, "Matt Smith" wrote: >> >>> >>> Hi, >>> >>> I've been using DPDK pktgen 2.8.0 (built against DPDK 1.8.0 libraries) to >>> send traffic on a server using an Intel 82599 (X520-2). Traffic gets sent >>> out port 1 through another server which also has an Intel 82599 installed and >>> is forwarded back into port 0. When I send using a single source and >>> destination IP address, this works fine and packets arrive on port 0 at >>> close to the maximum line rate. >>> >>> If I change port 1 to range mode and send traffic from a range of source >>> IP addresses to a single destination IP address, for a second or two the >>> display indicates that some packets were received on port 0 but then the >>> rate of received packets on the display goes to 0 and all incoming >>> packets on port 0 are registered as rx errors. >>> >>> The server that traffic is being forwarded through is running the >>> ip_pipeline example app. I ruled this out as the source of the problem by >>> sending directly from port 1 to port 0 of the pktgen box. The issue still >>> occurs when the traffic is not being forwarded through the other box. >>> Since ip_pipeline is able to receive the packets and forward them without >>> getting rx errors and it's running with the same model of NIC as pktgen >>> is using, I checked to see if there were any differences in >>> initialization of the rx port between ip_pipeline and pktgen. I noticed >>> that pktgen has a setting that ip_pipeline doesn't: >>> >>> const struct rte_eth_conf port_conf = { >>> .rxmode = { >>> .mq_mode = ETH_MQ_RX_RSS, >>> >>> If I comment out the .mq_mode setting and rebuild pktgen, the problem no >>> longer occurs and I now receive packets on port 0 at near line rate when >>> testing from a range of source addresses.
>>> >>> I recall reading in the past that if a receive queue fills up on an 82599 >>> , that receiving stalls for all of the other queues and no more packets >>> can be received. Could that be happening with pktgen? Is there any >>> debugging I can do to help track it down? >> >> I have seen this problem on some platforms a few times and it looks like >> you may have found a possible solution to the problem. I will have to look >> into the change and see if this is the problem, but it does seem to >> suggest this may be the issue. When the port gets into this state the port >> receives the number of mbufs matching the number of descriptors and the rest >> are 'missed' frames at the wire. The RX counter is the number of missed >> frames. >> >> Thanks for the input >> ++Keith > > I added code to hopefully setup the correct RX/TX conf values. The HEAD of > the Pktgen-DPDK v2.8.4 should build and work with DPDK 1.8.0 or 2.0.0-rc1. > I did still see some RX errors and reduced bit rate, but the traffic does > not stop on my machine. Please give version 2.8.4 a try and let me know if > you still see problems. > > Regards, > ++Keith Hi Keith, Sorry for the delay in responding, I have been out of town. Thanks for your attention to the problem. I pulled the latest code from git and moved to the pktgen-2.8.4 tag. I had one issue building: CC pktgen-port-cfg.o /root/dpdk/pktgen-dpdk/app/pktgen-port-cfg.c: In function 'pktgen_config_ports': /root/dpdk/pktgen-dpdk/app/pktgen-port-cfg.c:300:11: error: variable 'k' set but not used [-Werror=unused-but-set-variable] uint64_t k; ^ cc1: all warnings being treated as errors make[2]: *** [pktgen-port-cfg.o] Error 1 make[1]: *** [all] Error 2 make: *** [app] Error 2 I prepended '__attribute__((unused))' to the declaration of k and then I was able to build successfully. I did not see any receive errors running the updated binary. So once I got past the initial build problem, the issue seems to be resolved. Thanks, -Matt
[dpdk-dev] Symmetric RSS Hashing, Part 2
That's really encouraging. Thanks! One thing I'll note is that if my reading of the original paper is accurate, the 0x6d5a value isn't there in order to cause symmetry - other repeated 16 bit values will do that, as you've seen. What the 0x6d5a value gets you is symmetry while preserving RSS's effectiveness at load spreading with typical traffic data. Not all 16 bit values will do this. -- Matt Laswell infinite io, inc. laswell at infiniteio.com On Mon, Mar 30, 2015 at 10:00 AM, Vladimir Medvedkin wrote: > Matthew, > > I don't use any special tricks to make symmetric RSS work. Furthermore, it > works not only with 0x6d5a. > > Regards, > Vladimir > > 2015-03-28 23:11 GMT+03:00 Matthew Hall : > > > On Sat, Mar 28, 2015 at 12:10:20PM +0300, Vladimir Medvedkin wrote: > > > I just verify RSS symmetric in my code, all works great. > > > ... > > > By the way, maybe it will be usefull to add softrss function in DPDK? > > > > Vladimir, > > > > All of this is super-awesome code. I agree having SW RSS would be quite > > nice. > > Then you could more easily support things like virtio-net and other stuff > > which doesn't have RSS. > > > > Did you have to use any special tricks to get the 0x6d5a to work? I > wasn't > > quite > > sure how to initialize that and get it to run right. > > > > Matthew. > > >
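The symmetry claim can be checked with a small software model of the Toeplitz hash that RSS uses (soft_rss below is a made-up name for this sketch, not a DPDK API). The model XORs in the 32-bit key window starting at each set input bit; because 0x6d5a repeats with a 16-bit period, swapping fields whose offsets differ by a multiple of 16 bits (the two IPv4 addresses, and the two ports) leaves the hash unchanged:

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of the RSS Toeplitz hash: for every set bit of the
 * input, XOR in the 32-bit window of the key that starts at that bit
 * position.  The key must be at least len + 4 bytes long. */
static uint32_t soft_rss(const uint8_t *data, size_t len, const uint8_t *key)
{
    uint32_t hash = 0;

    for (size_t i = 0; i < len * 8; i++) {
        if (!(data[i / 8] & (0x80u >> (i % 8))))
            continue;
        /* Extract key bits [i, i + 32) as the window for this bit. */
        uint32_t window = 0;
        for (size_t k = i; k < i + 32; k++)
            window = (window << 1) | ((key[k / 8] >> (7 - k % 8)) & 1u);
        hash ^= window;
    }
    return hash;
}
```

Feeding this a 12-byte IPv4/TCP tuple (src IP, dst IP, src port, dst port) with a 40-byte key of repeated 0x6d, 0x5a gives identical hashes for both directions of a connection, which matches Matt's observation that any repeated 16-bit word yields symmetry; 0x6d5a specifically was chosen for its load-spreading quality.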
[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur wrote: > This patch replaces memcmp in librte_hash with rte_memcmp which is > implemented with AVX/SSE instructions. > > +static inline int > +rte_memcmp(const void *_src_1, const void *_src_2, size_t n) > +{ > + const uint8_t *src_1 = (const uint8_t *)_src_1; > + const uint8_t *src_2 = (const uint8_t *)_src_2; > + int ret = 0; > + > + if (n & 0x80) > + return rte_cmp128(src_1, src_2); > + > + if (n & 0x40) > + return rte_cmp64(src_1, src_2); > + > + if (n & 0x20) { > + ret = rte_cmp32(src_1, src_2); > + n -= 0x20; > + src_1 += 0x20; > + src_2 += 0x20; > + } > > Pardon me for butting in, but this seems incorrect for the first two cases listed above, as the function as written will only compare the first 128 or 64 bytes of each source and return the result. The pattern expressed in the 32 byte case appears more correct, as it compares the first 32 bytes and then lets later pieces of the function handle the smaller remaining bits of the sources. Also, if this function is to handle arbitrarily large source data, the 128 byte case needs to be in a loop. What am I missing? -- Matt Laswell infinite io, inc. laswell at infiniteio.com
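The objection raised in the review can be made concrete with a scalar model of the same dispatch, written the way the review suggests: the wide case loops, every branch advances the cursors, and the remainder is always examined. In this sketch, cmp_chunk() is a plain-memcmp stand-in for the SSE/AVX helpers in the patch, and the names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for the vector helpers: compare one fixed-size chunk. */
static int cmp_chunk(const uint8_t *a, const uint8_t *b, size_t n)
{
    return memcmp(a, b, n);
}

/* Size-dispatch compare with the fix applied: large chunks are consumed
 * in a loop, and every branch advances the cursors so the remainder is
 * never silently dropped. */
static int chunked_memcmp(const void *_a, const void *_b, size_t n)
{
    const uint8_t *a = _a, *b = _b;
    int ret;

    while (n >= 128) {              /* the 128-byte case must loop */
        if ((ret = cmp_chunk(a, b, 128)) != 0)
            return ret;
        a += 128; b += 128; n -= 128;
    }
    for (size_t sz = 64; sz >= 1; sz /= 2) {
        if (n & sz) {               /* binary decomposition of the tail */
            if ((ret = cmp_chunk(a, b, sz)) != 0)
                return ret;
            a += sz; b += sz; n -= sz;
        }
    }
    return 0;
}
```

The contrast with the patch is the early `return rte_cmp128(...)` / `return rte_cmp64(...)`, which compare only the first chunk of each source and ignore any remaining bytes.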
[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
On Fri, May 8, 2015 at 5:54 PM, Ravi Kerur wrote: > > > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell > wrote: > >> >> >> On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur wrote: >> >>> This patch replaces memcmp in librte_hash with rte_memcmp which is >>> implemented with AVX/SSE instructions. >>> >>> +static inline int >>> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n) >>> +{ >>> + const uint8_t *src_1 = (const uint8_t *)_src_1; >>> + const uint8_t *src_2 = (const uint8_t *)_src_2; >>> + int ret = 0; >>> + >>> + if (n & 0x80) >>> + return rte_cmp128(src_1, src_2); >>> + >>> + if (n & 0x40) >>> + return rte_cmp64(src_1, src_2); >>> + >>> + if (n & 0x20) { >>> + ret = rte_cmp32(src_1, src_2); >>> + n -= 0x20; >>> + src_1 += 0x20; >>> + src_2 += 0x20; >>> + } >>> >>> >> Pardon me for butting in, but this seems incorrect for the first two >> cases listed above, as the function as written will only compare the first >> 128 or 64 bytes of each source and return the result. The pattern >> expressed in the 32 byte case appears more correct, as it compares the >> first 32 bytes and then lets later pieces of the function handle the >> smaller remaining bits of the sources. Also, if this function is to handle >> arbitrarily large source data, the 128 byte case needs to be in a loop. >> >> What am I missing? >> > > Current max hash key length supported is 64 bytes, hence no comparison is > done after 64 bytes. 128 bytes comparison is added to measure performance > only and there is no use-case as of now. With the current use-cases its not > required but if there is a need to handle large arbitrary data upto 128 > bytes it can be modified. > Ah, gotcha. I misunderstood and thought that this was meant to be a generic AVX/SSE enabled memcmp() replacement, and that the use of it in rte_hash was meant merely as a test case. If it's more limited than that, carry on, though you might want to make a note of it in the documentation. 
I suspect others will misinterpret the name as I did. -- Matt Laswell infinite io, inc. laswell at infiniteio.com
[dpdk-dev] Load-balancing position field in DPDK load_balancer sample app vs. Hash table
Hey Folks, This thread has been tremendously helpful, as I'm looking at adding RSS-based load balancing to my application in the not too distant future. Many thanks to all who have contributed, especially regarding symmetric RSS. Not to derail the conversation too badly, but could one of you point me to some example code that demonstrates the steps needed to configure RSS? We're using Niantic NICs, so I assume that this is pretty standard stuff, but having an example to study is a real leg up. Again, thanks for all of the information. -- Matt Laswell laswell at infiniteio.com infinite io, inc. On Fri, Nov 14, 2014 at 10:57 AM, Chilikin, Andrey < andrey.chilikin at intel.com> wrote: > Fortville supports symmetrical hashing on HW level, a patch for i40e PMD > was submitted a couple of weeks ago. For Niantic you can use symmetrical > rss key recommended by Konstantin. > > Regards, > Andrey > > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin > Sent: Friday, November 14, 2014 4:50 PM > To: Yerden Zhumabekov; Kamraan Nasim; dev at dpdk.org > Cc: Yuanzhang Hu > Subject: Re: [dpdk-dev] Load-balancing position field in DPDK > load_balancer sample app vs. Hash table > > > -Original Message- > > From: Yerden Zhumabekov [mailto:e_zhumabekov at sts.kz] > > Sent: Friday, November 14, 2014 4:23 PM > > To: Ananyev, Konstantin; Kamraan Nasim; dev at dpdk.org > > Cc: Yuanzhang Hu > > Subject: Re: [dpdk-dev] Load-balancing position field in DPDK > > load_balancer sample app vs. Hash table > > > > I'd like to interject a question here. > > > > In case of flow classification, one might possibly prefer for packets > > from the same flow to fall on the same logical core. With this '%' > > load balancing, it would require to get the same RSS hash value for > > packets with direct (src to dst) and swapped (dst to src) IPs and > > ports. Am I correct that hardware RSS calculation cannot provide this > symmetry? 
> > As I remember, it is possible but you have to tweak rss key values. > Here is a paper describing how to do that: > http://www.ndsl.kaist.edu/~shinae/papers/TR-symRSS.pdf > > Konstantin > > > > > 14.11.2014 20:44, Ananyev, Konstantin ?: > > > If you have a NIC that is capable to do HW hash computation, then > > > you can do your load balancing based on that value. > > > Let say ixgbe/igb/i40e NICs can calculate RSS hash value based on > > > different combinations of dst/src Ips, dst/src ports. > > > This value can be stored inside mbuf for each RX packet by PMD RX > function. > > > Then you can do: > > > worker_id = mbuf->hash.rss % n_workersl > > > > > > That might to provide better balancing then using just one byte > > > value, plus should be a bit faster, as in that case your balancer code > don't need to touch packet's data. > > > > > > Konstantin > > > > -- > > Sincerely, > > > > Yerden Zhumabekov > > State Technical Service > > Astana, KZ > > > >
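Konstantin's one-liner is worth stating as code: because every packet of a flow carries the same RSS hash, reducing the hash modulo the worker count pins the whole flow to one worker, and with a symmetric key both directions of the connection land on the same worker. A trivial sketch (the function name is illustrative):

```c
#include <stdint.h>

/* All packets of a flow carry the same RSS hash, so a modulo over the
 * hash assigns the whole flow to one worker; with a symmetric RSS key,
 * both directions of the connection pick the same worker. */
static inline unsigned pick_worker(uint32_t rss_hash, unsigned n_workers)
{
    return rss_hash % n_workers;
}
```

In a DPDK application the hash would come from mbuf->hash.rss on each received packet, with no need to touch the packet data.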
[dpdk-dev] Load-balancing position field in DPDK load_balancer sample app vs. Hash table
Fantastic. Thanks for the assist. -- Matt Laswell laswell at infiniteio.com infinite io, inc. On Sat, Nov 15, 2014 at 1:10 AM, Yerden Zhumabekov wrote: > Hello Matt, > > You can specify RSS configuration through rte_eth_dev_configure() function > supplied with this structure: > > struct rte_eth_conf port_conf = { > .rxmode = { > .mq_mode= ETH_MQ_RX_RSS, > ... > }, > .rx_adv_conf = { > .rss_conf = { > .rss_key = NULL, > .rss_hf = ETH_RSS_IPV4 | ETH_RSS_IPV6, > }, > }, > . > }; > > In this case, RSS-hash is calculated over IP addresses only and with > default RSS key. Look at lib/librte_ether/rte_ethdev.h for other > definitions. > > > 15.11.2014 0:49, Matt Laswell ?: > > Hey Folks, > > This thread has been tremendously helpful, as I'm looking at adding > RSS-based load balancing to my application in the not too distant future. > Many thanks to all who have contributed, especially regarding symmetric RSS. > > Not to derail the conversation too badly, but could one of you point me > to some example code that demonstrates the steps needed to configure RSS? > We're using Niantic NICs, so I assume that this is pretty standard stuff, > but having an example to study is a real leg up. > > Again, thanks for all of the information. > > -- > Matt Laswell > laswell at infiniteio.com > infinite io, inc. > > On Fri, Nov 14, 2014 at 10:57 AM, Chilikin, Andrey < > andrey.chilikin at intel.com> wrote: > >> Fortville supports symmetrical hashing on HW level, a patch for i40e PMD >> was submitted a couple of weeks ago. For Niantic you can use symmetrical >> rss key recommended by Konstantin. >> >> Regards, >> Andrey >> >> -Original Message- >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin >> Sent: Friday, November 14, 2014 4:50 PM >> To: Yerden Zhumabekov; Kamraan Nasim; dev at dpdk.org >> Cc: Yuanzhang Hu >> Subject: Re: [dpdk-dev] Load-balancing position field in DPDK >> load_balancer sample app vs. 
Hash table >> >> > -Original Message- >> > From: Yerden Zhumabekov [mailto:e_zhumabekov at sts.kz] >> > Sent: Friday, November 14, 2014 4:23 PM >> > To: Ananyev, Konstantin; Kamraan Nasim; dev at dpdk.org >> > Cc: Yuanzhang Hu >> > Subject: Re: [dpdk-dev] Load-balancing position field in DPDK >> > load_balancer sample app vs. Hash table >> > >> > I'd like to interject a question here. >> > >> > In case of flow classification, one might possibly prefer for packets >> > from the same flow to fall on the same logical core. With this '%' >> > load balancing, it would require to get the same RSS hash value for >> > packets with direct (src to dst) and swapped (dst to src) IPs and >> > ports. Am I correct that hardware RSS calculation cannot provide this >> symmetry? >> >> As I remember, it is possible but you have to tweak rss key values. >> Here is a paper describing how to do that: >> http://www.ndsl.kaist.edu/~shinae/papers/TR-symRSS.pdf >> >> Konstantin >> >> > >> > 14.11.2014 20:44, Ananyev, Konstantin ?: >> > > If you have a NIC that is capable to do HW hash computation, then >> > > you can do your load balancing based on that value. >> > > Let say ixgbe/igb/i40e NICs can calculate RSS hash value based on >> > > different combinations of dst/src Ips, dst/src ports. >> > > This value can be stored inside mbuf for each RX packet by PMD RX >> function. >> > > Then you can do: >> > > worker_id = mbuf->hash.rss % n_workersl >> > > >> > > That might to provide better balancing then using just one byte >> > > value, plus should be a bit faster, as in that case your balancer >> code don't need to touch packet's data. >> > > >> > > Konstantin >> > >> > -- >> > Sincerely, >> > >> > Yerden Zhumabekov >> > State Technical Service >> > Astana, KZ >> > >> >> > > -- > Sincerely, > > Yerden Zhumabekov > State Technical Service > Astana, KZ > >
[dpdk-dev] capture packets on VM
Hey Raja, When you bind the ports to the DPDK poll mode drivers, the kernel no longer has visibility into them. This makes some sense intuitively - it would be very bad for both the kernel and a user mode application to both attempt to control the ports. This is why tools like tcpdump and wireshark don't work (and why the ports don't show up in ifconfig generally). If you just want to know that packets are flowing, an easy way to do it is simply to emit messages (via printf or the logging subsystem of your choice) or increment counters when you receive packets. If you want to verify a little bit of information about the packets but don't need full capture, you can either add some parsing information to your messages, or build out more stats. However, if you want to actually capture the packet contents, it's a little trickier. You can write your own packet-capture application, of course, but that might be a bigger task than you're looking for. You can also instantiate a KNI interface and either copy or forward the packets to it (and, from there, you can do tcpdump on the kernel side of the interface). I seem to recall that there's been some work done on tcpdump-like applications within DPDK, but don't remember what state those efforts are in presently. -- Matt Laswell laswell at infinite.io infinite io, inc. On Fri, Jul 15, 2016 at 12:54 AM, Raja Jayapal wrote: > Hi All, > > I have installed dpdk on VM and would like to know how to capture the > packets on dpdk ports. > I am sending traffic from host and want to know how to confirm whether > the packets are flowing via dpdk ports. > I tried with tcpdump and wireshark but could not capture the packets > inside VM. > setup : bridge1(Host)--- VM(Guest with DPDK) - bridge2(Host) > > Please suggest. > > Thanks, > Raja
[dpdk-dev] Packet Cloning
Since Padam is going to be altering payload, he likely cannot use that API. The rte_pktmbuf_clone() API doesn't make a copy of the payload. Instead, it gives you a second mbuf whose payload pointer points back to the contents of the first (and also increments the reference counter on the first so that it isn't actually freed until all clones are accounted for). This is very fast, which is good. However, since there's only really one buffer full of payload, changes in the original also affect the clone and vice versa. This can have surprising and unpleasant side effects that may not show up until you are under load, which is awesome*. For what it's worth, if you need to be able to modify the copy while leaving the original alone, I don't believe that there's a good solution within DPDK. However, writing your own API to copy rather than clone a packet mbuf isn't difficult. -- Matt Laswell infinite io, inc. laswell at infiniteio.com * Don't ask me how I know how much awesome fun this can be, though I suspect you can guess. On Thu, May 28, 2015 at 9:52 AM, Stephen Hemminger < stephen at networkplumber.org> wrote: > On Thu, 28 May 2015 17:15:42 +0530 > Padam Jeet Singh wrote: > > > Hello, > > > > Is there a function in DPDK to completely clone a pkt_mbuf including the > segments? > > > > I am trying to build a packet mirroring application which sends packet > out through two separate interfaces, but the packet payload needs to be > altered before send. > > > > Thanks, > > Padam > > > > > > Isn't this what you want? > > /** > * Creates a "clone" of the given packet mbuf. > * > * Walks through all segments of the given packet mbuf, and for each of > them: > * - Creates a new packet mbuf from the given pool. > * - Attaches newly created mbuf to the segment. > * Then updates pkt_len and nb_segs of the "clone" packet mbuf to match > values > * from the original packet mbuf. > * > * @param md > * The packet mbuf to be cloned. 
> * @param mp > * The mempool from which the "clone" mbufs are allocated. > * @return > * - The pointer to the new "clone" mbuf on success. > * - NULL if allocation fails. > */ > static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md, > struct rte_mempool *mp) >
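To make the clone-versus-copy distinction concrete, here is a deep-copy sketch over a deliberately simplified segment chain. struct seg is a toy stand-in, not the real rte_mbuf (which also carries reference counts, headroom, offload flags, and more), and later DPDK releases added rte_pktmbuf_copy() to do this natively:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a chained packet buffer. */
struct seg {
    struct seg *next;
    uint16_t    len;
    uint8_t     data[256];
};

/* Deep copy: unlike a clone, the copy owns private payload, so a
 * mirroring path can rewrite headers without disturbing the original. */
static struct seg *seg_deep_copy(const struct seg *src)
{
    struct seg *head = NULL, **tail = &head;

    for (; src != NULL; src = src->next) {
        struct seg *s = calloc(1, sizeof(*s));
        if (s == NULL) {            /* unwind on allocation failure */
            while (head != NULL) {
                struct seg *n = head->next;
                free(head);
                head = n;
            }
            return NULL;
        }
        s->len = src->len;
        memcpy(s->data, src->data, src->len);
        *tail = s;
        tail = &s->next;
    }
    return head;
}
```

With a copy like this, modifying the duplicate leaves the original untouched, which is exactly the property rte_pktmbuf_clone() does not give you.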
[dpdk-dev] Packet Cloning
Hey Kyle, That's one way you can handle it, though I suspect you'll end up with some complexity elsewhere in your code to deal with remembering whether you should look at the original data or the copied and modified data. Another way is just to make a copy of the original mbuf, but have your copy API stop after it reaches some particular point. Perhaps just the L2-L4 headers, perhaps a few hundred bytes into payload, or perhaps something else entirely. This all gets very application dependent, of course. How much is "enough" is going to depend heavily on what you're trying to accomplish. -- Matt Laswell infinite io, inc. laswell at infiniteio.com On Thu, May 28, 2015 at 10:38 AM, Kyle Larose wrote: > I'm fairly new to dpdk, so I may be completely out to lunch on this, but > here's an idea to possibly improve performance compared to a straight copy > of the entire packet. If this idea makes sense, perhaps it could be added > to the mbuf library as an extension of the clone functionality? > > If you are only modifying the headers (say the Ethernet header), is it > possible to make a copy of only the first N bytes (say 32 bytes)? > > For example, you make two new "main" mbufs, which contain duplicate > metadata, and a copy of the first 32 bytes of the packet. Call them A and > B. Have both A and B chain to the original mbuf (call it O), which is > reference counted as with the normal clone functionality. Then, you adjust > the O such that its start data is 32 bytes into the packet. > > When you transmit A, it will send its own copy of the 32 bytes, plus the > unaltered remaining data contained in O. A will be freed, and the refcount > of O decremented. When you transmit B, it will work the same as with the > previous one, except that when the refcount on O is decremented, it reaches > zero and it is freed as well. > > I'm not sure if this makes sense in all cases (for example, maybe it's > just faster to allocate separate mbufs for 64-byte packets). 
Perhaps that > could also be handled transparently underneath the hood. > > Thoughts? > > Thanks, > > Kyle > > On Thu, May 28, 2015 at 11:10 AM, Matt Laswell > wrote: > >> Since Padam is going to be altering payload, he likely cannot use that >> API. >> The rte_pktmbuf_clone() API doesn't make a copy of the payload. Instead, >> it gives you a second mbuf whose payload pointer points back to the >> contents of the first (and also increments the reference counter on the >> first so that it isn't actually freed until all clones are accounted for). >> This is very fast, which is good. However, since there's only really one >> buffer full of payload, changes in the original also affect the clone and >> vice versa. This can have surprising and unpleasant side effects that may >> not show up until you are under load, which is awesome*. >> >> For what it's worth, if you need to be able to modify the copy while >> leaving the original alone, I don't believe that there's a good solution >> within DPDK. However, writing your own API to copy rather than clone a >> packet mbuf isn't difficult. >> >> -- >> Matt Laswell >> infinite io, inc. >> laswell at infiniteio.com >> >> * Don't ask me how I know how much awesome fun this can be, though I >> suspect you can guess. >> >> On Thu, May 28, 2015 at 9:52 AM, Stephen Hemminger < >> stephen at networkplumber.org> wrote: >> >> > On Thu, 28 May 2015 17:15:42 +0530 >> > Padam Jeet Singh wrote: >> > >> > > Hello, >> > > >> > > Is there a function in DPDK to completely clone a pkt_mbuf including >> the >> > segments? >> > > >> > > I am trying to build a packet mirroring application which sends packet >> > out through two separate interfaces, but the packet payload needs to be >> > altered before send. >> > > >> > > Thanks, >> > > Padam >> > > >> > > >> > >> > Isn't this what you want? >> > >> > /** >> > * Creates a "clone" of the given packet mbuf. 
>> > * >> > * Walks through all segments of the given packet mbuf, and for each of >> > them: >> > * - Creates a new packet mbuf from the given pool. >> > * - Attaches newly created mbuf to the segment. >> > * Then updates pkt_len and nb_segs of the "clone" packet mbuf to match >> > values >> > * from the original packet mbuf. >> > * >> > * @param md >> > * The packet mbuf to be cloned. >> > * @param mp >> > * The mempool from which the "clone" mbufs are allocated. >> > * @return >> > * - The pointer to the new "clone" mbuf on success. >> > * - NULL if allocation fails. >> > */ >> > static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md, >> > struct rte_mempool *mp) >> > >> > >
[dpdk-dev] How to approach packet TX lockups
Hey Folks, I sent this to the users email list, but I'm not sure how many people are actively reading that list at this point. I'm dealing with a situation in which my application loses the ability to transmit packets out of a port during times of moderate stress. I'd love to hear suggestions for how to approach this problem, as I'm a bit at a loss at the moment. Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on Haswell processors. I'm using the 82599 controller, configured to spread packets across multiple queues. Each queue is accessed by a different lcore in my application; there is therefore concurrent access to the controller, but not to any of the queues. We're binding the ports to the igb_uio driver. The symptoms I see are these: - All transmit out of a particular port stops - rte_eth_tx_burst() indicates that it is sending all of the packets that I give to it - rte_eth_stats_get() gives me stats indicating that no packets are being sent on the affected port. Also, no tx errors, and no pause frames sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.) - All other ports continue to work normally - The affected port continues to receive packets without problems; only TX is affected - Resetting the port via rte_eth_dev_stop() and rte_eth_dev_start() restores things and packets can flow again - The problem is replicable on multiple devices, and doesn't follow one particular port I've tried calling rte_mbuf_sanity_check() on all packets before sending them. I've also instrumented my code to look for packets that have already been sent or freed, as well as cycles in chained packets being sent. I also put a lock around all accesses to rte_eth* calls to synchronize access to the NIC. Given some recent discussion here, I also tried changing the TX RS threshold from 0 to 32, 16, and 1. None of these strategies proved effective. Like I said at the top, I'm a little at a loss at this point. 
If you were dealing with this set of symptoms, how would you proceed? Thanks in advance. -- Matt Laswell infinite io, inc. laswell at infiniteio.com
[dpdk-dev] How to approach packet TX lockups
Hey Stephen, Thanks a lot; that's really useful information. Unfortunately, I'm at a stage in our release cycle where upgrading to a new version of DPDK isn't feasible. Any chance you (or others reading this) has a pointer to the relevant changes? While I can't afford to upgrade DPDK entirely, backporting targeted fixes is more doable. Again, thanks. - Matt On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger < stephen at networkplumber.org> wrote: > On Mon, 16 Nov 2015 17:48:35 -0600 > Matt Laswell wrote: > > > Hey Folks, > > > > I sent this to the users email list, but I'm not sure how many people are > > actively reading that list at this point. I'm dealing with a situation > in > > which my application loses the ability to transmit packets out of a port > > during times of moderate stress. I'd love to hear suggestions for how to > > approach this problem, as I'm a bit at a loss at the moment. > > > > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on Haswell > > processors. I'm using the 82599 controller, configured to spread packets > > across multiple queues. Each queue is accessed by a different lcore in > my > > application; there is therefore concurrent access to the controller, but > > not to any of the queues. We're binding the ports to the igb_uio driver. > > The symptoms I see are these: > > > > > >- All transmit out of a particular port stops > >- rte_eth_tx_burst() indicates that it is sending all of the packets > >that I give to it > >- rte_eth_stats_get() gives me stats indicating that no packets are > >being sent on the affected port. Also, no tx errors, and no pause > frames > >sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.) 
> >- All other ports continue to work normally > >- The affected port continues to receive packets without problems; > only > >TX is affected > >- Resetting the port via rte_eth_dev_stop() and rte_eth_dev_start() > >restores things and packets can flow again > >- The problem is replicable on multiple devices, and doesn't follow > one > >particular port > > > > I've tried calling rte_mbuf_sanity_check() on all packets before sending > > them. I've also instrumented my code to look for packets that have > already > > been sent or freed, as well as cycles in chained packets being sent. I > > also put a lock around all accesses to rte_eth* calls to synchronize > access > > to the NIC. Given some recent discussion here, I also tried changing the > > TX RS threshold from 0 to 32, 16, and 1. None of these strategies proved > > effective. > > > > Like I said at the top, I'm a little at a loss at this point. If you > were > > dealing with this set of symptoms, how would you proceed? > > > > I remember some issues with old DPDK 1.6 with some of the prefetch > thresholds on 82599. You would be better off going to a later DPDK > version. >
[dpdk-dev] How to approach packet TX lockups
Yes, we're on 1.6r2. That said, I've tried a number of different values for the thresholds without a lot of luck. Setting wthresh/hthresh/pthresh to 0/0/32 or 0/0/0 doesn't appear to fix things. And, as Matthew suggested, I'm pretty sure using 0 for the thresholds leads to auto-config by the driver. I also tried 1/1/32, which required that I also change the rs_thresh value from 0 to 1 to work around a panic in PMD initialization ("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1"). Any other suggestions? On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger < stephen at networkplumber.org> wrote: > On Mon, 16 Nov 2015 18:49:15 -0600 > Matt Laswell wrote: > > > Hey Stephen, > > > > Thanks a lot; that's really useful information. Unfortunately, I'm at a > > stage in our release cycle where upgrading to a new version of DPDK isn't > > feasible. Any chance you (or others reading this) has a pointer to the > > relevant changes? While I can't afford to upgrade DPDK entirely, > > backporting targeted fixes is more doable. > > > > Again, thanks. > > > > - Matt > > > > > > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger < > > stephen at networkplumber.org> wrote: > > > > > On Mon, 16 Nov 2015 17:48:35 -0600 > > > Matt Laswell wrote: > > > > > > > Hey Folks, > > > > > > > > I sent this to the users email list, but I'm not sure how many > people are > > > > actively reading that list at this point. I'm dealing with a > situation > > > in > > > > which my application loses the ability to transmit packets out of a > port > > > > during times of moderate stress. I'd love to hear suggestions for > how to > > > > approach this problem, as I'm a bit at a loss at the moment. > > > > > > > > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on > Haswell > > > > processors. I'm using the 82599 controller, configured to spread > packets > > > > across multiple queues. 
Each queue is accessed by a different lcore > in > > > my > > > > application; there is therefore concurrent access to the controller, > but > > > > not to any of the queues. We're binding the ports to the igb_uio > driver. > > > > The symptoms I see are these: > > > > > > > > > > > >- All transmit out of a particular port stops > > > >- rte_eth_tx_burst() indicates that it is sending all of the > packets > > > >that I give to it > > > >- rte_eth_stats_get() gives me stats indicating that no packets > are > > > >being sent on the affected port. Also, no tx errors, and no pause > > > frames > > > >sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.) > > > >- All other ports continue to work normally > > > >- The affected port continues to receive packets without problems; > > > only > > > >TX is affected > > > >- Resetting the port via rte_eth_dev_stop() and > rte_eth_dev_start() > > > >restores things and packets can flow again > > > >- The problem is replicable on multiple devices, and doesn't > follow > > > one > > > >particular port > > > > > > > > I've tried calling rte_mbuf_sanity_check() on all packets before > sending > > > > them. I've also instrumented my code to look for packets that have > > > already > > > > been sent or freed, as well as cycles in chained packets being > sent. I > > > > also put a lock around all accesses to rte_eth* calls to synchronize > > > access > > > > to the NIC. Given some recent discussion here, I also tried > changing the > > > > TX RS threshold from 0 to 32, 16, and 1. None of these strategies > proved > > > > effective. > > > > > > > > Like I said at the top, I'm a little at a loss at this point. If you > > > were > > > > dealing with this set of symptoms, how would you proceed? > > > > > > > > > > I remember some issues with old DPDK 1.6 with some of the prefetch > > > thresholds on 82599. You would be better off going to a later DPDK > > > version. > > > > > I hope you are on 1.6.0r2 at least?? 
> > With older DPDK there was no way to get driver to tell you what the > preferred settings were for pthresh/hthresh/wthresh. And the values > in Intel sample applications were broken on some hardware. > > I remember reverse engineering the safe values from reading the Linux > driver. > > The Linux driver is much better tested than the DPDK one... > In the Linux driver, the Transmit Descriptor Controller (txdctl) > is fixed at (for transmit) >wthresh = 1 >hthresh = 1 >pthresh = 32 > > The DPDK 2.2 driver uses: > wthresh = 0 > hthresh = 0 > pthresh = 32 > > > > > > >
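The values Stephen quotes map onto the ethdev TX queue configuration roughly as follows. This is a sketch only: field names follow the classic `rte_eth_txconf` layout and have not been checked against 1.6-era headers.

```c
/* Sketch: Linux-ixgbe-style thresholds expressed as an ethdev TX config.
 * Note that wthresh > 0 forces tx_rs_thresh <= 1, which is exactly the
 * PMD panic Matt hit when trying 1/1/32 with rs_thresh left at 0. */
static const struct rte_eth_txconf tx_conf = {
    .tx_thresh = {
        .pthresh = 32,   /* prefetch threshold: 32 in both drivers */
        .hthresh = 1,    /* host threshold: Linux uses 1, DPDK 2.2 uses 0 */
        .wthresh = 1,    /* write-back threshold: Linux 1, DPDK 2.2 uses 0 */
    },
    .tx_rs_thresh = 1,   /* required once wthresh is non-zero */
};
```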
[dpdk-dev] How to approach packet TX lockups
Hey Konstantin, Moving from 1.6r2 to 2.2 is going to be a pretty significant change due to things like changes in the MBuf format, API differences, etc. Even as an experiment, that's an awfully large change to absorb. Is there a subset that you're referring to that could be more readily included without modifying so many touch points into DPDK? For reference, my transmit function is rte_eth_tx_burst(). It seems to reliably tell me that it has enqueued all of the packets that I gave it, however the stats from rte_eth_stats_get() indicate that no packets are actually being sent. Thanks, - Matt On Tue, Nov 17, 2015 at 8:44 AM, Ananyev, Konstantin < konstantin.ananyev at intel.com> wrote: > > > > -Original Message- > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell > > Sent: Tuesday, November 17, 2015 2:24 PM > > To: Stephen Hemminger > > Cc: dev at dpdk.org > > Subject: Re: [dpdk-dev] How to approach packet TX lockups > > > > Yes, we're on 1.6r2. That said, I've tried a number of different values > > for the thresholds without a lot of luck. Setting > wthresh/hthresh/pthresh > > to 0/0/32 or 0/0/0 doesn't appear to fix things. And, as Matthew > > suggested, I'm pretty sure using 0 for the thresholds leads to > auto-config > > by the driver. I also tried 1/1/32, which required that I also change > the > > rs_thresh value from 0 to 1 to work around a panic in PMD initialization > > ("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1"). > > > > Any other suggestions? > > That's not only DPDK code changed since 1.6. > I am pretty sure that we also have a new update of shared code since then > (and as I remember probably more than one). > One suggestion would be at least try to upgrade the shared code up to the > latest. > Another one - even if you can't upgrade to 2.2 in you production > environment, > it probably worth to do that in some test environment and then check does > the problem persist. 
> If yes, then we'll need some guidance how to reproduce it. > > Another question it is not clear what TX function do you use? > Konstantin > > > > > On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger < > > stephen at networkplumber.org> wrote: > > > > > On Mon, 16 Nov 2015 18:49:15 -0600 > > > Matt Laswell wrote: > > > > > > > Hey Stephen, > > > > > > > > Thanks a lot; that's really useful information. Unfortunately, I'm > at a > > > > stage in our release cycle where upgrading to a new version of DPDK > isn't > > > > feasible. Any chance you (or others reading this) has a pointer to > the > > > > relevant changes? While I can't afford to upgrade DPDK entirely, > > > > backporting targeted fixes is more doable. > > > > > > > > Again, thanks. > > > > > > > > - Matt > > > > > > > > > > > > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger < > > > > stephen at networkplumber.org> wrote: > > > > > > > > > On Mon, 16 Nov 2015 17:48:35 -0600 > > > > > Matt Laswell wrote: > > > > > > > > > > > Hey Folks, > > > > > > > > > > > > I sent this to the users email list, but I'm not sure how many > > > people are > > > > > > actively reading that list at this point. I'm dealing with a > > > situation > > > > > in > > > > > > which my application loses the ability to transmit packets out > of a > > > port > > > > > > during times of moderate stress. I'd love to hear suggestions > for > > > how to > > > > > > approach this problem, as I'm a bit at a loss at the moment. > > > > > > > > > > > > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on > > > Haswell > > > > > > processors. I'm using the 82599 controller, configured to spread > > > packets > > > > > > across multiple queues. Each queue is accessed by a different > lcore > > > in > > > > > my > > > > > > application; there is therefore concurrent access to the > controller, > > > but > > > > > > not to any of the queues. We're binding the ports to the igb_uio > > > driver. 
> > > > > > The symptoms I see are these: > > > > > > >
[dpdk-dev] How to approach packet TX lockups
Thanks, I'll give that a try. In my environment, I'm pretty sure we're using the fully-featured ixgbe_xmit_pkts() and not _simple(). If setting rs_thresh=1 is safer, I'll stick with that. Again, thanks to all for the assistance. - Matt On Tue, Nov 17, 2015 at 10:20 AM, Ananyev, Konstantin < konstantin.ananyev at intel.com> wrote: > Hi Matt, > > > > As I said, at least try to upgrade contents of shared code to the latest > one. > > In previous releases: lib/librte_pmd_ixgbe/ixgbe, now located at: > drivers/net/ixgbe/. > > > > > For reference, my transmit function is rte_eth_tx_burst(). > > I meant what ixgbe TX function it points to: ixgbe_xmit_pkts or > ixgbe_xmit_pkts_simple()? > > For ixgbe_xmit_pkts_simple() don't set tx_rs_thresh > 32, > > for ixgbe_xmit_pkts() the safest way is to set tx_rs_thresh=1. > > Though as I understand from your previous mails, you already did that, and > it didn't help. > > Konstantin > > > > > > *From:* Matt Laswell [mailto:laswell at infiniteio.com] > *Sent:* Tuesday, November 17, 2015 3:05 PM > *To:* Ananyev, Konstantin > *Cc:* Stephen Hemminger; dev at dpdk.org > > *Subject:* Re: [dpdk-dev] How to approach packet TX lockups > > > > Hey Konstantin, > > > > Moving from 1.6r2 to 2.2 is going to be a pretty significant change due to > things like changes in the MBuf format, API differences, etc. Even as an > experiment, that's an awfully large change to absorb. Is there a subset > that you're referring to that could be more readily included without > modifying so many touch points into DPDK? > > > > For reference, my transmit function is rte_eth_tx_burst(). It seems to > reliably tell me that it has enqueued all of the packets that I gave it, > however the stats from rte_eth_stats_get() indicate that no packets are > actually being sent. 
> > > > Thanks, > > > > - Matt > > > > On Tue, Nov 17, 2015 at 8:44 AM, Ananyev, Konstantin < > konstantin.ananyev at intel.com> wrote: > > > > > -Original Message- > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell > > Sent: Tuesday, November 17, 2015 2:24 PM > > To: Stephen Hemminger > > Cc: dev at dpdk.org > > Subject: Re: [dpdk-dev] How to approach packet TX lockups > > > > Yes, we're on 1.6r2. That said, I've tried a number of different values > > for the thresholds without a lot of luck. Setting wthresh/hthresh/ > pthresh > > to 0/0/32 or 0/0/0 doesn't appear to fix things. And, as Matthew > > suggested, I'm pretty sure using 0 for the thresholds leads to auto- > config > > by the driver. I also tried 1/1/32, which required that I also change > the > > rs_thresh value from 0 to 1 to work around a panic in PMD initialization > > ("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1"). > > > > Any other suggestions? > > That's not only DPDK code changed since 1.6. > I am pretty sure that we also have a new update of shared code since then > (and as I remember probably more than one). > One suggestion would be at least try to upgrade the shared code up to the > latest. > Another one - even if you can't upgrade to 2.2 in you production > environment, > it probably worth to do that in some test environment and then check does > the problem persist. > If yes, then we'll need some guidance how to reproduce it. > > Another question it is not clear what TX function do you use? > Konstantin > > > > > > On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger < > > stephen at networkplumber.org> wrote: > > > > > On Mon, 16 Nov 2015 18:49:15 -0600 > > > Matt Laswell wrote: > > > > > > > Hey Stephen, > > > > > > > > Thanks a lot; that's really useful information. Unfortunately, I'm > at a > > > > stage in our release cycle where upgrading to a new version of DPDK > isn't > > > > feasible. 
Any chance you (or others reading this) has a pointer to > the > > > > relevant changes? While I can't afford to upgrade DPDK entirely, > > > > backporting targeted fixes is more doable. > > > > > > > > Again, thanks. > > > > > > > > - Matt > > > > > > > > > > > > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger < > > > > stephen at networkplumber.org> wrote: > > > > > > > > > On Mon, 16 Nov 2015 17:48:35 -0600 > > > > > Matt Laswell wrote: >
[dpdk-dev] DPDK Port Mirroring
Keith speaks truth. If I were going to do what you're describing, I would do the following: 1. Start with the l2fwd example application. 2. Remove the part where it modifies the Ethernet MAC address of received packets. 3. Add a call to clone mbufs via rte_pktmbuf_clone() and send the cloned packets out of the port of your choice. As long as you don't need to modify the packets - and if you're mirroring, you shouldn't - simply cloning received packets and sending them out your mirror port should get you most of the way there. On Thu, Jul 9, 2015 at 3:17 PM, Wiles, Keith wrote: > > > On 7/9/15, 12:26 PM, "dev on behalf of Assaad, Sami (Sami)" > > wrote: > > >Hello, > > > >I want to build a DPDK app that is able to port-mirror all ingress > >traffic from two 10G interfaces. > > > >1. Is it possible in port-mirroring traffic consisting of 450byte > >packets at 20G without losing more than 5% of traffic? > > > >2. Would you have any performance results due to packet copying? > > Do you need to copy the packet if you increment the reference count you > can send the packet to both ports without having to copy the packet. > > > >3. Would you have any port mirroring DPDK sample code? > > DPDK does not have port mirroring example, but you could grab the l2fwd or > l3fwd and modify it to do what you want. > > > >Thanks in advance. > > > >Best Regards, > >Sami Assaad. > >
[dpdk-dev] Kernel panic in KNI
Hey Robert, Thanks for the insight. I work with Jay on the code he's asking about; we only have one mbuf pool that we use for all packets. Mostly, this is for the reasons that you describe, as well as for the sake of simplicity. As it happens, the stack trace we're seeing makes it look as though either the mbuf's data pointer is screwed up, or the VA translation done on it is. I suspect that we're getting to a failure mode similar to the one you experienced, though perhaps for different reasons. Thanks, Matt On Wed, Apr 6, 2016 at 5:30 PM, Sanford, Robert wrote: > Hi Jay, > > I won't try to interpret your kernel stack trace. But, I'll tell you about > a KNI-related problem that we once experienced, and the symptom was a > kernel hang. > > The problem was that we were passing mbufs allocated out of one mempool, > to a KNI context that we had set up with a different mempool (on a > different CPU socket). The KNI kernel driver, converts the user-space mbuf > virtual address (VA) to a kernel VA by adding the difference between the > user and kernel VAs of the mempool used to create the KNI context. So, if > an mbuf comes from a different mempool, the calculated address will > probably be VERY BAD. > > Could this be your problem? > > -- > Robert > > > On 4/6/16 4:16 PM, "Jay Rolette" wrote: > > >I had a system lockup hard a couple of days ago and all we were able to > >get > >was a photo of the LCD monitor with most of the kernel panic on it. No way > >to scroll back the buffer and nothing in the logs after we rebooted. Not > >surprising with a kernel panic due to an exception during interrupt > >processing. We have a serial console attached in case we are able to get > >it > >to happen again, but it's not easy to reproduce (hours of runtime for this > >instance). 
> > > >Ran the photo through OCR software to get a text version of the dump, so > >possible I missed some fixups in this: > > > >[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI: > >a12520669126180a > >[39178.464020] RBP: 880433966970 R08: a12520669126180a R09: > >881fd2f35000 > >[39178.495091] R10: R11: 881fd2f88000 R12: > >883fdla75ee8 > >[39178.526594] R13: 00ba R14: 7fdad5a66780 R15: > >883715ab6780 > >[39178.559011] FS: 77fea740() GS:88lfffc0() > >knlGS: > >[39178.592005] CS: 0010 DS: ES: CR0: 80050033 > >[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4: > >001407f0 > >[39178.656187] Stack: > >[39178.689025] c067c7ef 00ba 00ba > >881fd2f88000 > >[39178.722682] 4000 8B3fd0bbd09c 883fdla75ee8 > >8804339bb9c8 > >[39178.756525] 81658456 881fcd2ec40c c0680700 > >880436bad800 > >[39178.790577] Call Trace: > >[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni] > >[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0 > >[39178.893426] [] sch_direct_xmit+0xee/0xic0 > >[39178.927435] [l __dev_queue_xmit+0x200/0x4d0 > >[39178.961684] [l dev_queue_xmit+0x10/0x20 > >[39178.996194] [] neigh_connected_output+0x67/0x100 > >[39179.031098] [] ip_finish_output+0xid8/0x850 > >[39179.066709] [l ip_output+0x58/0x90 > >[39179.101551] [] ip_local_out_sk+0x30/0x40 > >[39179.136823] [] ip_queue_xmit+0xl3f/0x3d0 > >[39179.171742] [] tcp_transmit_skb+0x47c/0x900 > >[39179.206854] [l tcp_write_xmit+0x110/0xcb0 > >[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0 > >[39179.277632] [] tcp_push+0xec/0x120 > >[39179.311768] [] tcp_sendmsg+0xb9/0xce0 > >[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0 > >[39179.385586] [] inet_sendmsg+0x64/0x60 > >[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30 > >[39179.4586581 [] sock_sendmsg+0x86/0xc0 > >[39179.493220] [] ? __inet_stream_connect+0xa5/0x320 > >[39179.528033] [] ? __fdget+0x13/0x20 > >[39179.561214] [] SYSC_sendto+0x121/0x1c0 > >[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150 > >[39179.6268931 [] ? read_tsc+0x9/0x20 > >[39179.6586541 [] ? 
ktime_get_ts+0x48/0xe0 > >[39179.689944] [] SyS_sendto+0xe/0x10 > >[39179.719575] [] system_call_fastpath+0xia/0xif > >[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00 > >00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90 > > 90 48 89 f8 48 89 d1 a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20 > >4c > >8b % 4c 86 > >[39179.808690] RIP [] memcpy+0x6/0x110 > >[39179.837238] RSP > >[39179.933755] ---[ end trace
Re: [dpdk-dev] [PATCH 1/2] test: replace license text with SPDX tag
> -Original Message- > From: Legacy, Allain > Sent: Tuesday, August 13, 2019 8:20 AM > To: hemant.agra...@nxp.com > Cc: dev@dpdk.org; john.mcnam...@intel.com; marko.kovace...@intel.com; > cristian.dumitre...@intel.com; Peters, Matt > Subject: [PATCH 1/2] test: replace license text with SPDX tag > > Replacing full license text with SPDX tag. > > Signed-off-by: Allain Legacy > --- Acked-by: Matt Peters
Re: [dpdk-dev] [PATCH 2/2] doc: replace license text with SPDX tag
> -Original Message- > From: Legacy, Allain > Sent: Tuesday, August 13, 2019 8:20 AM > To: hemant.agra...@nxp.com > Cc: dev@dpdk.org; john.mcnam...@intel.com; marko.kovace...@intel.com; > cristian.dumitre...@intel.com; Peters, Matt > Subject: [PATCH 2/2] doc: replace license text with SPDX tag > > Replace full license text with SPDX tag. > > Signed-off-by: Allain Legacy Acked-by: Matt Peters
Re: [dpdk-dev] [PATCH v3] net/avp: remove resources when port is closed
> -Original Message- > From: Legacy, Allain > Sent: Tuesday, June 18, 2019 3:19 PM > To: tho...@monjalon.net > Cc: dev@dpdk.org; ferruh.yi...@intel.com; Peters, Matt > Subject: [PATCH v3] net/avp: remove resources when port is closed > > The rte_eth_dev_close() function now handles freeing resources for > devices (e.g., mac_addrs). To conform with the new close() behaviour we > are asserting the RTE_ETH_DEV_CLOSE_REMOVE flag so that > rte_eth_dev_close() releases all device level dynamic memory. > > Second level memory allocated to each individual rx/tx queue is now > freed as part of the close() operation therefore making it safe for the > rte_eth_dev_close() function to free the device private data without > orphaning the rx/tx queue pointers. > > Cc: Matt Peters > Signed-off-by: Allain Legacy Acked-by: Matt Peters
Re: [dpdk-dev] [PATCH v2] net/avp: remove resources when port is closed
> -Original Message- > From: Legacy, Allain > Sent: Monday, May 27, 2019 1:03 PM > To: tho...@monjalon.net > Cc: dev@dpdk.org; ferruh.yi...@intel.com; Peters, Matt > Subject: [PATCH v2] net/avp: remove resources when port is closed > > The rte_eth_dev_close() function now handles freeing resources for > devices (e.g., mac_addrs). To conform with the new close() behaviour we > are asserting the RTE_ETH_DEV_CLOSE_REMOVE flag so that > rte_eth_dev_close() releases all device level dynamic memory. > > Second level memory allocated to each individual rx/tx queue is now > freed as part of the close() operation therefore making it safe for the > rte_eth_dev_close() function to free the device private data without > orphaning the rx/tx queue pointers. > > Cc: Matt Peters > Signed-off-by: Allain Legacy > --- Acked-by: Matt Peters
[dpdk-dev] [dpdk-moving] Draft Project Charter
I think we need a discussion about the levels of membership - possibly at next week's meeting? My feeling is that we need more than one level - One to enable contribution of hardware to the lab, as the lab will add cost to the overall project budget - A second to enable contribution to the marketing aspects of the project and to allow association for marketing purposes Calling these Gold and Silver is fine with me, but as I say, let's discuss this at next week's meeting. Matt From: moving on behalf of O'Driscoll, Tim Sent: 08 November 2016 03:57:36 To: Vincent JARDIN; moving at dpdk.org Cc: dev at dpdk.org Subject: Re: [dpdk-moving] [dpdk-dev] Draft Project Charter > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vincent JARDIN > Sent: Tuesday, November 8, 2016 11:41 AM > To: moving at dpdk.org > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [dpdk-moving] Draft Project Charter > > Tim, > > Thanks for your draft, but it is not a good proposal. It is not written > in the spirit that we have discussed in Dublin: >- you create the status of "Gold" members that we do not want from > Linux Foundation, As I said in the email, I put in two levels of membership as a placeholder. The first thing we need to decide is if we want to have a budget and membership, or if we want the OVS model with 0 budget and no membership. We can discuss that at today's meeting. If we do want a membership model then we'll need to decide if everybody contributes at the same rate or if we support multiple levels. So, for now, the text on having two levels is just an example to show what a membership model might look like. >- you start with "DPDK's first $1,000,000", it is far from the $0 > that we agreed based on OVS model. That's just standard text that I see in all the LF charters. It's even in the OVS charter (http://openvswitch.org/charter/charter.pdf) even though they have 0 budget. I assumed it's standard text for the LF. I'm sure Mike Dolan can clarify. 
> > Please, explain why you did change it? > > Thank you, >Vincent
[dpdk-dev] Running kni with low amount of cores
Hello, I have two NIC devices and a quad core system that I'm trying to run kni on. I would like to leave two cores for general use and two cores for kni. When I run kni on just one of the ports, everything works fine and I can use that vEth normally. The exact command I run is this: ./kni -c 0x0c -n 2 -- -P -p 0x1 --config="(0,2,3)" But when I try to run kni on both ports, I can't find a configuration to make it work. Here are all the configs that I have tried, but none of them seem to work properly, the same way as just a single port: "(0,2,3), (1,2,3)", "(0,2,3), (1,3,2)", "(0,2,2), (1,3,3)". I'm wondering if it is supposed to work this way, where each port needs its own Tx and Rx core, or if there is a way to get around it. If it is supposed to work this way, would it be worth my time to edit the code to allow me to have all Rx information dealt with on one core and all Tx on another? Thanks, Matt Olson