Hi Gowrishankar, Sorry for late response.
On 5/18/2017 6:45 PM, gowrishankar muthukrishnan wrote: > On Tuesday 16 May 2017 10:45 PM, Ferruh Yigit wrote: >> On 5/11/2017 12:51 PM, Gowrishankar wrote: >>> From: Gowrishankar Muthukrishnan <gowrishanka...@linux.vnet.ibm.com> >>> >>> In kni_allocate_mbufs(), we attempt to add max_burst (32) count of mbuf >>> always into alloc_q, which is excessively leading too many rte_pktmbuf_ >>> free() when alloc_q is contending at high packet rate (for eg 10Gig data). >>> In a situation when alloc_q fifo can only accommodate very few (or zero) >>> mbuf, create only what needed and add in fifo. >> I remember I have tried similar, also tried allocating amount of >> nb_packets read from kernel, both produced worse performance. >> Can you please share your before/after performance numbers? > Sure Ferruh, please find below comparison of call counts I set at two places > along with additional stat on kni egress for more than one packet in txq > burst read, > as in pseudo code below: > > @@ -589,8 +592,12 @@ rte_kni_rx_burst(struct rte_kni *kni, struct > rte_mbuf **mbufs, unsigned num) > unsigned ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num); > > /* If buffers removed, allocate mbufs and then put them into > alloc_q */ > if (ret) { > ++alloc_call; > if (ret > 1) > alloc_call_mt1tx += ret; > kni_allocate_mbufs(kni); > } > > return ret; > } > @@ -659,6 +666,7 @@ kni_allocate_mbufs(struct rte_kni *kni) > if (ret >= 0 && ret < i && ret < MAX_MBUF_BURST_NUM) { > int j; > > freembuf_call += (i-ret); > for (j = ret; j < i; j++) > rte_pktmbuf_free(pkts[j]); > > > >> kni_allocate_mbufs() called within rte_kni_rx_burst() if any packet >> received from kernel. If there is a heavy traffic, kernel will always >> consume the alloc_q before this function called and this function will >> fill it back. So there shouldn't be much cases that alloc_q fifo already >> full. >> Perhaps this can happen if application burst Rx from kernel in a number >> less than 32, but fifo filled with fixed 32mbufs, is this your case? > > I think some resemblance to this case based on below stats. W/o patch, > application > would spend its most of processing in freeing mbufs right ?. > >> >> Can you measure number of times rte_pktmbuf_free() called because of >> alloc_q is full? > > I have sampled below data in x86_64 for KNI on ixgbe pmd. iperf server > runs on > remote interface connecting PMD and iperf client runs on KNI interface, > so as to > create more egress from KNI into DPDK (w/o and with this patch) for 1MB and > 100MB data. rx and tx stats are from kni app (USR1). > > 100MB w/o patch 1.28Gbps > rx tx alloc_call alloc_call_mt1tx freembuf_call > 3933 72464 51042 42472 1560540 Some math: alloc called 51042 times with allocating 32 mbufs each time, 51042 * 32 = 1633344 freed mbufs: 1560540 used mbufs: 1633344 - 1560540 = 72804 72804 =~ 72464, so looks correct. Which means rte_kni_rx_burst() called 51042 times and 72464 buffers received. As you already mentioned, for each call kernel able to put only 1-2 packets into the fifo. This number is close to 3 for my test with KNI PMD. And for this case, agree your patch looks reasonable. But what if kni has more egress traffic, that able to put >= 32 packets between each rte_kni_rx_burst()? For that case this patch introduces extra cost to get allocq_free count. Overall I am not disagree with patch, but I have concern if this would cause performance loss some cases while making better for this one. That would help a lot if KNI users test and comment. For me, applying patch didn't give any difference in final performance numbers, but if there is no objection, I am OK to get this patch. > > 1MB w/o patch 204Mbps > rx tx alloc_call alloc_call_mt1tx freembuf_call > 84 734 566 330 17378 > > 100MB w/ patch 1.23Gbps > rx tx alloc_call alloc_call_mt1tx freembuf_call > 4258 72466 72466 0 0 > > 1MB w/ patch 203Mbps > rx tx alloc_call alloc_call_mt1tx freembuf_call > 76 734 733 2 0 > > With patch, KNI egress on txq seems to be almost only one packet at a time > (and in 1MB test, a rare instance of more than 2 packets seen even > though it is > burst read). Also, as it is one mbuf consumed by module and added by lib at > a time, rte_pktmbuf_free is not called at all, due to right amount (1 or 2) > of mbufs enqueued in alloc_q. > > This controlled enqueue on alloc_q avoids nw stall for i40e in ppc64le. > Could you > please check if i40e is able to handle data at order of 10GiB in your > arch, as I see > that, network stalls at some random point w/o this patch. > > Thanks, > Gowrishankar > >>> With this patch, we could stop random network stall in KNI at higher packet >>> rate (eg 1G or 10G data between vEth0 and PMD) sufficiently exhausting >>> alloc_q on above condition. I tested i40e PMD for this purpose in ppc64le. >> If stall happens from NIC to kernel, this is kernel receive path, and >> alloc_q is in kernel transmit path. >> >>> Changes: >>> v2 - alloc_q free count calculation corrected. >>> line wrap fixed for commit message. >>> >>> Signed-off-by: Gowrishankar Muthukrishnan >>> <gowrishanka...@linux.vnet.ibm.com> >>> --- >>> lib/librte_kni/rte_kni.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c >>> index c3f9208..9c5d485 100644 >>> --- a/lib/librte_kni/rte_kni.c >>> +++ b/lib/librte_kni/rte_kni.c >>> @@ -624,6 +624,7 @@ struct rte_kni * >>> int i, ret; >>> struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM]; >>> void *phys[MAX_MBUF_BURST_NUM]; >>> + int allocq_free; >>> >>> RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pool) != >>> offsetof(struct rte_kni_mbuf, pool)); >>> @@ -646,7 +647,9 @@ struct rte_kni * >>> return; >>> } >>> >>> - for (i = 0; i < MAX_MBUF_BURST_NUM; i++) { >>> + allocq_free = (kni->alloc_q->read - kni->alloc_q->write - 1) \ >>> + & (MAX_MBUF_BURST_NUM - 1); >>> + for (i = 0; i < allocq_free; i++) { >>> pkts[i] = rte_pktmbuf_alloc(kni->pktmbuf_pool); >>> if (unlikely(pkts[i] == NULL)) { >>> /* Out of memory */ >>> > >