> From: Pravin Shelar [mailto:pshe...@nicira.com]
>
> On Wed, Jun 17, 2015 at 3:17 AM, Gray, Mark D <mark.d.g...@intel.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Daniele Di Proietto [mailto:diproiet...@vmware.com]
> >> Sent: Tuesday, June 16, 2015 6:45 PM
> >> To: Pravin Shelar; Wei li; Gray, Mark D
> >> Cc: d...@openvswitch.com
> >> Subject: Re: [ovs-dev] [PATCH v2] Do not flush tx queue which is
> >> shared among CPUs since it is always flushed
> >>
> >>
> >> On 16/06/2015 07:40, "Pravin Shelar" <pshe...@nicira.com> wrote:
> >>
> >> >On Mon, Jun 8, 2015 at 7:42 PM, Pravin Shelar <pshe...@nicira.com> wrote:
> >> >> On Mon, Jun 8, 2015 at 6:13 PM, Wei li <l...@dtdream.com> wrote:
> >> >>> When a tx queue is shared among CPUs, the packets are always
> >> >>> flushed in 'netdev_dpdk_eth_send', so it is unnecessary to flush
> >> >>> in netdev_dpdk_rxq_recv. Otherwise the tx queue would be accessed
> >> >>> without locking.
> >> >>>
> >> >>> Signed-off-by: Wei li <l...@dtdream.com>
> >> >>> ---
> >> >>> v1->v2: fix typo
> >> >>>
> >> >>>  lib/netdev-dpdk.c | 7 +++++--
> >> >>>  1 file changed, 5 insertions(+), 2 deletions(-)
> >> >>>
> >> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> >>> index 63243d8..1e0a483 100644
> >> >>> --- a/lib/netdev-dpdk.c
> >> >>> +++ b/lib/netdev-dpdk.c
> >> >>> @@ -892,8 +892,11 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
> >> >>>      int nb_rx;
> >> >>>
> >> >>>      /* There is only one tx queue for this core. Do not flush other
> >> >>> -     * queueus. */
> >> >>> -    if (rxq_->queue_id == rte_lcore_id()) {
> >> >>> +     * queues.
> >> >>> +     * Do not flush tx queue which is shared among CPUs
> >> >>> +     * since it is always flushed */
> >> >>> +    if (rxq_->queue_id == rte_lcore_id() &&
> >> >>> +        OVS_LIKELY(!dev->txq_needs_locking)) {
> >> >>>          dpdk_queue_flush(dev, rxq_->queue_id);
> >> >>>      }
> >> >>
> >> >> Patch looks good, but Daniele has a plan to get rid of the queue
> >> >> flushing logic. Do you see a problem with doing this?
> >> >
> >> >Daniele,
> >> >I am not sure we should fix this queue flushing logic if we are
> >> >going to remove it, so I would like to discuss it first.
> >> >You mentioned that removing the flushing logic results in a
> >> >performance drop. Can you explain that? How big is the drop, and in
> >> >which case does it occur?
> >>
> >> Hi Pravin,
> >>
> >> sorry for the late reply. I suspected that removing the queues in
> >> netdev-dpdk was going to have a performance impact in the following
> >> scenario: a batch of packets from different megaflows with the same
> >> action (output on port 1).
> >> Here's what happens:
> >>
> >> 1) With the queues in netdev-dpdk. dpif-netdev calls netdev_send() for
> >>    each packet. netdev_dpdk_send() just copies the pointer into the
> >>    queue. Before receiving the next batch, dpdk_queue_flush() calls
> >>    rte_eth_tx_burst() with a whole batch of packets.
> >> 2) Without the queues in netdev-dpdk. dpif-netdev calls netdev_send()
> >>    for each packet. netdev_dpdk_send() calls rte_eth_tx_burst() for
> >>    each packet.
> >>
> >> Scenario 2 could be slower because rte_eth_tx_burst() is called for
> >> each packet (instead of once per batch). After some testing, this
> >> turns out not to be the case.
> >> It appears that the bottleneck is not rte_eth_tx_burst(), and copying
> >> the pointers into the temporary queue makes the code slower.
> >>
> >> What do you think? Should we just remove the queues?
> >> CC'ing Mark since he expressed interest in the issue
> >
> > Every call to rte_eth_tx_burst() will generate an expensive MMIO
> > write. Best practice would be to burst as many packets as you can to
> > amortize the cost of the MMIO write. I am surprised at your findings
> > for (2) above. Maybe in this case the cost of queuing is greater than
> > the MMIO write?
>
> Hi Mark,
> Can you explain your test case where you do see a performance drop if
> there is no queuing in netdev-dpdk?
Hi Pravin,

I have seen this before in work we did on OVDK. However, to help justify
the assertion here, I have tried to reproduce it with DPDK and OVS.

I was able to reproduce it with the DPDK l2fwd example app using the patch
below. I also had to disable the vector PMD so that I could receive mbufs
one at a time (CONFIG_RTE_IXGBE_INC_VECTOR=n in the DPDK .config file):

--- a/examples/l2fwd/main.c
+++ b/examples/l2fwd/main.c
@@ -222,11 +222,17 @@ l2fwd_send_packet(struct rte_mbuf *m, uint8_t port)
     qconf->tx_mbufs[port].m_table[len] = m;
     len++;

+#define ENABLE_QUEUES
+#ifdef ENABLE_QUEUES
     /* enough pkts to be sent */
     if (unlikely(len == MAX_PKT_BURST)) {
         l2fwd_send_burst(qconf, MAX_PKT_BURST, port);
         len = 0;
     }
+#else
+    l2fwd_send_burst(qconf, 1, port);
+    len = 0;
+#endif

     qconf->tx_mbufs[port].len = len;
     return 0;
@@ -294,6 +300,7 @@ l2fwd_main_loop(void)

         diff_tsc = cur_tsc - prev_tsc;
         if (unlikely(diff_tsc > drain_tsc)) {
+#ifdef ENABLE_QUEUES
             for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
                 if (qconf->tx_mbufs[portid].len == 0)
                     continue;
@@ -302,6 +309,7 @@ l2fwd_main_loop(void)
                          (uint8_t) portid);
                 qconf->tx_mbufs[portid].len = 0;
             }
+#endif

             /* if timer is enabled */
             if (timer_period > 0) {
@@ -331,7 +339,7 @@ l2fwd_main_loop(void)
             portid = qconf->rx_port_list[i];
             nb_rx = rte_eth_rx_burst((uint8_t) portid, 0,
-                                     pkts_burst, MAX_PKT_BURST);
+                                     pkts_burst, 1);

             port_statistics[portid].rx += nb_rx;

With these results on my setup:

    ENABLE_QUEUES defined:      25 Mpps
    ENABLE_QUEUES not defined:  11.7 Mpps

I tried something similar in OVS but didn't see a change in performance.
This may be because there is a bottleneck somewhere else that is masking
the cost of the MMIO writes:

--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -62,6 +62,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"

+//#define ENABLE_QUEUES
+
 /*
  * need to reserve tons of extra space in the mbufs so we can align the
  * DMA addresses to 4KB.
@@ -927,15 +929,17 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     int nb_rx;

+#ifdef ENABLE_QUEUES
     /* There is only one tx queue for this core. Do not flush other
      * queueus. */
     if (rxq_->queue_id == rte_lcore_id()) {
         dpdk_queue_flush(dev, rxq_->queue_id);
     }
+#endif

     nb_rx = rte_eth_rx_burst(rx->port_id, rxq_->queue_id,
                              (struct rte_mbuf **) packets,
-                             NETDEV_MAX_BURST);
+                             1);
     if (!nb_rx) {
         return EAGAIN;
     }
@@ -1034,7 +1038,7 @@ dpdk_queue_pkts(struct netdev_dpdk *dev, int qid,
         txq->count += tocopy;
         i += tocopy;
-
+#ifdef ENABLE_QUEUES
         if (txq->count == MAX_TX_QUEUE_LEN || txq->flush_tx) {
             dpdk_queue_flush__(dev, qid);
         }
@@ -1042,6 +1046,9 @@
         if (diff_tsc >= DRAIN_TSC) {
             dpdk_queue_flush__(dev, qid);
         }
+#else
+        dpdk_queue_flush__(dev, qid);
+#endif
     }
 }

With these results on my setup:

    ENABLE_QUEUES defined:      3.3 Mpps
    ENABLE_QUEUES not defined:  3.3 Mpps

If we were to change the flushing design to remove these queues, we may not
see a performance drop now, but we may be inadvertently introducing another
bottleneck. MMIO writes are expensive as they are ordered, usually not
write-combining, and have to traverse PCI.

>
> Thanks.
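To make the trade-off concrete, below is a minimal sketch in C of the two
send paths the thread compares, written against the rte_eth_tx_burst() and
rte_pktmbuf_free() API of that DPDK era. It is not the actual netdev-dpdk or
l2fwd code: local_txq, MAX_BURST, queued_send() and unqueued_send() are
illustrative names, and a real implementation would also need the
drain-timer flush for partially filled queues that both patches above show.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define MAX_BURST 32   /* illustrative; the real code uses its own constant */

    struct local_txq {
        struct rte_mbuf *pkts[MAX_BURST];
        uint16_t count;
    };

    /* Scenario 1: copy the mbuf pointer into an intermediate queue and
     * ring the NIC doorbell once per batch, so one MMIO write is
     * amortized over up to MAX_BURST packets. */
    static void
    queued_send(struct local_txq *txq, uint8_t port, uint16_t qid,
                struct rte_mbuf *m)
    {
        txq->pkts[txq->count++] = m;
        if (txq->count == MAX_BURST) {
            uint16_t sent = rte_eth_tx_burst(port, qid, txq->pkts, txq->count);
            /* Drop whatever the NIC did not accept. */
            while (sent < txq->count) {
                rte_pktmbuf_free(txq->pkts[sent++]);
            }
            txq->count = 0;
        }
    }

    /* Scenario 2: no intermediate queue; every packet pays for its own
     * doorbell write. */
    static void
    unqueued_send(uint8_t port, uint16_t qid, struct rte_mbuf *m)
    {
        if (rte_eth_tx_burst(port, qid, &m, 1) == 0) {
            rte_pktmbuf_free(m);
        }
    }

The queued path rings the NIC doorbell (the MMIO write inside
rte_eth_tx_burst()) once per batch rather than once per packet, which is
the cost difference the 25 Mpps vs. 11.7 Mpps l2fwd numbers above expose;
the open question in the thread is whether the pointer copies into the
intermediate queue cost more than the doorbell writes they save.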