Hi, Pavan
> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
> Sent: January 5, 2021 17:29
> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry van Haaren <harry.van.haa...@intel.com>
> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang <ruifeng.w...@arm.com>; nd <n...@arm.com>
> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline test
>
> Hi Feifei,
>
> >Hi, Pavan
> >
> >Sorry for my late reply and thanks very much for your review.
> >
> >> -----Original Message-----
> >> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
> >> Sent: December 22, 2020 18:33
> >> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry van Haaren <harry.van.haa...@intel.com>; Pavan Nikhilesh <pbhagavat...@caviumnetworks.com>
> >> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Phil Yang <phil.y...@arm.com>
> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline test
> >>
> >> >Add release barriers before updating the processed packets for worker
> >> >lcores to ensure the worker lcore has really finished data processing
> >> >before it updates the processed packets number.
> >> >
> >>
> >> I believe we can live with minor inaccuracies in the stats being
> >> presented, as atomics are pretty heavy when the scheduler is limited to
> >> a burst size of 1.
> >>
> >> One option is to move it before a pipeline operation (pipeline_event_tx, pipeline_fwd_event etc.)
> >> as they imply an implicit release barrier (as all the changes done to
> >> the event should be visible to the next core).
>
> >If I understand correctly, your meaning is that we move the release
> >barriers before pipeline_event_tx or pipeline_fwd_event. This can ensure
> >the event has been processed before the next core begins to tx/fwd. For
> >example:
>
> What I meant was that event APIs such as `rte_event_enqueue_burst` and
> `rte_event_eth_tx_adapter_enqueue` act as an implicit release barrier, and
> the API `rte_event_dequeue_burst` acts as an implicit acquire barrier.
>
> Since the pipeline_* tests start with a dequeue() and end with an
> enqueue(), I don't believe we need barriers in between.

Sorry for my misunderstanding. And I agree with you that no barriers are
needed between dequeue and enqueue.

Now, let's go back to the beginning. Actually, with this patch, our barrier
is mainly for the synchronization variable "w->processed_pkts". The event is
first dequeued and then enqueued; only after this can the event be treated
as processed and included in the statistics ("w->processed_pkts++"). Thus,
we add a release barrier before "w->processed_pkts++" to prevent this
operation from being executed ahead of time, e.g.:

	dequeue -> w->processed_pkts++ -> enqueue

In that case the worker hasn't actually finished processing the event, but
the event is already treated as processed and included in the statistics.

______________________________________________________________________________

By the way, I have two other questions about the pipeline processing test in
"test_pipeline_queue".

1. When do we start counting processed events (w->processed_pkts)? For the
fwd mode (internal_port = false), when we choose a single stage, the
application increments the number of events processed after
"pipeline_event_enqueue". However, when we choose multiple stages, the
application increments the number of events processed before
"pipeline_event_enqueue".
So, maybe we can unify this, for example for the multiple-stage case:

	if (cq_id == last_queue) {
		ev.queue_id = tx_queue[ev.mbuf->port];
		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
+		pipeline_event_enqueue(dev, port, &ev);
		w->processed_pkts++;
	} else {
		ev.queue_id++;
		pipeline_fwd_event(&ev, sched_type_list[cq_id]);
+		pipeline_event_enqueue(dev, port, &ev);
	}
-	pipeline_event_enqueue(dev, port, &ev);

2. Is "pipeline_event_enqueue" needed after "pipeline_event_tx" for tx mode?
For the single_stage_burst_tx mode, after "pipeline_event_tx" the worker
enqueues again via "pipeline_event_enqueue_burst", so maybe we should jump
out of the loop after "pipeline_event_tx", for example:

	if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
		pipeline_event_tx(dev, port, &ev[i]);
		ev[i].op = RTE_EVENT_OP_RELEASE;
		w->processed_pkts++;
+		continue;
	} else {
		ev[i].queue_id++;
		pipeline_fwd_event(&ev[i], RTE_SCHED_TYPE_ATOMIC);
	}
}
pipeline_event_enqueue_burst(dev, port, ev, nb_rx);

> >
> >if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >+	__atomic_thread_fence(__ATOMIC_RELEASE);
> >	pipeline_event_tx(dev, port, &ev);
> >	w->processed_pkts++;
> >} else {
> >	ev.queue_id++;
> >+	__atomic_thread_fence(__ATOMIC_RELEASE);
> >	pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >	pipeline_event_enqueue(dev, port, &ev);
> >
> >However, there are two reasons to prevent this:
> >
> >First, compared with other tests in app/eventdev, for example the
> >eventdev perf test, the wmb is after the event operation, to ensure the
> >operation has finished before w->processed_pkts++.
>
> In the case of perf_* tests, starting with a dequeue() and finally ending
> with a mempool_put() should also act as an implicit acquire/release pair,
> making the stats consistent?

For the perf tests, this consistency refers to the wmb after mempool_put().
Please refer to this link: http://patches.dpdk.org/patch/85634/

> >So, if we move the release barriers before tx/fwd, it may cause the tests
> >of app/eventdev to become inconsistent. This may reduce the
> >maintainability of the code and make it difficult to understand.
> >
> >Second, it is a test case; though a heavy barrier may cause performance
> >degradation, it can ensure that the operation process and the test result
> >are correct. And maybe for a test case, correctness is more important
> >than performance.
> >
>
> Most of our internal perf tests run on 24/48 core combinations, and since
> the Octeontx2 event device driver supports a burst size of 1, it will show
> up as a huge performance degradation.

For the impact on performance, I did the test using the software driver; the
following are some test results:
------------------------------------------------------------------------------
Architecture: aarch64
NIC:          ixgbe-82599
CPU:          Cortex-A72
BURST_SIZE:   1
Command:      ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- \
              --test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a
Flow:         one flow, 64-byte packets, TX rate: 1.4 Mpps

Without this patch: 0.954 Mpps (avg 0.953 Mpps)
With this patch:    0.932 Mpps (avg 0.930 Mpps)
------------------------------------------------------------------------------

Based on the results above, there is no significant performance degradation
with this patch. This is because the release barrier is only for
"w->processed_pkts++". It just ensures that the worker core increments the
number of events processed after enqueue, and it doesn't affect
dequeue/enqueue:

	dequeue -> enqueue -> release barrier -> w->processed_pkts++

On the other hand, I infer the reason for the slight decrease in measured
performance is that the release barrier prevents "w->processed_pkts++" from
happening before the event has been processed (enqueued).
But I think this test result is closer to the real performance.

And sorry that we have no Octeontx2 device, so there is no test result on
the Octeontx2 event device driver. Could you please help us test this patch
on Octeontx2 when it is convenient for you? Thanks very much.

Best Regards
Feifei

> >
> >So, due to the two reasons above, I'm ambivalent about what we should do
> >in the next step.
> >
> >Best Regards
> >Feifei
>
> Regards,
> Pavan.
>
> >
> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker
> >> >functions")
> >> >Cc: pbhagavat...@marvell.com
> >> >Cc: sta...@dpdk.org
> >> >
> >> >Signed-off-by: Phil Yang <phil.y...@arm.com>
> >> >Signed-off-by: Feifei Wang <feifei.wa...@arm.com>
> >> >Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
> >> >---
> >> > app/test-eventdev/test_pipeline_queue.c | 64 +++++++++++++++++++++----
> >> > 1 file changed, 56 insertions(+), 8 deletions(-)
> >> >
> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-eventdev/test_pipeline_queue.c
> >> >index 7bebac34f..0c0ec0ceb 100644
> >> >--- a/app/test-eventdev/test_pipeline_queue.c
> >> >+++ b/app/test-eventdev/test_pipeline_queue.c
> >> >@@ -30,7 +30,13 @@ pipeline_queue_worker_single_stage_tx(void *arg)
> >> >
> >> > 		if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 		} else {
> >> > 			ev.queue_id++;
> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >@@ -59,7 +65,13 @@ pipeline_queue_worker_single_stage_fwd(void *arg)
> >> > 		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> > 		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> > 		pipeline_event_enqueue(dev, port, &ev);
> >> >-		w->processed_pkts++;
> >> >+
> >> >+		/* release barrier here ensures stored operation
> >> >+		 * of the event completes before the number of
> >> >+		 * processed pkts is visible to the main core
> >> >+		 */
> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+				__ATOMIC_RELEASE);
> >> > 	}
> >> >
> >> > 	return 0;
> >> >@@ -84,7 +96,13 @@ pipeline_queue_worker_single_stage_burst_tx(void *arg)
> >> > 			if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> > 				pipeline_event_tx(dev, port, &ev[i]);
> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 			} else {
> >> > 				ev[i].queue_id++;
> >> > 				pipeline_fwd_event(&ev[i],
> >> >@@ -121,7 +139,13 @@ pipeline_queue_worker_single_stage_burst_fwd(void *arg)
> >> > 		}
> >> >
> >> > 		pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
> >> >-		w->processed_pkts += nb_rx;
> >> >+
> >> >+		/* release barrier here ensures stored operation
> >> >+		 * of the event completes before the number of
> >> >+		 * processed pkts is visible to the main core
> >> >+		 */
> >> >+		__atomic_fetch_add(&(w->processed_pkts), nb_rx,
> >> >+				__ATOMIC_RELEASE);
> >> > 	}
> >> >
> >> > 	return 0;
> >> >@@ -146,7 +170,13 @@ pipeline_queue_worker_multi_stage_tx(void *arg)
> >> >
> >> > 		if (ev.queue_id == tx_queue[ev.mbuf->port]) {
> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 			continue;
> >> > 		}
> >> >
> >> >@@ -180,7 +210,13 @@ pipeline_queue_worker_multi_stage_fwd(void *arg)
> >> > 			ev.queue_id = tx_queue[ev.mbuf->port];
> >> > 			rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 		} else {
> >> > 			ev.queue_id++;
> >> > 			pipeline_fwd_event(&ev, sched_type_list[cq_id]);
> >> >@@ -214,7 +250,13 @@ pipeline_queue_worker_multi_stage_burst_tx(void *arg)
> >> > 			if (ev[i].queue_id == tx_queue[ev[i].mbuf->port]) {
> >> > 				pipeline_event_tx(dev, port, &ev[i]);
> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 				continue;
> >> > 			}
> >> >
> >> >@@ -254,7 +296,13 @@ pipeline_queue_worker_multi_stage_burst_fwd(void *arg)
> >> >
> >> > 				rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);
> >> > 				pipeline_fwd_event(&ev[i],
> >> > 						RTE_SCHED_TYPE_ATOMIC);
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 			} else {
> >> > 				ev[i].queue_id++;
> >> > 				pipeline_fwd_event(&ev[i],
> >> >--
> >> >2.17.1