Hi Feifei, >Hi, Pavan > > > >> -----邮件原件----- > >> 发件人: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com> > >> 发送时间: 2021年1月5日 17:29 > >> 收件人: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; >Harry > >> van Haaren <harry.van.haa...@intel.com> > >> 抄送: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli > >> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang > >> <ruifeng.w...@arm.com>; nd <n...@arm.com> > >> 主题: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for >pipeline > >> test > >> > >> Hi Feifei, > >> > >> >Hi, Pavan > >> > > >> >Sorry for my late reply and thanks very much for your review. > >> > > >> >> -----Original Message----- > >> >> From: Pavan Nikhilesh Bhagavatula ><pbhagavat...@marvell.com<mailto:pbhagavat...@marvell.com>> > >> >> Sent: 2020年12月22日 18:33 > >> >> To: Feifei Wang ><feifei.wa...@arm.com<mailto:feifei.wa...@arm.com>>; >jer...@marvell.com<mailto:jer...@marvell.com>; > >> >Harry van > >> >> Haaren ><harry.van.haa...@intel.com<mailto:harry.van.haa...@intel.com>>; >Pavan Nikhilesh > >> >> ><pbhagavat...@caviumnetworks.com<mailto:pbhagavatula@caviumn >etworks.com>> > >> >> Cc: dev@dpdk.org<mailto:dev@dpdk.org>; nd ><n...@arm.com<mailto:n...@arm.com>>; Honnappa Nagarahalli > >> >> ><honnappa.nagaraha...@arm.com<mailto:Honnappa.Nagarahalli@arm >.com>>; sta...@dpdk.org<mailto:sta...@dpdk.org>; Phil Yang > >> >> <phil.y...@arm.com<mailto:phil.y...@arm.com>> > >> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release >barriers > >> >for > >> >> pipeline test > >> >> > >> >> > >> >> >Add release barriers before updating the processed packets for > >> >worker > >> >> >lcores to ensure the worker lcore has really finished data > >> >> >processing and then it can update the processed packets number. > >> >> > > >> >> > >> >> I believe we can live with minor inaccuracies in stats being > >> >> presented > >> >as > >> >> atomics are pretty heavy when scheduler is limited to burst size as >1. > >> >> > >> >> One option is to move it before a pipeline operation > >> >(pipeline_event_tx, > >> >> pipeline_fwd_event etc.) as they imply implicit release barrier (as > >> >> all > >> >the > >> >> changes done to the event should be visible to the next core). > >> > > >> >If I understand correctly, your meaning is that move release barriers > >> >before pipeline_event_tx or pipeline_fwd_event. This can ensure the > >> >event has been processed before the next core begins to tx/fwd. For > >> >example: > >> > >> What I meant was event APIs such as `rte_event_enqueue_burst`, > >> `rte_event_eth_tx_adapter_enqueue` > >> act as an implicit release barrier and the API >`rte_event_dequeue_burst` act > >> as an implicit acquire barrier. > > > >> > >> Since, pipeline_* test starts with a dequeue() and ends with an >enqueue() I > >> don’t believe we need barriers in Between. > > > >Sorry for my misunderstanding this. And I agree with you that no >barriers are > >needed between dequeue and enqueue. > > > >Now, let's go back to the beginning. Actually with this patch, our barrier >is mainly > >for the synchronous variable " w->processed_pkts ". As we all know, the >event is firstly > >dequeued and then enqueued, after this, the event can be treated as >the processed event > >and included in the statistics("w->processed_pkts++"). > > > >Thus, we add a release barrier before " w->processed_pkts++" is to >prevent this operation > >being executed ahead of time. For example: > >dequeue -> w->processed_pkts++ -> enqueue > >This cause that the worker doesn't actually finish this event processing, >but the event is treated > >as the processed one and included in the statistics. >
But the current sequence is dequeue-> enqueue-> w->processed_pkts++ and enqueue already acts as an explicit release barrier right? > > >__________________________________________________________ >____________________ > > > >By the way, I have two other questions about pipeline process test in >"test_pipeline_queue". > >1. when do we start counting processed events (w->processed_pkts)? > >For the fwd mode (internal_port = false), when we choose single stage, >application increments > >the number events processed after "pipeline_event enqueue". >However, when we choose multiple > >stage, application increments the number events processed before >"pipleline_event_enqueue". We count an event as process when all the stages have completed and its Trasnmitted. >So, > >maybe we can unify this. For example of multiple stage: > > > > if (cq_id == last_queue) { > > ev.queue_id = > tx_queue[ev.mbuf->port]; > > > rte_event_eth_tx_adapter_txq_set(ev.mbuf, >0); > > pipeline_fwd_event(&ev, >RTE_SCHED_TYPE_ATOMIC); > > + pipeline_event_enqueue(dev, > port, &ev); > > w->processed_pkts++; > > } else { > > ev.queue_id++; > > pipeline_fwd_event(&ev, >sched_type_list[cq_id]); > > + pipeline_event_enqueue(dev, > port, &ev); > > } > > > > - pipeline_event_enqueue(dev, port, &ev); > > The above change makes sense. > >2. Whether "pipeline_event_enqueue" is needed after >"pipeline_event_tx" for tx mode? > >For single_stage_burst_tx mode, after "pipeline_event_tx", the worker >has to enqueue again > >due to "pipeline_event_enqueue_burst", so maybe we should jump out >of the loop after > >“pipeline_event_tx”, We call enqueue burst to release the events i.e. enqueue events with RTE_EVENT_OP_RELEASE. > for example: > > > > if (ev[i].sched_type == >RTE_SCHED_TYPE_ATOMIC) { > > > pipeline_event_tx(dev, port, &ev[i]); > > ev[i].op = > RTE_EVENT_OP_RELEASE; > > > w->processed_pkts++; > > + continue; > > } else { > > > ev[i].queue_id++; > > > pipeline_fwd_event(&ev[i], > > >RTE_SCHED_TYPE_ATOMIC); > > } > > } > > > > pipeline_event_enqueue_burst(dev, port, ev, > nb_rx); > > > > > >> > >> > > >> >if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) { > >> > + >__atomic_thread_fence(__ATOMIC_RELEASE); > >> > pipeline_event_tx(dev, port, &ev); > >> > w->processed_pkts++; > >> > } else { > >> > ev.queue_id++; > >> > + >__atomic_thread_fence(__ATOMIC_RELEASE); > >> > pipeline_fwd_event(&ev, > >> >RTE_SCHED_TYPE_ATOMIC); > >> > pipeline_event_enqueue(dev, port, >> > &ev); > >> > > >> >However, there are two reasons to prevent this: > >> > > >> >First, compare with other tests in app/eventdev, for example, the > >> >eventdev perf test, the wmb is after event operation to ensure > >> >operation has been finished and then w->processed_pkts++. > >> > >> In case of perf_* tests start with a dequeue() and finally ends with a > >> mempool_put() should also act as implicit acquire release pairs >making stats > >> consistent? > > > >For perf tests, this consistency refers to that there is a wmb after >mempool_put(). > >Please refer to this link: > >https://urldefense.proofpoint.com/v2/url?u=http- >3A__patches.dpdk.org_patch_85634_&d=DwIGaQ&c=nKjWec2b6R0mO >yPaz7xtfQ&r=1cjuAHrGh745jHNmj2fD85sUMIJ2IPIDsIJzo6FN6Z0&m=zg >QHeSDiXWfI1PIIUxXBqMS6E-2_3G46nhrzGXoBpHI&s=0FwTxPXjWflh- >sdmnkY133IPlJB780x0yxe7Am3JCBw&e= > > > >> > >> >So, if we move release barriers before tx/fwd, it may cause that the > >> >tests of app/eventdev become inconsistent.This may reduce the > >> >maintainability of the code and make it difficult to understand. > >> > > >> >Second, it is a test case, though heavy thread may cause >performance > >> >degradation, it can ensure that the operation process and the test > >> >result are correct. And maybe for a test case, correctness is more > >> >important than performance. > >> > > >> > >> Most of our internal perf test run on 24/48 core combinations and >since > >> Octeontx2 event device driver supports a burst size of 1, it will show >up as > >> Huge performance degradation. > > > >For the impact on performance, I do the test using software driver, >following are some test results: > >--------------------------------------------------------------------------------------- >--------------------------------------------- > >Architecture: aarch64 > >Nics: ixgbe-82599 > >CPU: Cortex-A72 > >BURST_SIZE: 1 > >Order: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- -- >test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a > >Flow: one flow, 64bits package, TX rate: 1.4Mpps > > > >Without this patch: > >0.954 mpps avg 0.953 mpps > > > >With this patch: > >0.932 mpps avg 0.930 mpps > >--------------------------------------------------------------------------------------- >--------------------------------------------- > > > >Based on the result above, there is no significant performance >degradation with this patch. > >This is because the release barrier is only for “w->processed_pkts++”. It >just ensures that the worker core > >increments the number events processed after enqueue, and it doesn’t >affect dequeue/enqueue: > > > >dequeue -> enqueue -> release barrier -> w->processed_pkts++ > Here enqueue already acts as an explicit release barrier. > > >On the other hand, I infer the reason for the slight decrease in >measurement performance is that the release barrier > >prevent “w->processed_pkts++” before that the event has been >processed (enqueue). But I think this test result is closer > >to the real performance. > >And sorry for that we have no octentx2 device, so there is no test result >on Octeontx2 event device driver. Would you please > >help us test this patch on octentx2 when you are convenient. Thanks >very much. > I will report the performance numbers on Monday. > > >Best Regards > >Feifei Regards, Pavan. > > > >> > >> >So, due to two reasons above, I'm ambivalent about how we should >do in > >> >the next step. > >> > > >> >Best Regards > >> >Feifei > >> > >> Regards, > >> Pavan. > >> > >> > > >> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker > >> >> >functions") > >> >> >Cc: >pbhagavat...@marvell.com<mailto:pbhagavat...@marvell.com> > >> >> >Cc: sta...@dpdk.org<mailto:sta...@dpdk.org> > >> >> > > >> >> >Signed-off-by: Phil Yang ><phil.y...@arm.com<mailto:phil.y...@arm.com>> > >> >> >Signed-off-by: Feifei Wang ><feifei.wa...@arm.com<mailto:feifei.wa...@arm.com>> > >> >> >Reviewed-by: Ruifeng Wang ><ruifeng.w...@arm.com<mailto:ruifeng.w...@arm.com>> > >> >> >--- > >> >> > app/test-eventdev/test_pipeline_queue.c | 64 > >> >> >+++++++++++++++++++++---- > >> >> > 1 file changed, 56 insertions(+), 8 deletions(-) > >> >> > > >> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test- > >> >> >eventdev/test_pipeline_queue.c index 7bebac34f..0c0ec0ceb > >> >100644 > >> >> >--- a/app/test-eventdev/test_pipeline_queue.c > >> >> >+++ b/app/test-eventdev/test_pipeline_queue.c > >> >> >@@ -30,7 +30,13 @@ >pipeline_queue_worker_single_stage_tx(void > >> >> >*arg) > >> >> > > >> >> > if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) { > >> >> > pipeline_event_tx(dev, port, &ev); > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier here ensures stored >operation > >> >> >+ * of the event completes before the >> >> >number >of > >> >> >+ * processed pkts is visible to the >> >> >main core > >> >> >+ */ > >> >> >+ >> >> >__atomic_fetch_add(&(w->processed_pkts), >1, > >> >> >+ >> >> >__ATOMIC_RELEASE); > >> >> > } else { > >> >> > ev.queue_id++; > >> >> > pipeline_fwd_event(&ev, > >> >> >RTE_SCHED_TYPE_ATOMIC); > >> >> >@@ -59,7 +65,13 @@ > >> >pipeline_queue_worker_single_stage_fwd(void > >> >> >*arg) > >> >> > rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0); > >> >> > pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC); > >> >> > pipeline_event_enqueue(dev, port, &ev); > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier here ensures stored operation > >> >> >+ * of the event completes before the number of > >> >> >+ * processed pkts is visible to the main core > >> >> >+ */ > >> >> >+ __atomic_fetch_add(&(w->processed_pkts), 1, > >> >> >+ __ATOMIC_RELEASE); > >> >> > } > >> >> > > >> >> > return 0; > >> >> >@@ -84,7 +96,13 @@ > >> >> >pipeline_queue_worker_single_stage_burst_tx(void *arg) > >> >> > if (ev[i].sched_type == > >> >> >RTE_SCHED_TYPE_ATOMIC) { > >> >> > >> >> > pipeline_event_tx(dev, port, &ev[i]); > >> >> > ev[i].op = >> >> > RTE_EVENT_OP_RELEASE; > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier >> >> >here ensures stored > >> >> >operation > >> >> >+ * of the event >> >> >completes before the > >> >> >number of > >> >> >+ * processed pkts is >> >> >visible to the main > >> >> >core > >> >> >+ */ > >> >> >+ __atomic_fetch_add(&(w- > >> >> >>processed_pkts), 1, > >> >> >+ >__ATOMIC_RELEASE); > >> >> > } else { > >> >> > ev[i].queue_id++; > >> >> > >> >> > pipeline_fwd_event(&ev[i], > >> >> >@@ -121,7 +139,13 @@ > >> >> >pipeline_queue_worker_single_stage_burst_fwd(void *arg) > >> >> > } > >> >> > > >> >> > pipeline_event_enqueue_burst(dev, port, ev, nb_rx); > >> >> >- w->processed_pkts += nb_rx; > >> >> >+ > >> >> >+ /* release barrier here ensures stored operation > >> >> >+ * of the event completes before the number of > >> >> >+ * processed pkts is visible to the main core > >> >> >+ */ > >> >> >+ __atomic_fetch_add(&(w->processed_pkts), nb_rx, > >> >> >+ __ATOMIC_RELEASE); > >> >> > } > >> >> > > >> >> > return 0; > >> >> >@@ -146,7 +170,13 @@ > >> >pipeline_queue_worker_multi_stage_tx(void > >> >> >*arg) > >> >> > > >> >> > if (ev.queue_id == tx_queue[ev.mbuf->port]) { > >> >> > pipeline_event_tx(dev, port, &ev); > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier here ensures stored >operation > >> >> >+ * of the event completes before the >> >> >number >of > >> >> >+ * processed pkts is visible to the >> >> >main core > >> >> >+ */ > >> >> >+ >> >> >__atomic_fetch_add(&(w->processed_pkts), >1, > >> >> >+ >> >> >__ATOMIC_RELEASE); > >> >> > continue; > >> >> > } > >> >> > > >> >> >@@ -180,7 +210,13 @@ > >> >> >pipeline_queue_worker_multi_stage_fwd(void *arg) > >> >> > ev.queue_id = >> >> > tx_queue[ev.mbuf->port]; > >> >> > >> >> > rte_event_eth_tx_adapter_txq_set(ev.mbuf, >0); > >> >> > pipeline_fwd_event(&ev, > >> >> >RTE_SCHED_TYPE_ATOMIC); > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier here ensures stored >operation > >> >> >+ * of the event completes before the >> >> >number >of > >> >> >+ * processed pkts is visible to the >> >> >main core > >> >> >+ */ > >> >> >+ >> >> >__atomic_fetch_add(&(w->processed_pkts), >1, > >> >> >+ >> >> >__ATOMIC_RELEASE); > >> >> > } else { > >> >> > ev.queue_id++; > >> >> > pipeline_fwd_event(&ev, > >> >> >sched_type_list[cq_id]); > >> >> >@@ -214,7 +250,13 @@ > >> >> >pipeline_queue_worker_multi_stage_burst_tx(void *arg) > >> >> > if (ev[i].queue_id == >> >> > tx_queue[ev[i].mbuf- > >> >> >>port]) { > >> >> > >> >> > pipeline_event_tx(dev, port, &ev[i]); > >> >> > ev[i].op = >> >> > RTE_EVENT_OP_RELEASE; > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier >> >> >here ensures stored > >> >> >operation > >> >> >+ * of the event >> >> >completes before the > >> >> >number of > >> >> >+ * processed pkts is >> >> >visible to the main > >> >> >core > >> >> >+ */ > >> >> >+ __atomic_fetch_add(&(w- > >> >> >>processed_pkts), 1, > >> >> >+ >__ATOMIC_RELEASE); > >> >> > continue; > >> >> > } > >> >> > > >> >> >@@ -254,7 +296,13 @@ > >> >> >pipeline_queue_worker_multi_stage_burst_fwd(void *arg) > >> >> > > >> >> > rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0); > >> >> > >> >> > pipeline_fwd_event(&ev[i], > >> >> > > >> >> > RTE_SCHED_TYPE_ATOMIC); > >> >> >- w->processed_pkts++; > >> >> >+ > >> >> >+ /* release barrier >> >> >here ensures stored > >> >> >operation > >> >> >+ * of the event >> >> >completes before the > >> >> >number of > >> >> >+ * processed pkts is >> >> >visible to the main > >> >> >core > >> >> >+ */ > >> >> >+ __atomic_fetch_add(&(w- > >> >> >>processed_pkts), 1, > >> >> >+ >__ATOMIC_RELEASE); > >> >> > } else { > >> >> > ev[i].queue_id++; > >> >> > >> >> > pipeline_fwd_event(&ev[i], > >> >> >-- > >> >> >2.17.1 >