Hi Feifei,

>Hi Feifei,
>
>>Hi, Pavan
>>
>>> -----Original Message-----
>>> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
>>> Sent: 8 January 2021 17:13
>>> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry
>>> van Haaren <harry.van.haa...@intel.com>
>>> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
>>> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang
>>> <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>
>>> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
>>> pipeline test
>>>
>>> Hi Feifei,
>>>
>>> >Hi, Pavan
>>> >
>>> >> -----Original Message-----
>>> >> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
>>> >> Sent: 5 January 2021 17:29
>>> >> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry
>>> >> van Haaren <harry.van.haa...@intel.com>
>>> >> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
>>> >> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang
>>> >> <ruifeng.w...@arm.com>; nd <n...@arm.com>
>>> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
>>> >> pipeline test
>>> >>
>>> >> Hi Feifei,
>>> >>
>>> >> >Hi, Pavan
>>> >> >
>>> >> >Sorry for my late reply and thanks very much for your review.
>>> >> >
>>> >> >> -----Original Message-----
>>> >> >> From: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
>>> >> >> Sent: 22 December 2020 18:33
>>> >> >> To: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com; Harry van
>>> >> >> Haaren <harry.van.haa...@intel.com>; Pavan Nikhilesh
>>> >> >> <pbhagavat...@caviumnetworks.com>
>>> >> >> Cc: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
>>> >> >> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Phil Yang
>>> >> >> <phil.y...@arm.com>
>>> >> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
>>> >> >> pipeline test
>>> >> >>
>>> >> >> >Add release barriers before updating the processed packets for worker
>>> >> >> >lcores to ensure the worker lcore has really finished data processing
>>> >> >> >and then it can update the processed packets number.
>>> >> >> >
>>> >> >>
>>> >> >> I believe we can live with minor inaccuracies in the stats being
>>> >> >> presented, as atomics are pretty heavy when the scheduler is limited to
>>> >> >> a burst size of 1.
>>> >> >>
>>> >> >> One option is to move it before a pipeline operation (pipeline_event_tx,
>>> >> >> pipeline_fwd_event etc.) as they imply an implicit release barrier (all
>>> >> >> the changes done to the event should be visible to the next core).
>>> >> >
>>> >> >If I understand correctly, your meaning is to move the release barriers
>>> >> >before pipeline_event_tx or pipeline_fwd_event. This can ensure the event
>>> >> >has been processed before the next core begins to tx/fwd. For example:
>>> >>
>>> >> What I meant was that event APIs such as `rte_event_enqueue_burst` and
>>> >> `rte_event_eth_tx_adapter_enqueue` act as an implicit release barrier and
>>> >> the API `rte_event_dequeue_burst` acts as an implicit acquire barrier.
>>> >>
>>> >> Since the pipeline_* tests start with a dequeue() and end with an
>>> >> enqueue(), I don't believe we need barriers in between.
>>> >
>>> >Sorry for my misunderstanding. And I agree with you that no barriers are
>>> >needed between dequeue and enqueue.
>>> >
>>> >Now, let's go back to the beginning. Actually, with this patch, our barrier
>>> >is mainly for the synchronization variable "w->processed_pkts". The event is
>>> >first dequeued and then enqueued; only after this can the event be treated
>>> >as a processed event and included in the statistics ("w->processed_pkts++").
>>> >
>>> >Thus, we add a release barrier before "w->processed_pkts++" to prevent this
>>> >operation from being executed ahead of time. For example:
>>> >dequeue -> w->processed_pkts++ -> enqueue
>>> >Here the worker has not actually finished processing the event, yet the
>>> >event is treated as processed and included in the statistics.
>>>
>>> But the current sequence is dequeue -> enqueue -> w->processed_pkts++,
>>> and enqueue already acts as an implicit release barrier, right?
>>
>>Sorry, maybe I cannot understand how "enqueue" acts as an implicit release
>>barrier. I can think of two possibilities:
>>1. As you said before, all the changes done to the event should be visible
>>to the next core, and enqueue is an operation on the event, so the next core
>>should wait for the event to be enqueued.
>>I think this is due to the data dependence on the same variable. However,
>>'w->processed_pkts' and 'ev' are different variables, so this cannot prevent
>>'w->processed_pkts++' from being reordered before enqueue.
>>And the main core may load an updated 'w->processed_pkts' while the event is
>>actually still being processed. For example:
>>
>>  Time Slot   Worker 1                 Main core
>>  1           dequeue
>>  2           w->processed_pkts++
>>  3                                    load w->processed_pkts
>>  4           enqueue
>>
>>2. Some release barriers are already included in enqueue. There is a release
>>barrier in rte_ring_enqueue:
>>move head -> copy elements to the ring -> release barrier -> update tail
>>-> w->processed_pkts++
>>However, this barrier cannot prevent 'w->processed_pkts++' from being
>>reordered before the tail update, and once the tail has been updated, the
>>enqueue can be seen as completed.
>
>I was talking about case 2 in particular; almost all enqueue calls have some
>kind of release barrier in place. I do agree w->processed_pkts++ might get
>reordered with the tail update, but since enqueue itself is a ldr + blr I was
>hoping that it wouldn't occur.
>
>We can continue the discussion once I have some performance data.
>
>Thanks for your patience :)
>Pavan.
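(For illustration only: a minimal sketch of the release/acquire pairing being
discussed. The worker-side increment matches the patch; the main-core acquire
load, the struct and the function names are assumptions made for the example,
not the actual test code.)

    #include <stdint.h>

    struct worker_stats {            /* illustrative, not the real struct */
        uint64_t processed_pkts;
    };

    /* Worker lcore: called after dequeue -> process -> enqueue.
     * The release ordering makes the stores done for the event visible
     * before the incremented counter can be observed by the main core.
     */
    static inline void
    worker_count_event(struct worker_stats *w)
    {
        __atomic_fetch_add(&w->processed_pkts, 1, __ATOMIC_RELEASE);
    }

    /* Main core: stats collection, pairing with the release above. */
    static inline uint64_t
    main_read_stats(const struct worker_stats *w)
    {
        return __atomic_load_n(&w->processed_pkts, __ATOMIC_ACQUIRE);
    }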
>
>>
>>> >
>>> >________________________________________
>>> >
>>> >By the way, I have two other questions about the pipeline process test in
>>> >"test_pipeline_queue".
>>> >
>>> >1. When do we start counting processed events (w->processed_pkts)?
>>> >For the fwd mode (internal_port = false), when we choose a single stage,
>>> >the application increments the number of events processed after
>>> >"pipeline_event_enqueue". However, when we choose multiple stages, the
>>> >application increments the number of events processed before
>>> >"pipeline_event_enqueue".
>>>
>>> We count an event as processed when all the stages have completed and it is
>>> transmitted.
>>>
>>> >So maybe we can unify this. For example, for the multi-stage case:
>>> >
>>> >                        if (cq_id == last_queue) {
>>> >                                ev.queue_id = tx_queue[ev.mbuf->port];
>>> >                                rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
>>> >                                pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>>> >+                               pipeline_event_enqueue(dev, port, &ev);
>>> >                                w->processed_pkts++;
>>> >                        } else {
>>> >                                ev.queue_id++;
>>> >                                pipeline_fwd_event(&ev, sched_type_list[cq_id]);
>>> >+                               pipeline_event_enqueue(dev, port, &ev);
>>> >                        }
>>> >
>>> >-                       pipeline_event_enqueue(dev, port, &ev);
>>>
>>> The above change makes sense.
>>
>>Thanks for your review, and I'll update this change in the next version.
>>
>>> >2. Is "pipeline_event_enqueue" needed after "pipeline_event_tx" for the tx
>>> >mode?
>>> >For the single_stage_burst_tx mode, after "pipeline_event_tx" the worker
>>> >still has to enqueue again via "pipeline_event_enqueue_burst", so maybe we
>>> >should jump out of the loop after "pipeline_event_tx",
>>>
>>> We call enqueue burst to release the events, i.e. enqueue events with
>>> RTE_EVENT_OP_RELEASE.
>>
>>However, in the case of a single event, for
>>'pipeline_queue_worker_single_stage_tx' and
>>'pipeline_queue_worker_multi_stage_tx', after tx there is no release
>>operation.
>>
I think this was done before the disable_implicit_release feature
(`RTE_EVENT_DEV_CAP_IMPLICIT_RELEASE_DISABLE`) was implemented. We can remove
the event_op_release section of the code and have a continue. (A small
illustrative sketch of checking that capability follows after the quoted
context below.)

>>> >for example:
>>> >
>>> >                        if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
>>> >                                pipeline_event_tx(dev, port, &ev[i]);
>>> >                                ev[i].op = RTE_EVENT_OP_RELEASE;
>>> >                                w->processed_pkts++;
>>> >+                               continue;
>>> >                        } else {
>>> >                                ev[i].queue_id++;
>>> >                                pipeline_fwd_event(&ev[i],
>>> >                                                RTE_SCHED_TYPE_ATOMIC);
>>> >                        }
>>> >                }
>>> >
>>> >                pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
>>> >
>>> >> >
>>> >> >if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
>>> >> >+        __atomic_thread_fence(__ATOMIC_RELEASE);
>>> >> >         pipeline_event_tx(dev, port, &ev);
>>> >> >         w->processed_pkts++;
>>> >> >} else {
>>> >> >         ev.queue_id++;
>>> >> >+        __atomic_thread_fence(__ATOMIC_RELEASE);
>>> >> >         pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>>> >> >         pipeline_event_enqueue(dev, port, &ev);
>>> >> >
>>> >> >However, there are two reasons to prevent this:
>>> >> >
>>> >> >First, compared with other tests in app/eventdev, for example the
>>> >> >eventdev perf test, the wmb is placed after the event operation to ensure
>>> >> >the operation has finished before w->processed_pkts++.
>>> >>
>>> >> The perf_* tests start with a dequeue() and finally end with a
>>> >> mempool_put(), which should also act as an implicit acquire/release pair,
>>> >> making the stats consistent?
>>> >
>>> >For the perf tests, this consistency refers to the wmb after mempool_put().
>>> >Please refer to this link:
>>> >http://patches.dpdk.org/patch/85634/
>>> >
>>> >> >So, if we move the release barriers before tx/fwd, it may cause the tests
>>> >> >in app/eventdev to become inconsistent. This may reduce the
>>> >> >maintainability of the code and make it difficult to understand.
>>> >> >
>>> >> >Second, it is a test case; though a heavy barrier may cause performance
>>> >> >degradation, it ensures that the operation sequence and the test result
>>> >> >are correct. And maybe for a test case, correctness is more important
>>> >> >than performance.
>>> >>
>>> >> Most of our internal perf tests run on 24/48 core combinations, and since
>>> >> the Octeontx2 event device driver supports a burst size of 1, it will show
>>> >> up as a huge performance degradation.
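(For illustration only, relating to the implicit-release point above: a minimal
sketch of how an application might check whether the event device supports
disabling implicit release, so a worker could simply 'continue' after Tx instead
of enqueuing an RTE_EVENT_OP_RELEASE event. The helper name is an assumption
made for the example.)

    #include <rte_eventdev.h>

    /* Returns non-zero if the event device reports the
     * RTE_EVENT_DEV_CAP_IMPLICIT_RELEASE_DISABLE capability, i.e.
     * implicit release can be disabled at port setup time.
     */
    static int
    implicit_release_can_be_disabled(uint8_t dev_id)
    {
        struct rte_event_dev_info dev_info;

        rte_event_dev_info_get(dev_id, &dev_info);
        return !!(dev_info.event_dev_cap &
                  RTE_EVENT_DEV_CAP_IMPLICIT_RELEASE_DISABLE);
    }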
>>> >
>>> >For the impact on performance, I did the test using the software event
>>> >driver; the following are some test results:
>>> >---------------------------------------------------------------------------
>>> >Architecture: aarch64
>>> >NICs: ixgbe-82599
>>> >CPU: Cortex-A72
>>> >BURST_SIZE: 1
>>> >Command: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- --test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a
>>> >Flow: one flow, 64-byte packets, TX rate: 1.4 Mpps
>>> >
>>> >Without this patch:
>>> >0.954 mpps avg 0.953 mpps
>>> >
>>> >With this patch:
>>> >0.932 mpps avg 0.930 mpps
>>> >---------------------------------------------------------------------------
>>> >
>>> >Based on the results above, there is no significant performance degradation
>>> >with this patch. This is because the release barrier is only for
>>> >"w->processed_pkts++". It just ensures that the worker core increments the
>>> >number of events processed after enqueue, and it doesn't affect
>>> >dequeue/enqueue:
>>> >
>>> >dequeue -> enqueue -> release barrier -> w->processed_pkts++
>>>
>>> Here enqueue already acts as an implicit release barrier.
>>
>>Please refer to the reasons above.
>>
>>> >
>>> >On the other hand, I infer that the reason for the slight decrease in the
>>> >measured performance is that the release barrier prevents
>>> >"w->processed_pkts++" from happening before the event has been processed
>>> >(enqueued). But I think this test result is closer to the real performance.
>>> >
>>> >And sorry, we have no Octeontx2 device, so there is no test result for the
>>> >Octeontx2 event device driver. Would you please help us test this patch on
>>> >Octeontx2 when it is convenient for you? Thanks very much.
>>>
>>> I will report the performance numbers on Monday.
>>
>>That's great, thanks very much for your help.

We are seeing a ~2% performance loss with the series applied on the pipeline
tests. I have taken the performance numbers on IXIA to eliminate any
discrepancies in stats.

Do you see any improvement in the stats being reported with and without the
barrier with a traffic generator vs. the eventdev application? Because I don't
see any difference when verifying with IXIA. Can we skip this patch?

Also, note that the barrier wouldn't help in cases where:
1. The SW eventdev Tx routine runs on a service core, which has its own latency
   in processing the packets.
2. The HW eventdev Tx routine has shaping and scheduling.

Regards,
Pavan.

>>
>>Best Regards
>>Feifei
>>
>>> >
>>> >Best Regards
>>> >Feifei
>>>
>>> Regards,
>>> Pavan.
>>>
>>> >> >
>>> >> >So, due to the two reasons above, I'm ambivalent about what we should do
>>> >> >in the next step.
>>> >> >
>>> >> >Best Regards
>>> >> >Feifei
>>> >>
>>> >> Regards,
>>> >> Pavan.
>>> >>
>>> >> >
>>> >> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker functions")
>>> >> >> >Cc: pbhagavat...@marvell.com
>>> >> >> >Cc: sta...@dpdk.org
>>> >> >> >
>>> >> >> >Signed-off-by: Phil Yang <phil.y...@arm.com>
>>> >> >> >Signed-off-by: Feifei Wang <feifei.wa...@arm.com>
>>> >> >> >Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com>
>>> >> >> >---
>>> >> >> > app/test-eventdev/test_pipeline_queue.c | 64 +++++++++++++++++++++----
>>> >> >> > 1 file changed, 56 insertions(+), 8 deletions(-)
>>> >> >> >
>>> >> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-eventdev/test_pipeline_queue.c
>>> >> >> >index 7bebac34f..0c0ec0ceb 100644
>>> >> >> >--- a/app/test-eventdev/test_pipeline_queue.c
>>> >> >> >+++ b/app/test-eventdev/test_pipeline_queue.c
>>> >> >> >@@ -30,7 +30,13 @@ pipeline_queue_worker_single_stage_tx(void *arg)
>>> >> >> >
>>> >> >> > 	if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
>>> >> >> > 		pipeline_event_tx(dev, port, &ev);
>>> >> >> >-		w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+		/* release barrier here ensures stored operation
>>> >> >> >+		 * of the event completes before the number of
>>> >> >> >+		 * processed pkts is visible to the main core
>>> >> >> >+		 */
>>> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+				__ATOMIC_RELEASE);
>>> >> >> > 	} else {
>>> >> >> > 		ev.queue_id++;
>>> >> >> > 		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>>> >> >> >@@ -59,7 +65,13 @@ pipeline_queue_worker_single_stage_fwd(void *arg)
>>> >> >> > 	rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
>>> >> >> > 	pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>>> >> >> > 	pipeline_event_enqueue(dev, port, &ev);
>>> >> >> >-	w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+	/* release barrier here ensures stored operation
>>> >> >> >+	 * of the event completes before the number of
>>> >> >> >+	 * processed pkts is visible to the main core
>>> >> >> >+	 */
>>> >> >> >+	__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+			__ATOMIC_RELEASE);
>>> >> >> > 	}
>>> >> >> >
>>> >> >> > 	return 0;
>>> >> >> >@@ -84,7 +96,13 @@ pipeline_queue_worker_single_stage_burst_tx(void *arg)
>>> >> >> > 			if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
>>> >> >> > 				pipeline_event_tx(dev, port, &ev[i]);
>>> >> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
>>> >> >> >-				w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+				/* release barrier here ensures stored operation
>>> >> >> >+				 * of the event completes before the number of
>>> >> >> >+				 * processed pkts is visible to the main core
>>> >> >> >+				 */
>>> >> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+						__ATOMIC_RELEASE);
>>> >> >> > 			} else {
>>> >> >> > 				ev[i].queue_id++;
>>> >> >> > 				pipeline_fwd_event(&ev[i],
>>> >> >> >@@ -121,7 +139,13 @@ pipeline_queue_worker_single_stage_burst_fwd(void *arg)
>>> >> >> > 		}
>>> >> >> >
>>> >> >> > 		pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
>>> >> >> >-		w->processed_pkts += nb_rx;
>>> >> >> >+
>>> >> >> >+		/* release barrier here ensures stored operation
>>> >> >> >+		 * of the event completes before the number of
>>> >> >> >+		 * processed pkts is visible to the main core
>>> >> >> >+		 */
>>> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), nb_rx,
>>> >> >> >+				__ATOMIC_RELEASE);
>>> >> >> > 	}
>>> >> >> >
>>> >> >> > 	return 0;
>>> >> >> >@@ -146,7 +170,13 @@ pipeline_queue_worker_multi_stage_tx(void *arg)
>>> >> >> >
>>> >> >> > 	if (ev.queue_id == tx_queue[ev.mbuf->port]) {
>>> >> >> > 		pipeline_event_tx(dev, port, &ev);
>>> >> >> >-		w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+		/* release barrier here ensures stored operation
>>> >> >> >+		 * of the event completes before the number of
>>> >> >> >+		 * processed pkts is visible to the main core
>>> >> >> >+		 */
>>> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+				__ATOMIC_RELEASE);
>>> >> >> > 		continue;
>>> >> >> > 	}
>>> >> >> >
>>> >> >> >@@ -180,7 +210,13 @@ pipeline_queue_worker_multi_stage_fwd(void *arg)
>>> >> >> > 		ev.queue_id = tx_queue[ev.mbuf->port];
>>> >> >> > 		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
>>> >> >> > 		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>>> >> >> >-		w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+		/* release barrier here ensures stored operation
>>> >> >> >+		 * of the event completes before the number of
>>> >> >> >+		 * processed pkts is visible to the main core
>>> >> >> >+		 */
>>> >> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+				__ATOMIC_RELEASE);
>>> >> >> > 	} else {
>>> >> >> > 		ev.queue_id++;
>>> >> >> > 		pipeline_fwd_event(&ev, sched_type_list[cq_id]);
>>> >> >> >@@ -214,7 +250,13 @@ pipeline_queue_worker_multi_stage_burst_tx(void *arg)
>>> >> >> > 			if (ev[i].queue_id == tx_queue[ev[i].mbuf->port]) {
>>> >> >> > 				pipeline_event_tx(dev, port, &ev[i]);
>>> >> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
>>> >> >> >-				w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+				/* release barrier here ensures stored operation
>>> >> >> >+				 * of the event completes before the number of
>>> >> >> >+				 * processed pkts is visible to the main core
>>> >> >> >+				 */
>>> >> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+						__ATOMIC_RELEASE);
>>> >> >> > 				continue;
>>> >> >> > 			}
>>> >> >> >
>>> >> >> >@@ -254,7 +296,13 @@ pipeline_queue_worker_multi_stage_burst_fwd(void *arg)
>>> >> >> >
>>> >> >> > 				rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);
>>> >> >> > 				pipeline_fwd_event(&ev[i],
>>> >> >> > 						RTE_SCHED_TYPE_ATOMIC);
>>> >> >> >-				w->processed_pkts++;
>>> >> >> >+
>>> >> >> >+				/* release barrier here ensures stored operation
>>> >> >> >+				 * of the event completes before the number of
>>> >> >> >+				 * processed pkts is visible to the main core
>>> >> >> >+				 */
>>> >> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
>>> >> >> >+						__ATOMIC_RELEASE);
>>> >> >> > 			} else {
>>> >> >> > 				ev[i].queue_id++;
>>> >> >> > 				pipeline_fwd_event(&ev[i],
>>> >> >> >--
>>> >> >> >2.17.1