eventdev: add release barriers for pipeline test

Pavan Nikhilesh Bhagavatula Fri, 08 Jan 2021 01:13:09 -0800

Hi Feifei,

>Hi, Pavan
>
>
>
>> -----邮件原件-----
>
>> 发件人: Pavan Nikhilesh Bhagavatula <pbhagavat...@marvell.com>
>
>> 发送时间: 2021年1月5日 17:29
>
>> 收件人: Feifei Wang <feifei.wa...@arm.com>; jer...@marvell.com;
>Harry
>
>> van Haaren <harry.van.haa...@intel.com>
>
>> 抄送: dev@dpdk.org; nd <n...@arm.com>; Honnappa Nagarahalli
>
>> <honnappa.nagaraha...@arm.com>; sta...@dpdk.org; Ruifeng Wang
>
>> <ruifeng.w...@arm.com>; nd <n...@arm.com>
>
>> 主题: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for
>pipeline
>
>> test
>
>>
>
>> Hi Feifei,
>
>>
>
>> >Hi, Pavan
>
>> >
>
>> >Sorry for my late reply and thanks very much for your review.
>
>> >
>
>> >> -----Original Message-----
>
>> >> From: Pavan Nikhilesh Bhagavatula
><pbhagavat...@marvell.com<mailto:pbhagavat...@marvell.com>>
>
>> >> Sent: 2020年12月22日 18:33
>
>> >> To: Feifei Wang
><feifei.wa...@arm.com<mailto:feifei.wa...@arm.com>>;
>jer...@marvell.com<mailto:jer...@marvell.com>;
>
>> >Harry van
>
>> >> Haaren
><harry.van.haa...@intel.com<mailto:harry.van.haa...@intel.com>>;
>Pavan Nikhilesh
>
>> >>
><pbhagavat...@caviumnetworks.com<mailto:pbhagavatula@caviumn
>etworks.com>>
>
>> >> Cc: dev@dpdk.org<mailto:dev@dpdk.org>; nd
><n...@arm.com<mailto:n...@arm.com>>; Honnappa Nagarahalli
>
>> >>
><honnappa.nagaraha...@arm.com<mailto:Honnappa.Nagarahalli@arm
>.com>>; sta...@dpdk.org<mailto:sta...@dpdk.org>; Phil Yang
>
>> >> <phil.y...@arm.com<mailto:phil.y...@arm.com>>
>
>> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release
>barriers
>
>> >for
>
>> >> pipeline test
>
>> >>
>
>> >>
>
>> >> >Add release barriers before updating the processed packets for
>
>> >worker
>
>> >> >lcores to ensure the worker lcore has really finished data
>
>> >> >processing and then it can update the processed packets number.
>
>> >> >
>
>> >>
>
>> >> I believe we can live with minor inaccuracies in stats being
>
>> >> presented
>
>> >as
>
>> >> atomics are pretty heavy when scheduler is limited to burst size as
>1.
>
>> >>
>
>> >> One option is to move it before a pipeline operation
>
>> >(pipeline_event_tx,
>
>> >> pipeline_fwd_event etc.) as they imply implicit release barrier (as
>
>> >> all
>
>> >the
>
>> >> changes done to the event should be visible to the next core).
>
>> >
>
>> >If I understand correctly, your meaning is that move release barriers
>
>> >before pipeline_event_tx or pipeline_fwd_event. This can ensure the
>
>> >event has been processed before the next core begins to tx/fwd. For
>
>> >example:
>
>>
>
>> What I meant was event APIs such as `rte_event_enqueue_burst`,
>
>> `rte_event_eth_tx_adapter_enqueue`
>
>> act as an implicit release barrier and the API
>`rte_event_dequeue_burst` act
>
>> as an implicit acquire barrier.
>
>
>
>>
>
>> Since, pipeline_* test starts with a dequeue() and ends with an
>enqueue() I
>
>> don’t believe we need barriers in Between.
>
>
>
>Sorry for my misunderstanding this. And I agree with you that no
>barriers are
>
>needed between dequeue and enqueue.
>
>
>
>Now, let's go back to the beginning. Actually with this patch, our barrier
>is mainly
>
>for the synchronous variable " w->processed_pkts ". As we all know, the
>event is firstly
>
>dequeued and then enqueued, after this, the event can be treated as
>the processed event
>
>and included in the statistics("w->processed_pkts++").
>
>
>
>Thus, we add a release barrier before " w->processed_pkts++" is to
>prevent this operation
>
>being executed ahead of time. For example:
>
>dequeue  ->  w->processed_pkts++  ->  enqueue
>
>This cause that the worker doesn't actually finish this event processing,
>but the event is treated
>
>as the processed one and included in the statistics.
>


But the current sequence is dequeue-> enqueue-> w->processed_pkts++ and
enqueue already acts as an explicit release barrier right?

>
>
>__________________________________________________________
>____________________
>
>
>
>By the way, I have two other questions about pipeline process test in
>"test_pipeline_queue".
>
>1. when do we start counting processed events (w->processed_pkts)?
>
>For the fwd mode (internal_port = false), when we choose single stage,
>application increments
>
>the number events processed after "pipeline_event enqueue".
>However, when we choose multiple
>
>stage, application increments the number events processed before
>"pipleline_event_enqueue". 

We count an event as process when all the stages have completed and its 
Trasnmitted.

>So,
>
>maybe we can unify this. For example of multiple stage:
>
>
>
>                                if (cq_id == last_queue) {
>
>                                                ev.queue_id = 
> tx_queue[ev.mbuf->port];
>
>                                                
> rte_event_eth_tx_adapter_txq_set(ev.mbuf,
>0);
>
>                                                pipeline_fwd_event(&ev,
>RTE_SCHED_TYPE_ATOMIC);
>
>                                +             pipeline_event_enqueue(dev, 
> port, &ev);
>
>                                                w->processed_pkts++;
>
>                                } else {
>
>                                                ev.queue_id++;
>
>                                                pipeline_fwd_event(&ev,
>sched_type_list[cq_id]);
>
>                                +             pipeline_event_enqueue(dev, 
> port, &ev);
>
>                                }
>
>
>
>                -              pipeline_event_enqueue(dev, port, &ev);
>
>

The above change makes sense. 

>
>2. Whether  "pipeline_event_enqueue" is needed after
>"pipeline_event_tx" for tx mode?
>
>For single_stage_burst_tx mode, after "pipeline_event_tx", the worker
>has to enqueue again
>
>due to  "pipeline_event_enqueue_burst", so maybe we should jump out
>of the loop after
>
>“pipeline_event_tx”,

We call enqueue burst to release the events i.e. enqueue events with 
RTE_EVENT_OP_RELEASE.

>
 for example:
>
>
>
>                                                if (ev[i].sched_type ==
>RTE_SCHED_TYPE_ATOMIC) {
>
>                                                                
> pipeline_event_tx(dev, port, &ev[i]);
>
>                                                                ev[i].op = 
> RTE_EVENT_OP_RELEASE;
>
>                                                                
> w->processed_pkts++;
>
>                                                +             continue;
>
>                                                } else {
>
>                                                                
> ev[i].queue_id++;
>
>                                                                
> pipeline_fwd_event(&ev[i],
>
>
>RTE_SCHED_TYPE_ATOMIC);
>
>                                                }
>
>                                }
>
>
>
>                                pipeline_event_enqueue_burst(dev, port, ev, 
> nb_rx);
>
>
>
>
>
>>
>
>> >
>
>> >if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
>
>> >                          +
>__atomic_thread_fence(__ATOMIC_RELEASE);
>
>> >                                          pipeline_event_tx(dev, port, &ev);
>
>> >                                          w->processed_pkts++;
>
>> >                          } else {
>
>> >                                          ev.queue_id++;
>
>> >                          +
>__atomic_thread_fence(__ATOMIC_RELEASE);
>
>> >                                          pipeline_fwd_event(&ev,
>
>> >RTE_SCHED_TYPE_ATOMIC);
>
>> >                                          pipeline_event_enqueue(dev, port, 
>> > &ev);
>
>> >
>
>> >However, there are two reasons to prevent this:
>
>> >
>
>> >First, compare with other tests in app/eventdev, for example, the
>
>> >eventdev perf test, the wmb is after event operation to ensure
>
>> >operation has been finished and then w->processed_pkts++.
>
>>
>
>> In case of perf_* tests start with a dequeue() and finally ends with a
>
>> mempool_put() should also act as implicit acquire release pairs
>making stats
>
>> consistent?
>
>
>
>For perf tests, this consistency refers to that there is a wmb after
>mempool_put().
>
>Please refer to this link:
>
>https://urldefense.proofpoint.com/v2/url?u=http-
>3A__patches.dpdk.org_patch_85634_&d=DwIGaQ&c=nKjWec2b6R0mO
>yPaz7xtfQ&r=1cjuAHrGh745jHNmj2fD85sUMIJ2IPIDsIJzo6FN6Z0&m=zg
>QHeSDiXWfI1PIIUxXBqMS6E-2_3G46nhrzGXoBpHI&s=0FwTxPXjWflh-
>sdmnkY133IPlJB780x0yxe7Am3JCBw&e=
>
>
>
>>
>
>> >So, if we move release barriers before tx/fwd, it may cause that the
>
>> >tests of app/eventdev become  inconsistent.This may reduce the
>
>> >maintainability of the code and make it difficult to understand.
>
>> >
>
>> >Second, it is a test case, though heavy thread may cause
>performance
>
>> >degradation, it can ensure that the operation process and the test
>
>> >result are correct. And maybe for a test case, correctness is more
>
>> >important than performance.
>
>> >
>
>>
>
>> Most of our internal perf test run on 24/48 core combinations and
>since
>
>> Octeontx2 event device driver supports a burst size of 1, it will show
>up as
>
>> Huge performance degradation.
>
>
>
>For the impact on performance, I do the test using software driver,
>following are some test results:
>
>---------------------------------------------------------------------------------------
>---------------------------------------------
>
>Architecture: aarch64
>
>Nics: ixgbe-82599
>
>CPU: Cortex-A72
>
>BURST_SIZE: 1
>
>Order: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- --
>test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a
>
>Flow: one flow, 64bits package, TX rate: 1.4Mpps
>
>
>
>Without this patch:
>
>0.954 mpps avg 0.953 mpps
>
>
>
>With this patch:
>
>0.932 mpps avg 0.930 mpps
>
>---------------------------------------------------------------------------------------
>---------------------------------------------
>
>
>
>Based on the result above, there is no significant performance
>degradation with this patch.
>
>This is because the release barrier is only for  “w->processed_pkts++”. It
>just ensures that the worker core
>
>increments the number events processed after enqueue, and it doesn’t
>affect dequeue/enqueue:
>
>
>
>dequeue -> enqueue -> release barrier -> w->processed_pkts++
>

Here enqueue already acts as an explicit release barrier.

>
>
>On the other hand, I infer the reason for the slight decrease in
>measurement performance is that the release barrier
>
>prevent “w->processed_pkts++” before that the event has been
>processed (enqueue). But I think this test result is closer
>
>to the real performance.
>
>And sorry for that we have no octentx2 device, so there is no test result
>on Octeontx2 event device driver. Would you please
>
>help us test this patch on octentx2 when you are convenient. Thanks
>very much.
>

I will report the performance numbers on Monday.

>
>
>Best Regards
>
>Feifei

Regards,
Pavan.

>
>
>
>>
>
>> >So, due to two reasons above, I'm ambivalent about how we should
>do in
>
>> >the next step.
>
>> >
>
>> >Best Regards
>
>> >Feifei
>
>>
>
>> Regards,
>
>> Pavan.
>
>>
>
>> >
>
>> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker
>
>> >> >functions")
>
>> >> >Cc:
>pbhagavat...@marvell.com<mailto:pbhagavat...@marvell.com>
>
>> >> >Cc: sta...@dpdk.org<mailto:sta...@dpdk.org>
>
>> >> >
>
>> >> >Signed-off-by: Phil Yang
><phil.y...@arm.com<mailto:phil.y...@arm.com>>
>
>> >> >Signed-off-by: Feifei Wang
><feifei.wa...@arm.com<mailto:feifei.wa...@arm.com>>
>
>> >> >Reviewed-by: Ruifeng Wang
><ruifeng.w...@arm.com<mailto:ruifeng.w...@arm.com>>
>
>> >> >---
>
>> >> > app/test-eventdev/test_pipeline_queue.c | 64
>
>> >> >+++++++++++++++++++++----
>
>> >> > 1 file changed, 56 insertions(+), 8 deletions(-)
>
>> >> >
>
>> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-
>
>> >> >eventdev/test_pipeline_queue.c index 7bebac34f..0c0ec0ceb
>
>> >100644
>
>> >> >--- a/app/test-eventdev/test_pipeline_queue.c
>
>> >> >+++ b/app/test-eventdev/test_pipeline_queue.c
>
>> >> >@@ -30,7 +30,13 @@
>pipeline_queue_worker_single_stage_tx(void
>
>> >> >*arg)
>
>> >> >
>
>> >> >                    if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
>
>> >> >                                    pipeline_event_tx(dev, port, &ev);
>
>> >> >-                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                 /* release barrier here ensures stored
>operation
>
>> >> >+                                 * of the event completes before the 
>> >> >number
>of
>
>> >> >+                                 * processed pkts is visible to the 
>> >> >main core
>
>> >> >+                                 */
>
>> >> >+                                 
>> >> >__atomic_fetch_add(&(w->processed_pkts),
>1,
>
>> >> >+                                                                 
>> >> >__ATOMIC_RELEASE);
>
>> >> >                    } else {
>
>> >> >                                    ev.queue_id++;
>
>> >> >                                    pipeline_fwd_event(&ev,
>
>> >> >RTE_SCHED_TYPE_ATOMIC);
>
>> >> >@@ -59,7 +65,13 @@
>
>> >pipeline_queue_worker_single_stage_fwd(void
>
>> >> >*arg)
>
>> >> >                    rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
>
>> >> >                    pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
>
>> >> >                    pipeline_event_enqueue(dev, port, &ev);
>
>> >> >-                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                 /* release barrier here ensures stored operation
>
>> >> >+                 * of the event completes before the number of
>
>> >> >+                 * processed pkts is visible to the main core
>
>> >> >+                 */
>
>> >> >+                 __atomic_fetch_add(&(w->processed_pkts), 1,
>
>> >> >+                                                 __ATOMIC_RELEASE);
>
>> >> >    }
>
>> >> >
>
>> >> >    return 0;
>
>> >> >@@ -84,7 +96,13 @@
>
>> >> >pipeline_queue_worker_single_stage_burst_tx(void *arg)
>
>> >> >                                    if (ev[i].sched_type ==
>
>> >> >RTE_SCHED_TYPE_ATOMIC) {
>
>> >> >                                                    
>> >> > pipeline_event_tx(dev, port, &ev[i]);
>
>> >> >                                                    ev[i].op = 
>> >> > RTE_EVENT_OP_RELEASE;
>
>> >> >-                                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                                 /* release barrier 
>> >> >here ensures stored
>
>> >> >operation
>
>> >> >+                                                 * of the event 
>> >> >completes before the
>
>> >> >number of
>
>> >> >+                                                 * processed pkts is 
>> >> >visible to the main
>
>> >> >core
>
>> >> >+                                                 */
>
>> >> >+                                                 __atomic_fetch_add(&(w-
>
>> >> >>processed_pkts), 1,
>
>> >> >+
>__ATOMIC_RELEASE);
>
>> >> >                                    } else {
>
>> >> >                                                    ev[i].queue_id++;
>
>> >> >                                                    
>> >> > pipeline_fwd_event(&ev[i],
>
>> >> >@@ -121,7 +139,13 @@
>
>> >> >pipeline_queue_worker_single_stage_burst_fwd(void *arg)
>
>> >> >                    }
>
>> >> >
>
>> >> >                    pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
>
>> >> >-                  w->processed_pkts += nb_rx;
>
>> >> >+
>
>> >> >+                 /* release barrier here ensures stored operation
>
>> >> >+                 * of the event completes before the number of
>
>> >> >+                 * processed pkts is visible to the main core
>
>> >> >+                 */
>
>> >> >+                 __atomic_fetch_add(&(w->processed_pkts), nb_rx,
>
>> >> >+                                                 __ATOMIC_RELEASE);
>
>> >> >    }
>
>> >> >
>
>> >> >    return 0;
>
>> >> >@@ -146,7 +170,13 @@
>
>> >pipeline_queue_worker_multi_stage_tx(void
>
>> >> >*arg)
>
>> >> >
>
>> >> >                    if (ev.queue_id == tx_queue[ev.mbuf->port]) {
>
>> >> >                                    pipeline_event_tx(dev, port, &ev);
>
>> >> >-                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                 /* release barrier here ensures stored
>operation
>
>> >> >+                                 * of the event completes before the 
>> >> >number
>of
>
>> >> >+                                 * processed pkts is visible to the 
>> >> >main core
>
>> >> >+                                 */
>
>> >> >+                                 
>> >> >__atomic_fetch_add(&(w->processed_pkts),
>1,
>
>> >> >+                                                                 
>> >> >__ATOMIC_RELEASE);
>
>> >> >                                    continue;
>
>> >> >                    }
>
>> >> >
>
>> >> >@@ -180,7 +210,13 @@
>
>> >> >pipeline_queue_worker_multi_stage_fwd(void *arg)
>
>> >> >                                    ev.queue_id = 
>> >> > tx_queue[ev.mbuf->port];
>
>> >> >                                    
>> >> > rte_event_eth_tx_adapter_txq_set(ev.mbuf,
>0);
>
>> >> >                                    pipeline_fwd_event(&ev,
>
>> >> >RTE_SCHED_TYPE_ATOMIC);
>
>> >> >-                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                 /* release barrier here ensures stored
>operation
>
>> >> >+                                 * of the event completes before the 
>> >> >number
>of
>
>> >> >+                                 * processed pkts is visible to the 
>> >> >main core
>
>> >> >+                                 */
>
>> >> >+                                 
>> >> >__atomic_fetch_add(&(w->processed_pkts),
>1,
>
>> >> >+                                                                 
>> >> >__ATOMIC_RELEASE);
>
>> >> >                    } else {
>
>> >> >                                    ev.queue_id++;
>
>> >> >                                    pipeline_fwd_event(&ev,
>
>> >> >sched_type_list[cq_id]);
>
>> >> >@@ -214,7 +250,13 @@
>
>> >> >pipeline_queue_worker_multi_stage_burst_tx(void *arg)
>
>> >> >                                    if (ev[i].queue_id == 
>> >> > tx_queue[ev[i].mbuf-
>
>> >> >>port]) {
>
>> >> >                                                    
>> >> > pipeline_event_tx(dev, port, &ev[i]);
>
>> >> >                                                    ev[i].op = 
>> >> > RTE_EVENT_OP_RELEASE;
>
>> >> >-                                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                                 /* release barrier 
>> >> >here ensures stored
>
>> >> >operation
>
>> >> >+                                                 * of the event 
>> >> >completes before the
>
>> >> >number of
>
>> >> >+                                                 * processed pkts is 
>> >> >visible to the main
>
>> >> >core
>
>> >> >+                                                 */
>
>> >> >+                                                 __atomic_fetch_add(&(w-
>
>> >> >>processed_pkts), 1,
>
>> >> >+
>__ATOMIC_RELEASE);
>
>> >> >                                                    continue;
>
>> >> >                                    }
>
>> >> >
>
>> >> >@@ -254,7 +296,13 @@
>
>> >> >pipeline_queue_worker_multi_stage_burst_fwd(void *arg)
>
>> >> >
>
>> >> >    rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);
>
>> >> >                                                    
>> >> > pipeline_fwd_event(&ev[i],
>
>> >> >
>
>> >> >    RTE_SCHED_TYPE_ATOMIC);
>
>> >> >-                                                  w->processed_pkts++;
>
>> >> >+
>
>> >> >+                                                 /* release barrier 
>> >> >here ensures stored
>
>> >> >operation
>
>> >> >+                                                 * of the event 
>> >> >completes before the
>
>> >> >number of
>
>> >> >+                                                 * processed pkts is 
>> >> >visible to the main
>
>> >> >core
>
>> >> >+                                                 */
>
>> >> >+                                                 __atomic_fetch_add(&(w-
>
>> >> >>processed_pkts), 1,
>
>> >> >+
>__ATOMIC_RELEASE);
>
>> >> >                                    } else {
>
>> >> >                                                    ev[i].queue_id++;
>
>> >> >                                                    
>> >> > pipeline_fwd_event(&ev[i],
>
>> >> >--
>
>> >> >2.17.1
>

Re: [dpdk-dev] [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline test

Reply via email to