On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
>
> On 2023-05-15 14:38, Jerin Jacob wrote:
> > On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> > <mattias.ronnb...@ericsson.com> wrote:
> >>
> >> On 2023-05-12 13:59, Jerin Jacob wrote:
> >>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> >>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>
> >>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >>>> dequeue only when the burst size is compile-time constant (and equal
> >>>> to one).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>>
> >>>> ---
> >>>>
> >>>> v3: Actually include the change v2 claimed to contain.
> >>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>>>       application is compiled with -pedantic. (Morten Brørup)
> >>>> ---
> >>>>    lib/eventdev/rte_eventdev.h | 4 ++--
> >>>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >>>> index a90e23ac8b..a471caeb6d 100644
> >>>> --- a/lib/eventdev/rte_eventdev.h
> >>>> +++ b/lib/eventdev/rte_eventdev.h
> >>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t 
> >>>> port_id,
> >>>>            * Allow zero cost non burst mode routine invocation if 
> >>>> application
> >>>>            * requests nb_events as const one
> >>>>            */
> >>>> -       if (nb_events == 1)
> >>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events 
> >>>> == 1)
> >>>
> >>> "Why" part is not clear from the commit message. Is this to avoid
> >>> nb_events read if it is built-in const.
> >>
> >> The __builtin_constant_p() is introduced to avoid having the compiler
> >> generate a conditional branch and two different code paths in case
> >> nb_elem is a run-time variable.
> >>
> >> In particular, this matters if nb_elems is run-time variable and varies
> >> between 1 and some larger value.
> >>
> >> I should have mention this in the commit message.
> >>
> >> A very slight performance improvement. It also makes the code better
> >> match the comment, imo. Zero cost for const one enqueues, but no impact
> >> non-compile-time-constant-length enqueues.
> >>
> >> Feel free to ignore.
> >
> >
> > I did some performance comparison of the patch.
> > A low-end ARM machines shows 0.7%  drop with single event case. No
> > regression see with high-end ARM cores with single event case.
> >
> > IMO, optimizing the check for burst mode(the new patch) may not show
> > any real improvement as the cost is divided by number of event.
> > Whereas optimizing the check for single event case(The current code)
> > shows better performance with single event case and no regression
> > with burst mode as cost is divided by number of events.
>
> I ran some tests on an AMD Zen 3 with DSW.
> In the below tests the enqueue burst size is not compile-time constant.
>
> Enqueue burst size      Performance improvement
> Run-time constant 1     ~5%
> Run-time constant 2     ~0%
> Run-time variable 1-2   ~9%
> Run-time variable 1-16  ~0%
>
> The run-time variable enqueue sizes randomly (uniformly) distributed in
> the specified range.
>
> The first result may come as a surprise. The benchmark is using
> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
> in most apps). The single-event enqueue function only exists in a
> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
> I suspect that is the reason for the performance improvement.
>
> This effect is large-enough to make it somewhat beneficial (+~1%) to use
> run-time variable single-event enqueue compared to keeping the burst
> size compile-time constant.

# Interesting, Could you share your testeventdev command to test it.
# By having quick glance on DSW code, following change can be added(or
 similar cases).
Not sure such change in DSW driver is making a difference or nor?


diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
index e84b65d99f..455470997b 100644
--- a/drivers/event/dsw/dsw_event.c
+++ b/drivers/event/dsw/dsw_event.c
@@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
*dsw, struct dsw_port *source_port)
 uint16_t
 dsw_event_enqueue(void *port, const struct rte_event *ev)
 {
-       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
+       return dsw_event_enqueue_burst(port, ev, 1);
 }

 static __rte_always_inline uint16_t
@@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
*source_port,
        return (num_non_release + num_release);
 }

-uint16_t
+inline uint16_t
 dsw_event_enqueue_burst(void *port, const struct rte_event events[],
                        uint16_t events_len)
 {

# I am testing with command like this "/build/app/dpdk-test-eventdev
-l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
--stlist p --nb_pkts=10000000000"

>
> The performance gain is counted toward both enqueue and dequeue costs
> (+benchmark app overhead), so an under-estimation if see this as an
> enqueue performance improvement.
>
> > If you agree, then we can skip this patch.
> >
>
> I have no strong opinion if this should be included or not.
>
> It was up to me, I would drop the single-enqueue special case handling
> altogether in the next ABI update.

That's a reasonable path. If we are willing to push a patch, we can
test it and give feedback.
Or in our spare time, We can do that as well.

>
> >
> >>
> >>> If so, check should be following. Right?
> >>>
> >>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> >>> || nb_events  == 1)
> >>>
> >>> At least, It was my original intention in the code.
> >>>
> >>>
> >>>
> >>>>                   return (fp_ops->enqueue)(port, ev);
> >>>>           else
> >>>>                   return fn(port, ev, nb_events);
> >>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t 
> >>>> port_id, struct rte_event ev[],
> >>>>            * Allow zero cost non burst mode routine invocation if 
> >>>> application
> >>>>            * requests nb_events as const one
> >>>>            */
> >>>> -       if (nb_events == 1)
> >>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events 
> >>>> == 1)
> >>>>                   return (fp_ops->dequeue)(port, ev, timeout_ticks);
> >>>>           else
> >>>>                   return (fp_ops->dequeue_burst)(port, ev, nb_events,
> >>>> --
> >>>> 2.34.1
> >>>>
> >>

Reply via email to