On Wed, May 17, 2023 at 12:46 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
>
> On 2023-05-16 15:08, Jerin Jacob wrote:
> > On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
> >>
> >> On 2023-05-15 14:38, Jerin Jacob wrote:
> >>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> >>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>
> >>>> On 2023-05-12 13:59, Jerin Jacob wrote:
> >>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> >>>>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>>>
> >>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >>>>>> dequeue only when the burst size is a compile-time constant (and equal
> >>>>>> to one).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> v3: Actually include the change v2 claimed to contain.
> >>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>>>>>     application is compiled with -pedantic. (Morten Brørup)
> >>>>>> ---
> >>>>>>  lib/eventdev/rte_eventdev.h | 4 ++--
> >>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >>>>>> index a90e23ac8b..a471caeb6d 100644
> >>>>>> --- a/lib/eventdev/rte_eventdev.h
> >>>>>> +++ b/lib/eventdev/rte_eventdev.h
> >>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
> >>>>>>          * Allow zero cost non burst mode routine invocation if application
> >>>>>>          * requests nb_events as const one
> >>>>>>          */
> >>>>>> -       if (nb_events == 1)
> >>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>>>>
> >>>>> The "why" part is not clear from the commit message. Is this to avoid the
> >>>>> nb_events read if it is a built-in const?
> >>>>
> >>>> The __builtin_constant_p() is introduced to avoid having the compiler
> >>>> generate a conditional branch and two different code paths in case
> >>>> nb_elems is a run-time variable.
> >>>>
> >>>> In particular, this matters if nb_elems is a run-time variable and varies
> >>>> between 1 and some larger value.
> >>>>
> >>>> I should have mentioned this in the commit message.
> >>>>
> >>>> A very slight performance improvement. It also makes the code better
> >>>> match the comment, imo. Zero cost for const one enqueues, but no impact
> >>>> on non-compile-time-constant-length enqueues.
> >>>>
> >>>> Feel free to ignore.
> >>>
> >>>
> >>> I did some performance comparison of the patch.
> >>> A low-end ARM machine shows a 0.7% drop in the single event case. No
> >>> regression seen with high-end ARM cores in the single event case.
> >>>
> >>> IMO, optimizing the check for burst mode (the new patch) may not show
> >>> any real improvement, as the cost is divided by the number of events.
> >>> Whereas optimizing the check for the single event case (the current code)
> >>> shows better performance in the single event case and no regression
> >>> in burst mode, as the cost is divided by the number of events.
> >>
> >> I ran some tests on an AMD Zen 3 with DSW.
> >> In the below tests the enqueue burst size is not a compile-time constant.
> >>
> >> Enqueue burst size        Performance improvement
> >> Run-time constant 1       ~5%
> >> Run-time constant 2       ~0%
> >> Run-time variable 1-2     ~9%
> >> Run-time variable 1-16    ~0%
> >>
> >> The run-time variable enqueue sizes were randomly (uniformly) distributed
> >> in the specified range.
> >>
> >> The first result may come as a surprise.
> >> The benchmark is using RTE_EVENT_OP_FORWARD type events (which likely is
> >> the dominating op type in most apps). The single-event enqueue function
> >> only exists in a generic variant (i.e., there is no
> >> rte_event_enqueue_forward_burst() equivalent). I suspect that is the
> >> reason for the performance improvement.
> >>
> >> This effect is large enough to make it somewhat beneficial (+~1%) to use
> >> run-time variable single-event enqueue compared to keeping the burst
> >> size compile-time constant.
> >
> > # Interesting. Could you share your testeventdev command to test it?
>
> I'm using a proprietary benchmark to evaluate the effect of these
> changes. There's certainly nothing secret about that program, and also
> nothing very DSW-specific either. I hope at some point both to extend
> the DPDK eventdev tests to include DSW, and also to contribute
> benchmarks/characteristics tests (perf unit tests or as a separate
> program), if there seems to be value in this.

Yes. Please extend testeventdev for your use case so that all drivers can
test it and it helps to optimize _real world_ cases. Testeventdev already
has a plugin kind of interface; it should be pretty easy to add new MODES.

> >
> > # By having a quick glance at the DSW code, the following change can be
> > added (or similar cases). Not sure whether such a change in the DSW
> > driver makes a difference or not?
> >
> > diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> > index e84b65d99f..455470997b 100644
> > --- a/drivers/event/dsw/dsw_event.c
> > +++ b/drivers/event/dsw/dsw_event.c
> > @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev *dsw, struct dsw_port *source_port)
> >  uint16_t
> >  dsw_event_enqueue(void *port, const struct rte_event *ev)
> >  {
> > -       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> > +       return dsw_event_enqueue_burst(port, ev, 1);
>
> Good point.
>
> Historical note: I think that comparison is old cruft borne out of a
> misconception that the single-event enqueue could be called directly
> from application code, combined with the fact that producer-only ports
> needed some way to "maintain" a port, prior to the introduction of
> rte_event_maintain().
>
> >  }
> >
> >  static __rte_always_inline uint16_t
> > @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port *source_port,
> >         return (num_non_release + num_release);
> >  }
> >
> > -uint16_t
> > +inline uint16_t
>
> From what it seems, this does not have the desired effect, at least not
> on GCC 11.3 (with the default DPDK compiler configuration).
>
> I reached this conclusion when I noticed that if I reshuffle the code so
> as to force (not hint) the inlining of the burst (and generic burst)
> enqueue function into dsw_event_enqueue(), your change performs better.
>
> >  dsw_event_enqueue_burst(void *port, const struct rte_event events[],
> >                          uint16_t events_len)
> >  {
> >
> > # I am testing with a command like this: "/build/app/dpdk-test-eventdev
> > -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> > --stlist p --nb_pkts=10000000000"
>
> I re-ran the compile-time variable, run-time constant enqueue size of 1,
> and I got the following:
>
> Jerin's change: +4%
> Jerin's change + ensure inlining: +6%
> RFC v3: +7%
>
> (Here I use a somewhat different setup that produces more deterministic
> results, hence the different numbers compared to the previous runs. They
> were using a pipeline spread over two chiplets, and these runs are using
> only a single chiplet.)
>
> It seems like with your suggested changes you eliminate most of the
> single-enqueue special case performance degradation (for DSW), but not
> all of it. The remaining degradation is very small (for the above case,

On cores like AMD Zen 3, I was not expecting a 1% diff from such a check,
especially with proper branch predictors. Even pretty low-end ARM cores had
around a 0.7% diff, and new ARM cores show no difference.

> larger for small but run-time variable enqueue sizes), but it's a little
> sad that a supposedly performance-enhancing special case (that drives
> complexity in the code, although not much) actually degrades performance.

OK. Let's get rid of the fp_ops->dequeue callback. The initial RFC of
eventdev had a public non-burst API; that was the reason for that callback.

>
> >>
> >> The performance gain is counted toward both enqueue and dequeue costs
> >> (+benchmark app overhead), so an under-estimation if seen as an
> >> enqueue performance improvement.
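
For context, the guard that "RFC v3" above measures boils down to something
like the sketch below. This is only an illustration of the pattern, with
placeholder names, not the actual rte_eventdev.h code: with optimization and
inlining, a literal burst size of 1 folds the condition to true and keeps the
single-event path, while a run-time burst size folds __builtin_constant_p()
to 0, so the compiler drops both the branch and the single-event path
entirely.

#include <stdint.h>

struct ev { uint64_t data; }; /* placeholder for struct rte_event */

static inline uint16_t
enqueue_single(const struct ev *e)
{
	(void)e;
	return 1; /* stand-in for a driver's single-event enqueue */
}

static inline uint16_t
enqueue_burst(const struct ev evs[], uint16_t nb)
{
	(void)evs;
	return nb; /* stand-in for a driver's burst enqueue */
}

static inline uint16_t
enqueue(const struct ev evs[], uint16_t nb_events)
{
	/*
	 * __builtin_constant_p(nb_events) is resolved at compile time.
	 * After inlining, enqueue(evs, 1) keeps only the single-event path;
	 * a run-time nb_events leaves only the burst path, with no branch.
	 */
	if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
		return enqueue_single(&evs[0]);

	return enqueue_burst(evs, nb_events);
}

Compiling such a sketch with -O2 and inspecting the assembly is a quick way
to confirm that no extra branch is emitted for the run-time case.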

> >>
> >>> If you agree, then we can skip this patch.
> >>>
> >>
> >> I have no strong opinion on whether this should be included or not.
> >>
> >> If it were up to me, I would drop the single-enqueue special case handling
> >> altogether in the next ABI update.
> >
> > That's a reasonable path. If you are willing to push a patch, we can
> > test it and give feedback. Or, in our spare time, we can do that as well.
> >
>
> Sure, I'll give it a try.
>
> The next release is an ABI-breaking one?

Yes (23.11). Please plan to send a deprecation notice before the 23.07
release.

I will mark this patch as rejected in patchwork. Thanks for your time.