On Wed, May 17, 2023 at 12:46 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
>
> On 2023-05-16 15:08, Jerin Jacob wrote:
> > On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
> >>
> >> On 2023-05-15 14:38, Jerin Jacob wrote:
> >>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> >>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>
> >>>> On 2023-05-12 13:59, Jerin Jacob wrote:
> >>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> >>>>> <mattias.ronnb...@ericsson.com> wrote:
> >>>>>>
> >>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >>>>>> dequeue only when the burst size is a compile-time constant (and equal
> >>>>>> to one).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> v3: Actually include the change v2 claimed to contain.
> >>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>>>>>     application is compiled with -pedantic. (Morten Brørup)
> >>>>>> ---
> >>>>>>  lib/eventdev/rte_eventdev.h | 4 ++--
> >>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >>>>>> index a90e23ac8b..a471caeb6d 100644
> >>>>>> --- a/lib/eventdev/rte_eventdev.h
> >>>>>> +++ b/lib/eventdev/rte_eventdev.h
> >>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
> >>>>>>          * Allow zero cost non burst mode routine invocation if application
> >>>>>>          * requests nb_events as const one
> >>>>>>          */
> >>>>>> -       if (nb_events == 1)
> >>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>>>>
> >>>>> The "why" part is not clear from the commit message. Is this to avoid the
> >>>>> nb_events read if it is a built-in const?
> >>>>
> >>>> The __builtin_constant_p() is introduced to avoid having the compiler
> >>>> generate a conditional branch and two different code paths in case
> >>>> nb_elems is a run-time variable.
> >>>>
> >>>> In particular, this matters if nb_elems is a run-time variable and varies
> >>>> between 1 and some larger value.
> >>>>
> >>>> I should have mentioned this in the commit message.
> >>>>
> >>>> A very slight performance improvement. It also makes the code better
> >>>> match the comment, imo. Zero cost for const one enqueues, but no impact
> >>>> on non-compile-time-constant-length enqueues.
> >>>>
> >>>> Feel free to ignore.
> >>>
> >>>
> >>> I did some performance comparison of the patch.
> >>> A low-end ARM machine shows a 0.7% drop in the single event case. No
> >>> regression seen with high-end ARM cores in the single event case.
> >>>
> >>> IMO, optimizing the check for burst mode (the new patch) may not show
> >>> any real improvement, as the cost is divided by the number of events.
> >>> Whereas optimizing the check for the single event case (the current code)
> >>> shows better performance in the single event case and no regression
> >>> in burst mode, as the cost is divided by the number of events.
> >>
> >> I ran some tests on an AMD Zen 3 with DSW.
> >> In the below tests the enqueue burst size is not a compile-time constant.
> >>
> >> Enqueue burst size        Performance improvement
> >> Run-time constant 1       ~5%
> >> Run-time constant 2       ~0%
> >> Run-time variable 1-2     ~9%
> >> Run-time variable 1-16    ~0%
> >>
> >> The run-time variable enqueue sizes were randomly (uniformly) distributed
> >> in the specified range.
> >>
> >> The first result may come as a surprise.
> >> The benchmark is using RTE_EVENT_OP_FORWARD type events (which likely is
> >> the dominating op type in most apps). The single-event enqueue function
> >> only exists in a generic variant (i.e., there is no
> >> rte_event_enqueue_forward_burst() equivalent). I suspect that is the
> >> reason for the performance improvement.
> >>
> >> This effect is large enough to make it somewhat beneficial (+~1%) to use
> >> run-time variable single-event enqueue compared to keeping the burst
> >> size compile-time constant.
> >
> > # Interesting. Could you share your testeventdev command to test it?
>
> I'm using a proprietary benchmark to evaluate the effect of these
> changes. There's certainly nothing secret about that program, and also
> nothing very DSW-specific either. I hope at some point both to extend
> the DPDK eventdev tests to include DSW, and also to contribute
> benchmarks/characteristics tests (perf unit tests or as a separate
> program), if there seems to be value in this.

Yes. Please extend testeventdev for your use case so that all drivers can
test it and it helps to optimize _real world_ cases. Testeventdev already
has a plugin kind of interface; it should be pretty easy to add new MODES.

> >
> > # By having a quick glance at the DSW code, the following change can be
> > added (or similar cases). Not sure whether such a change in the DSW
> > driver makes a difference or not?
> >
> > diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> > index e84b65d99f..455470997b 100644
> > --- a/drivers/event/dsw/dsw_event.c
> > +++ b/drivers/event/dsw/dsw_event.c
> > @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev *dsw, struct dsw_port *source_port)
> >  uint16_t
> >  dsw_event_enqueue(void *port, const struct rte_event *ev)
> >  {
> > -       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> > +       return dsw_event_enqueue_burst(port, ev, 1);
>
> Good point.
>
> Historical note: I think that comparison is old cruft borne out of a
> misconception that the single-event enqueue could be called directly
> from application code, combined with the fact that producer-only ports
> needed some way to "maintain" a port, prior to the introduction of
> rte_event_maintain().
>
> >  }
> >
> >  static __rte_always_inline uint16_t
> > @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port *source_port,
> >         return (num_non_release + num_release);
> >  }
> >
> > -uint16_t
> > +inline uint16_t
>
> From what it seems, this does not have the desired effect, at least not
> on GCC 11.3 (with the default DPDK compiler configuration).
>
> I reached this conclusion when I noticed that if I reshuffle the code so
> as to force (not hint) the inlining of the burst (and generic burst)
> enqueue function into dsw_event_enqueue(), your change performs better.
>
> >  dsw_event_enqueue_burst(void *port, const struct rte_event events[],
> >                          uint16_t events_len)
> >  {
> >
> > # I am testing with a command like this: "/build/app/dpdk-test-eventdev
> > -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> > --stlist p --nb_pkts=10000000000"
>
> I re-ran the compile-time variable, run-time constant enqueue size of 1,
> and I got the following:
>
> Jerin's change: +4%
> Jerin's change + ensure inlining: +6%
> RFC v3: +7%
>
> (Here I use a somewhat different setup that produces more deterministic
> results, hence the different numbers compared to the previous runs. They
> were using a pipeline spread over two chiplets, and these runs are using
> only a single chiplet.)
>
> It seems like with your suggested changes you eliminate most of the
> single-enqueue special case performance degradation (for DSW), but not
> all of it. The remaining degradation is very small (for the above case,

On cores like AMD Zen 3, I was not expecting a 1% diff from such a check,
especially with proper branch predictors. Even pretty low-end ARM cores had
around a 0.7% diff, and new ARM cores show no difference.

> larger for small but run-time variable enqueue sizes), but it's a little
> sad that a supposedly performance-enhancing special case (that drives
> complexity in the code, although not much) actually degrades performance.

OK. Let's get rid of the fp_ops->dequeue callback. The initial RFC of
eventdev had a public non-burst API; that was the reason for that callback.

>
> >>
> >> The performance gain is counted toward both enqueue and dequeue costs
> >> (+benchmark app overhead), so an under-estimation if seen as an
> >> enqueue performance improvement.
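
For context, the guard that "RFC v3" above measures boils down to something
like the sketch below. This is only an illustration of the pattern, with
placeholder names, not the actual rte_eventdev.h code: with optimization and
inlining, a literal burst size of 1 folds the condition to true and keeps the
single-event path, while a run-time burst size folds __builtin_constant_p()
to 0, so the compiler drops both the branch and the single-event path
entirely.

#include <stdint.h>

struct ev { uint64_t data; }; /* placeholder for struct rte_event */

static inline uint16_t
enqueue_single(const struct ev *e)
{
	(void)e;
	return 1; /* stand-in for a driver's single-event enqueue */
}

static inline uint16_t
enqueue_burst(const struct ev evs[], uint16_t nb)
{
	(void)evs;
	return nb; /* stand-in for a driver's burst enqueue */
}

static inline uint16_t
enqueue(const struct ev evs[], uint16_t nb_events)
{
	/*
	 * __builtin_constant_p(nb_events) is resolved at compile time.
	 * After inlining, enqueue(evs, 1) keeps only the single-event path;
	 * a run-time nb_events leaves only the burst path, with no branch.
	 */
	if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
		return enqueue_single(&evs[0]);

	return enqueue_burst(evs, nb_events);
}

Compiling such a sketch with -O2 and inspecting the assembly is a quick way
to confirm that no extra branch is emitted for the run-time case.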

> >>
> >>> If you agree, then we can skip this patch.
> >>>
> >>
> >> I have no strong opinion on whether this should be included or not.
> >>
> >> If it were up to me, I would drop the single-enqueue special case handling
> >> altogether in the next ABI update.
> >
> > That's a reasonable path. If you are willing to push a patch, we can
> > test it and give feedback. Or, in our spare time, we can do that as well.
> >
>
> Sure, I'll give it a try.
>
> The next release is an ABI-breaking one?

Yes (23.11). Please plan to send a deprecation notice before the 23.07
release.

I will mark this patch as rejected in patchwork. Thanks for your time.