On 2023-05-16 15:08, Jerin Jacob wrote:
On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hof...@lysator.liu.se> wrote:

On 2023-05-15 14:38, Jerin Jacob wrote:
On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
<mattias.ronnb...@ericsson.com> wrote:

On 2023-05-12 13:59, Jerin Jacob wrote:
On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
<mattias.ronnb...@ericsson.com> wrote:

Use non-burst event enqueue and dequeue calls from burst enqueue and
dequeue only when the burst size is compile-time constant (and equal
to one).

Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>

---

v3: Actually include the change v2 claimed to contain.
v2: Wrap builtin call in __extension__, to avoid compiler warnings if
       application is compiled with -pedantic. (Morten Brørup)
---
    lib/eventdev/rte_eventdev.h | 4 ++--
    1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index a90e23ac8b..a471caeb6d 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
            * Allow zero cost non burst mode routine invocation if application
            * requests nb_events as const one
            */
-       if (nb_events == 1)
+       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)

"Why" part is not clear from the commit message. Is this to avoid
nb_events read if it is built-in const.

The __builtin_constant_p() is introduced to avoid having the compiler
generate a conditional branch and two different code paths in case
nb_events is a run-time variable.

In particular, this matters if nb_events is a run-time variable and varies
between 1 and some larger value.

I should have mentioned this in the commit message.

A very slight performance improvement. It also makes the code better
match the comment, imo. Zero cost for const one enqueues, but no impact
on non-compile-time-constant-length enqueues.
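
To illustrate what I mean (a simplified sketch, not the actual
rte_eventdev.h code; single_enqueue()/burst_enqueue() are made-up
stand-ins for the fp_ops calls):

#include <rte_eventdev.h>

/* Stand-ins for the fp_ops->enqueue and fp_ops->enqueue_burst calls. */
uint16_t single_enqueue(void *port, const struct rte_event *ev);
uint16_t burst_enqueue(void *port, const struct rte_event ev[],
                       uint16_t nb_events);

static inline uint16_t
enqueue(void *port, const struct rte_event ev[], uint16_t nb_events)
{
        /*
         * __builtin_constant_p(nb_events) folds to 0 unless nb_events is
         * a compile-time constant, so for a run-time variable burst size
         * the whole if-statement disappears and no conditional branch is
         * emitted. The old "if (nb_events == 1)" kept the branch and both
         * code paths in that case.
         */
        if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
                return single_enqueue(port, ev);

        return burst_enqueue(port, ev, nb_events);
}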

Feel free to ignore.


I did some performance comparison of the patch.
A low-end ARM machine shows a 0.7% drop for the single-event case. No
regression is seen with high-end ARM cores for the single-event case.

IMO, optimizing the check for burst mode (the new patch) may not show
any real improvement, as the cost is divided by the number of events.
Whereas optimizing the check for the single-event case (the current code)
shows better performance for the single-event case and no regression
in burst mode, as the cost is divided by the number of events.

I ran some tests on an AMD Zen 3 with DSW.
In the tests below, the enqueue burst size is not a compile-time constant.

Enqueue burst size      Performance improvement
Run-time constant 1     ~5%
Run-time constant 2     ~0%
Run-time variable 1-2   ~9%
Run-time variable 1-16  ~0%

The run-time variable enqueue sizes are randomly (uniformly) distributed
in the specified range.

The first result may come as a surprise. The benchmark is using
RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
in most apps). The single-event enqueue function only exists in a
generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
I suspect that is the reason for the performance improvement.

This effect is large enough to make it somewhat beneficial (+~1%) to use
a run-time variable single-event enqueue compared to keeping the burst
size compile-time constant.
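
For reference, the kind of application-side pattern I have in mind (a
minimal sketch; dev_id, port_id and next_queue are placeholders):

struct rte_event ev;

/* Worker forwarding one event at a time. */
if (rte_event_dequeue_burst(dev_id, port_id, &ev, 1, 0) > 0) {
        ev.queue_id = next_queue;
        ev.op = RTE_EVENT_OP_FORWARD;
        /*
         * The op-specific burst variant lets the PMD assume all events
         * are forwarded ones; there is no single-event equivalent of it.
         */
        rte_event_enqueue_forward_burst(dev_id, port_id, &ev, 1);
}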

# Interesting. Could you share your testeventdev command to test it?

I'm using a proprietary benchmark to evaluate the effect of these changes. There's certainly nothing secret about that program, and nothing very DSW-specific either. I hope to, at some point, both extend the DPDK eventdev tests to include DSW and contribute benchmark/characteristics tests (perf unit tests or a separate program), if there seems to be value in this.

# From a quick glance at the DSW code, the following change can be added
  (or in similar cases).
Not sure whether such a change in the DSW driver makes a difference or not?


diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
index e84b65d99f..455470997b 100644
--- a/drivers/event/dsw/dsw_event.c
+++ b/drivers/event/dsw/dsw_event.c
@@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
*dsw, struct dsw_port *source_port)
  uint16_t
  dsw_event_enqueue(void *port, const struct rte_event *ev)
  {
-       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
+       return dsw_event_enqueue_burst(port, ev, 1);

Good point.

Historical note: I think that comparison is old cruft borne out of a misconception that the single-event enqueue could be called directly from application code, combined with the fact that producer-only ports needed some way to "maintain" a port prior to the introduction of rte_event_maintain().

  }

  static __rte_always_inline uint16_t
@@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
*source_port,
         return (num_non_release + num_release);
  }

-uint16_t
+inline uint16_t

From what I can tell, this does not have the desired effect, at least not on GCC 11.3 (with the default DPDK compiler configuration).

I reached this conclusion when I noticed that if I reshuffle the code so as to force (not hint) the inlining of the burst (and generic burst) enqueue function into dsw_event_enqueue(), your change performs better.
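
Roughly what I mean by reshuffling (a sketch with made-up helper names,
not the actual DSW code):

/* Shared implementation, force-inlined into both entry points. */
static __rte_always_inline uint16_t
dsw_enqueue_burst_inline(void *port, const struct rte_event events[],
                         uint16_t events_len)
{
        /* Stand-in for the body of today's dsw_event_enqueue_burst(). */
        (void)port;
        (void)events;
        return events_len;
}

uint16_t
dsw_event_enqueue(void *port, const struct rte_event *ev)
{
        /* The compiler specializes the inlined copy for a burst of one. */
        return dsw_enqueue_burst_inline(port, ev, 1);
}

uint16_t
dsw_event_enqueue_burst(void *port, const struct rte_event events[],
                        uint16_t events_len)
{
        return dsw_enqueue_burst_inline(port, events, events_len);
}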

  dsw_event_enqueue_burst(void *port, const struct rte_event events[],
                         uint16_t events_len)
  {

# I am testing with a command like this: "/build/app/dpdk-test-eventdev
-l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
--stlist p --nb_pkts=10000000000"



I re-ran the compile-time variable, run-time constant enqueue size of 1 case, and got the following:

Jerin's change: +4%
Jerin's change + ensure inlining: +6%
RFC v3: +7%

(Here I use a different setup that produces more deterministic results, hence the different numbers compared to the previous runs. The previous runs were using a pipeline spread over two chiplets, and these runs use only a single chiplet.)

It seems like with your suggested changes you eliminate most of the single-enqueue special-case performance degradation (for DSW), but not all of it. The remaining degradation is very small (for the above case; larger for small but run-time variable enqueue sizes), but it's a little sad that a supposedly performance-enhancing special case (which adds complexity to the code, although not much) actually degrades performance.


The performance gain is counted toward both enqueue and dequeue costs
(plus benchmark app overhead), so it is an under-estimation if you see
this as an enqueue performance improvement.

If you agree, then we can skip this patch.


I have no strong opinion if this should be included or not.

If it were up to me, I would drop the single-enqueue special case handling
altogether in the next ABI update.

That's a reasonable path. If you are willing to push a patch, we can
test it and give feedback.
Or, in our spare time, we can do that as well.


Sure, I'll give it a try.

The next release is an ABI-breaking one?




If so, the check should be the following, right?

if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
|| nb_events  == 1)

At least, that was my original intention in the code.



                   return (fp_ops->enqueue)(port, ev);
           else
                   return fn(port, ev, nb_events);
@@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, 
struct rte_event ev[],
            * Allow zero cost non burst mode routine invocation if application
            * requests nb_events as const one
            */
-       if (nb_events == 1)
+       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
                   return (fp_ops->dequeue)(port, ev, timeout_ticks);
           else
                   return (fp_ops->dequeue_burst)(port, ev, nb_events,
--
2.34.1

