Re: [dpdk-dev] [PATCH v1] event/sw: performance improvements

Nicolau, Radu Wed, 07 Oct 2020 03:45:09 -0700


On 10/6/2020 11:13 AM, Ananyev, Konstantin wrote:

-----Original Message-----
From: Jerin Jacob <[email protected]>
Sent: Monday, October 5, 2020 5:35 PM
To: Nicolau, Radu <[email protected]>
Cc: Honnappa Nagarahalli <[email protected]>; Richardson, Bruce
<[email protected]>; Ananyev, Konstantin
<[email protected]>; Van Haaren, Harry
<[email protected]>; [email protected]; [email protected]; nd
<[email protected]>
Subject: Re: [dpdk-dev] [PATCH v1] event/sw: performance improvements

On Tue, Sep 29, 2020 at 2:32 PM Nicolau, Radu <[email protected]> wrote:


On 9/28/2020 5:02 PM, Honnappa Nagarahalli wrote:

<snip>

Add minimum burst throughout the scheduler pipeline and a flush counter.
Replace ring API calls with local single threaded implementation where
possible.

Signed-off-by: Radu Nicolau mailto:[email protected]

Thanks for the patch, a few comments inline.

Why not make these APIs part of the rte_ring library? You could further
optimize them by keeping the indices on the same cacheline.

I'm not sure there is any need for non thread-safe rings outside this
particular case.

[Honnappa] I think if we add the APIs, we will find the use cases.
But, more than that, I understand that rte_ring structure is exposed to the
application. The reason for doing that is the inline functions that rte_ring
provides. IMO, we should still maintain modularity and should not use the
internals of the rte_ring structure outside of the library.

+1 to that.

BTW, is there any real perf benefit from such micro-optimisation?

I'd tend to view these as use-case specific, and I'm not sure we should clutter
up the ring library with yet more functions, especially since they can't be
mixed with the existing enqueue/dequeue functions, since they don't use
the head pointers.

IMO, the ring library is pretty organized with the recent addition of HTS/RTS
modes. This can be one of the modes and should allow us to use the existing
functions (though additional functions are required as well).

The other concern I have is, this implementation can be further optimized by
using a single cache line for the pointers. It uses 2 cache lines just because
of the layout of the rte_ring structure.

There was a question earlier about the performance improvements of this
patch? Are there any % performance improvements that can be shared?

It is also possible to change the above functions to use the head/tail pointers
from producer or the consumer cache line alone to check for perf differences.

I don't have a % for the final improvement for this change alone, but
there was some improvement in the memory overhead measurable during
development, which very likely resulted in the whole optimization having
more headroom.

I agree that this may be further optimized, maybe by having a local
implementation of a ring-like container instead.

Have we decided on the next steps for this patch? Is the plan to
supersede this patch and have a different one in the rte_ring subsystem?

My preference is to merge this version of the patch:
1) The ring helper functions are stripped down to the SW PMD usage, and are
not valid for general use.
2) Adding static inline APIs in an LTS without extensive doesn't seem a good
idea.

If Honnappa is OK with the above solution for 20.11, we can see about moving
the rings part of the code to the rte_ring library location in 21.02, and give
ourselves some time to settle the usage/API before the next LTS.

As ring library maintainer I share Honnappa's concern that another library does
not use the public ring API, but instead accesses ring internals directly.
Obviously such a coding practice is not welcome, as it makes it harder to
maintain/extend the ring library in future.
About 2) - these new APIs can (/should) be marked as experimental anyway.
As another thing - it is still unclear what performance gain we are talking
about here.
Is it really worth it compared to just using SP/SC?

The change itself came after I analyzed the memory bound sections of the code, and I just did a quick test, I got about 3.5% improvement in throughput, maybe not so much but significant for such a small change, and depending on the use case it may be more.

As for the implementation itself, I would favour having a custom ring-like container in the PMD code; this will solve the issue of using rte_ring internals while still allowing for full optimisation. If this is acceptable, I will follow up by tomorrow.
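
For illustration only, here is a minimal sketch of what a PMD-local, single-threaded ring-like container along these lines might look like. It is not the code from the patch. Every name in it (st_ring, ST_RING_SIZE, st_ring_enqueue_burst and so on) is hypothetical, and it assumes a fixed power-of-two capacity and a single thread doing both enqueue and dequeue. The two indices are adjacent fields, so they will normally share a cache line, which is the layout point raised earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; the masking below assumes a power of two. */
#define ST_RING_SIZE 8u
#define ST_RING_MASK (ST_RING_SIZE - 1u)

_Static_assert((ST_RING_SIZE & ST_RING_MASK) == 0,
               "ST_RING_SIZE must be a power of two");

/* Hypothetical single-threaded ring; not an rte_ring and not thread-safe. */
struct st_ring {
    uint32_t head;              /* free-running write index */
    uint32_t tail;              /* free-running read index  */
    void *slots[ST_RING_SIZE];  /* object pointers          */
};

static inline void st_ring_init(struct st_ring *r)
{
    r->head = 0;
    r->tail = 0;
}

/* Number of objects currently stored (unsigned wrap-around is intended). */
static inline uint32_t st_ring_count(const struct st_ring *r)
{
    return r->head - r->tail;
}

/* Enqueue up to n objects; returns how many were actually stored. */
static inline uint32_t
st_ring_enqueue_burst(struct st_ring *r, void *const *objs, uint32_t n)
{
    uint32_t free_slots = ST_RING_SIZE - st_ring_count(r);

    if (n > free_slots)
        n = free_slots;
    for (uint32_t i = 0; i < n; i++)
        r->slots[(r->head + i) & ST_RING_MASK] = objs[i];
    r->head += n;   /* plain store: no atomics or barriers, single thread only */
    return n;
}

/* Dequeue up to n objects; returns how many were actually fetched. */
static inline uint32_t
st_ring_dequeue_burst(struct st_ring *r, void **objs, uint32_t n)
{
    uint32_t avail = st_ring_count(r);

    if (n > avail)
        n = avail;
    for (uint32_t i = 0; i < n; i++)
        objs[i] = r->slots[(r->tail + i) & ST_RING_MASK];
    r->tail += n;
    return n;
}
```

Because there is no compare-and-swap and no memory barrier, such a container is only safe when one thread owns both ends, which matches the restriction discussed above and is why it could not be mixed with the regular rte_ring enqueue/dequeue functions.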
