On Tue, Sep 29, 2020 at 2:32 PM Nicolau, Radu <radu.nico...@intel.com> wrote: > > > On 9/28/2020 5:02 PM, Honnappa Nagarahalli wrote: > > <snip> > >>> Add minimum burst throughout the scheduler pipeline and a flush counter. > >>> Replace ring API calls with local single threaded implementation where > >>> possible. > >>> > >>> Signed-off-by: Radu Nicolau mailto:radu.nico...@intel.com > >>> > >>> Thanks for the patch, a few comments inline. > >>>
> >>> Why not make these APIs part of the rte_ring library? You could further > >> optimize them by keeping the indices on the same cacheline. > >>> I'm not sure there is any need for non thread-safe rings outside this > >> particular case. > >>> [Honnappa] I think if we add the APIs, we will find the use cases. > >>> But, more than that, I understand that rte_ring structure is exposed to > >>> the > >> application. The reason for doing that is the inline functions that > >> rte_ring > >> provides. IMO, we should still maintain modularity and should not use the > >> internals of the rte_ring structure outside of the library. > >>> +1 to that. > >>> > >>> BTW, is there any real perf benefit from such micor-optimisation? > >> I'd tend to view these as use-case specific, and I'm not sure we should > >> clutter > >> up the ring library with yet more functions, especially since they can't be > >> mixed with the existing enqueue/dequeue functions, since they don't use > >> the head pointers. > > IMO, the ring library is pretty organized with the recent addition of > > HTS/RTS modes. This can be one of the modes and should allow us to use the > > existing functions (though additional functions are required as well). > > The other concern I have is, this implementation can be further optimized > > by using a single cache line for the pointers. It uses 2 cache lines just > > because of the layout of the rte_ring structure. > > There was a question earlier about the performance improvements of this > > patch? Are there any % performance improvements that can be shared? > > It is also possible to change the above functions to use the head/tail > > pointers from producer or the consumer cache line alone to check for perf > > differences. > > I don't have a % for the final improvement for this change alone, but > there was some improvement in the memory overhead measurable during > development, which very likely resulted in the whole optimization having > more headroom. > > I agree that this may be further optimized, maybe by having a local > implementation of a ring-like container instead. Have we decided on the next steps for this patch? Is the plan to supersede this patch and have different one in rte_ring subsystem, >