Hi Ferruh, > -----Original Message----- > From: Ferruh Yigit <ferruh.yi...@amd.com> > Sent: Monday, January 29, 2024 18:36 > To: Dariusz Sosnowski <dsosnow...@nvidia.com>; Stephen Hemminger > <step...@networkplumber.org> > Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>; > Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>; Ori Kam > <or...@nvidia.com>; dev@dpdk.org > Subject: Re: [RFC] ethdev: fast path async flow API > > On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote: > > Hi all, > > > >> -----Original Message----- > >> From: Dariusz Sosnowski <dsosnow...@nvidia.com> > >> Sent: Tuesday, January 23, 2024 12:37 > >> To: Stephen Hemminger <step...@networkplumber.org> > >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>; > >> Ferruh Yigit <ferruh.yi...@amd.com>; Andrew Rybchenko > >> <andrew.rybche...@oktetlabs.ru>; Ori Kam <or...@nvidia.com>; > >> dev@dpdk.org > >> Subject: RE: [RFC] ethdev: fast path async flow API > >> > >> Hi Stephen, > >> > >> Sorry for such a late response. > >> > >> From: Stephen Hemminger <step...@networkplumber.org> > >> Sent: Thursday, January 4, 2024 02:08 > >>> On Wed, 3 Jan 2024 19:14:49 +0000 > >>> Dariusz Sosnowski <dsosnow...@nvidia.com> wrote: > >>>> In summary, in my opinion extending the async flow API with bulking > >>> capabilities or exposing the queue directly to the application is not > desirable. > >>>> This proposal aims to reduce the I-cache overhead in async flow API > >>>> by > >>> reusing the existing design pattern in DPDK - fast path functions > >>> are inlined to the application code and they call cached PMD callbacks. > >>> > >>> Inline needs to more discouraged in DPDK, because it only works if > >>> application ends up building with DPDK from source. > >>> It doesn't work for the Linux distro packaging model and symbol > >>> versioning, etc. > >> > >> I understand the problem. In that case, I have a proposal. > >> I had a chat with Thomas regarding this RFC, and he noticed that > >> there are 2 separate changes proposed here: > >> > >> 1. Per-port callbacks for async flow API. > >> - Moves specified callbacks to rte_flow_fp_ops struct and allow > >> PMDs to register them dynamically. > >> - Removes indirection at API level - no need to call > >> rte_flow_ops_get(). > >> - Removes checking if callbacks are defined - either the default > >> DPDK callback is used or the one provided by PMD. > >> 2. Make async flow API functions inlineable. > >> > >> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be > >> marked as deprecated for now and phased out later) and in my opinion > >> is less controversial compared to change (2). > >> > >> I'll rerun the benchmarks for both changes separately to compare > >> their benefits in isolation. > >> This would allow us to decide if change (2) is worth the drawbacks it > >> introduces. > >> > >> What do you think? > > > > I split the RFC into 2 parts: > > > > 1. Introduce per-port callbacks: > > - Introduce rte_flow_fp_ops struct - holds callbacks for fast path > > calls, for > each port. PMD registers callbacks through rte_flow_fp_ops_register(). > > - Relevant rte_flow_async_* functions just pass arguments to fast path > callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is > defined. > > - Biggest difference is the removal of callback access through > rte_flow_get_ops(). > > 2. Inline async flow API functions. > > - Relevant rte_flow_async_* functions definitions are moved to > > rte_flow.h > and made inlineable. > > > > Here are the results of the benchmark: > > > > | | Avg Insertion | Diff over baseline | Avg > > | Deletion | Diff over baseline | > > |-----------------------|---------------|--------------------|--------------|------------- > -------| > > | baseline (v24.03-rc0) | 5855.4 | | 9026.8 > > | | > > | applied (1) | 6384.2 | +528.8 (+9%) | 10054.2 > > | +1027.4 > (+11.4%) | > > | applied (2) | 6434.6 | +579.2 (+9.9%) | 10011.4 > > | +984.6 > (+10.9%) | > > > > Results are in KFlows/sec. > > The benchmark is continuously inserting and deleting 2M flow rules. > > These rules match on IPv4 destination address and with a single action > DROP. > > Flow rules are inserted and deleted using a single flow queue. > > > > Change (2) improves insertion rate performance by ~1% compared to (1), > but decreases deletion rate by ~0.5%. > > Based on these results, I think we can say that making rte_flow_async_*() > calls inlineable may not be worth it compared to the issues it causes. > > > > What do you all think about the results? > > > > Hi Dariusz, > > As discussed before, converting APIs to static inline functions or exposing > 'rte_eth_dev' has cons but with only applying first part above seems get rid > of > them with reasonable performance gain, so I think we can continue with this > approach and continue to reviews. > > Most of the 'rte_flow_async_*' are already missing the function validity > check, > so having a default (dummy?) rte_flow_fp_ops for them seems even an > improvement there. > > > Only previously 'struct rte_flow_ops' was coming from drivers, ethdev layer > doesn't need to maintain anything. > But with 'rte_flow_fp_ops' struct, ethdev needs to store another array with > fixed ('RTE_MAX_ETHPORTS') size which will be another blocker in the future > to convert this fixed arrays to dynamically allocated arrays. > > For this, does it still help we add an a new field to "struct rte_eth_dev", > like > "struct rte_flow_ops *flow_ops", similar to 'dev_ops'? > The indirection still will be there, but eliminate 'rte_flow_get_ops()' > call and checks comes with it. > If makes sense, is there a chance to experiment this and get some performance > numbers with it? Which option do you have in mind specifically?
1. Keeping only fast path callbacks in "dev->flow_ops", so "struct rte_eth_dev" will hold only a pointer to "struct rte_flow_fp_ops" as defined in RFC. - Only async flow API will use "dev->flow_ops->callback". - Other APIs will go through "rte_flow_ops_get()" 2. Keeping all flow callbacks in "dev->flow_ops", so "struct rte_flow_ops". - As a result, I think that "rte_flow_get_ops()" could be removed altogether assuming that all functions have default implementation, checking for port validity (ENODEV if port not valid) and returning ENOSYS. Regardless of the version, I can experiment with additional indirection. Best regards, Dariusz Sosnowski