> -----Original Message-----
> From: Ferruh Yigit <ferruh.yi...@amd.com>
> Sent: Tuesday, January 30, 2024 13:17
> To: Dariusz Sosnowski <dsosnow...@nvidia.com>; Stephen Hemminger
> <step...@networkplumber.org>
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>;
> Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>; Ori Kam
> <or...@nvidia.com>; dev@dpdk.org
> Subject: Re: [RFC] ethdev: fast path async flow API
>
> On 1/30/2024 12:06 PM, Dariusz Sosnowski wrote:
> > Hi Ferruh,
> >
> >> -----Original Message-----
> >> From: Ferruh Yigit <ferruh.yi...@amd.com>
> >> Sent: Monday, January 29, 2024 18:36
> >> To: Dariusz Sosnowski <dsosnow...@nvidia.com>; Stephen Hemminger
> >> <step...@networkplumber.org>
> >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>;
> >> Andrew Rybchenko <andrew.rybche...@oktetlabs.ru>; Ori Kam
> >> <or...@nvidia.com>; dev@dpdk.org
> >> Subject: Re: [RFC] ethdev: fast path async flow API
> >>
> >> On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
> >>> Hi all,
> >>>
> >>>> -----Original Message-----
> >>>> From: Dariusz Sosnowski <dsosnow...@nvidia.com>
> >>>> Sent: Tuesday, January 23, 2024 12:37
> >>>> To: Stephen Hemminger <step...@networkplumber.org>
> >>>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>;
> >>>> Ferruh Yigit <ferruh.yi...@amd.com>; Andrew Rybchenko
> >>>> <andrew.rybche...@oktetlabs.ru>; Ori Kam <or...@nvidia.com>;
> >>>> dev@dpdk.org
> >>>> Subject: RE: [RFC] ethdev: fast path async flow API
> >>>>
> >>>> Hi Stephen,
> >>>>
> >>>> Sorry for such a late response.
> >>>>
> >>>> From: Stephen Hemminger <step...@networkplumber.org>
> >>>> Sent: Thursday, January 4, 2024 02:08
> >>>>> On Wed, 3 Jan 2024 19:14:49 +0000
> >>>>> Dariusz Sosnowski <dsosnow...@nvidia.com> wrote:
> >>>>>> In summary, in my opinion extending the async flow API with
> >>>>>> bulking capabilities or exposing the queue directly to the
> >>>>>> application is not desirable.
> >>>>>> This proposal aims to reduce the I-cache overhead in the async
> >>>>>> flow API by reusing the existing design pattern in DPDK - fast
> >>>>>> path functions are inlined into the application code and they
> >>>>>> call cached PMD callbacks.
> >>>>>
> >>>>> Inline needs to be more discouraged in DPDK, because it only works
> >>>>> if the application ends up building with DPDK from source.
> >>>>> It doesn't work for the Linux distro packaging model, symbol
> >>>>> versioning, etc.
> >>>>
> >>>> I understand the problem. In that case, I have a proposal.
> >>>> I had a chat with Thomas regarding this RFC, and he noticed that
> >>>> there are 2 separate changes proposed here:
> >>>>
> >>>> 1. Per-port callbacks for async flow API.
> >>>>    - Moves specified callbacks to the rte_flow_fp_ops struct and
> >>>>      allows PMDs to register them dynamically.
> >>>>    - Removes indirection at the API level - no need to call
> >>>>      rte_flow_ops_get().
> >>>>    - Removes checking if callbacks are defined - either the default
> >>>>      DPDK callback is used or the one provided by the PMD.
> >>>> 2. Make async flow API functions inlineable.
> >>>>
> >>>> Change (1) won't break ABI (existing callbacks in rte_flow_ops can
> >>>> be marked as deprecated for now and phased out later) and, in my
> >>>> opinion, is less controversial compared to change (2).
> >>>>
> >>>> I'll rerun the benchmarks for both changes separately to compare
> >>>> their benefits in isolation.
> >>>> This would allow us to decide if change (2) is worth the drawbacks
> >>>> it introduces.
> >>>>
> >>>> What do you think?
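
As a quick recap for anyone joining the thread here, change (1) boils
down to roughly the sketch below. It is illustrative only and heavily
simplified - the rte_ prefixes are dropped and the callback signatures
are placeholders; the real callbacks mirror the rte_flow_async_*()
parameters.

/* Illustrative sketch of change (1) - not the actual RFC code. */
#include <errno.h>
#include <stdint.h>

#define MAX_PORTS 32    /* stands in for RTE_MAX_ETHPORTS */

struct flow_fp_ops {    /* stands in for rte_flow_fp_ops */
        int (*async_create)(uint16_t port_id, uint32_t queue_id, void *rule);
        int (*async_destroy)(uint16_t port_id, uint32_t queue_id, void *rule);
        /* ...remaining async flow callbacks... */
};

/* Default callback, used until a PMD registers its own. */
static int dummy_cb(uint16_t port_id, uint32_t queue_id, void *rule)
{
        (void)port_id; (void)queue_id; (void)rule;
        return -ENOSYS;
}

/* Per-port table owned by the ethdev layer (GCC range initializer). */
static struct flow_fp_ops fp_ops[MAX_PORTS] = {
        [0 ... MAX_PORTS - 1] = { dummy_cb, dummy_cb },
};

/* A PMD calls this (e.g. at configure time) to install its callbacks. */
int flow_fp_ops_register(uint16_t port_id, const struct flow_fp_ops *ops)
{
        if (port_id >= MAX_PORTS || ops == NULL)
                return -EINVAL;
        fp_ops[port_id] = *ops;
        return 0;
}

/* The API entry point becomes a plain forward: no rte_flow_ops_get()
 * call and no per-call NULL checks unless RTE_FLOW_DEBUG is defined. */
int flow_async_destroy(uint16_t port_id, uint32_t queue_id, void *rule)
{
#ifdef RTE_FLOW_DEBUG
        if (port_id >= MAX_PORTS)
                return -ENODEV;
#endif
        return fp_ops[port_id].async_destroy(port_id, queue_id, rule);
}

Change (2) on top of this would move such wrappers into rte_flow.h as
static inline functions - that is the difference between the
"applied (1)" and "applied (2)" rows in the benchmark below.
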
> >>>
> >>> I split the RFC into 2 parts:
> >>>
> >>> 1. Introduce per-port callbacks:
> >>>    - Introduce the rte_flow_fp_ops struct - it holds callbacks for
> >>>      fast path calls, for each port. The PMD registers callbacks
> >>>      through rte_flow_fp_ops_register().
> >>>    - Relevant rte_flow_async_* functions just pass arguments to the
> >>>      fast path callbacks. Validation checks are done only if the
> >>>      RTE_FLOW_DEBUG macro is defined.
> >>>    - The biggest difference is the removal of callback access
> >>>      through rte_flow_ops_get().
> >>> 2. Inline async flow API functions.
> >>>    - Relevant rte_flow_async_* function definitions are moved to
> >>>      rte_flow.h and made inlineable.
> >>>
> >>> Here are the results of the benchmark:
> >>>
> >>> |                       | Avg Insertion | Diff over baseline | Avg Deletion | Diff over baseline |
> >>> |-----------------------|---------------|--------------------|--------------|--------------------|
> >>> | baseline (v24.03-rc0) | 5855.4        |                    | 9026.8       |                    |
> >>> | applied (1)           | 6384.2        | +528.8 (+9%)       | 10054.2      | +1027.4 (+11.4%)   |
> >>> | applied (2)           | 6434.6        | +579.2 (+9.9%)     | 10011.4      | +984.6 (+10.9%)    |
> >>>
> >>> Results are in KFlows/sec.
> >>> The benchmark continuously inserts and deletes 2M flow rules.
> >>> These rules match on the IPv4 destination address and have a single
> >>> DROP action.
> >>> Flow rules are inserted and deleted using a single flow queue.
> >>>
> >>> Change (2) improves the insertion rate by ~1% compared to (1), but
> >>> decreases the deletion rate by ~0.5%.
> >>> Based on these results, I think we can say that making
> >>> rte_flow_async_*() calls inlineable may not be worth it compared to
> >>> the issues it causes.
> >>>
> >>> What do you all think about the results?
> >>>
> >>
> >> Hi Dariusz,
> >>
> >> As discussed before, converting APIs to static inline functions or
> >> exposing 'rte_eth_dev' has cons, but applying only the first part
> >> above seems to get rid of them with a reasonable performance gain,
> >> so I think we can continue with this approach and continue with
> >> reviews.
> >>
> >> Most of the 'rte_flow_async_*' functions are already missing the
> >> function validity check, so having a default (dummy?)
> >> rte_flow_fp_ops for them seems even an improvement there.
> >>
> >>
> >> Previously only 'struct rte_flow_ops' was coming from drivers, and
> >> the ethdev layer didn't need to maintain anything.
> >> But with the 'rte_flow_fp_ops' struct, ethdev needs to store another
> >> array of fixed ('RTE_MAX_ETHPORTS') size, which will be another
> >> blocker in the future to converting these fixed arrays to
> >> dynamically allocated arrays.
> >>
> >> For this, does it still help if we add a new field to "struct
> >> rte_eth_dev", like "struct rte_flow_ops *flow_ops", similar to
> >> 'dev_ops'?
> >> The indirection will still be there, but it eliminates the
> >> 'rte_flow_ops_get()' call and the checks that come with it.
> >> If it makes sense, is there a chance to experiment with this and get
> >> some performance numbers?
> > Which option do you have in mind specifically?
> >
> > 1. Keeping only fast path callbacks in "dev->flow_ops", so "struct
> >    rte_eth_dev" will hold only a pointer to "struct rte_flow_fp_ops"
> >    as defined in the RFC.
> >    - Only the async flow API will use "dev->flow_ops->callback".
> >    - Other APIs will go through "rte_flow_ops_get()".
> > 2. Keeping all flow callbacks in "dev->flow_ops", so "struct
> >    rte_flow_ops".
> >    - As a result, I think that "rte_flow_ops_get()" could be removed
> >      altogether, assuming that all functions have a default
> >      implementation checking for port validity (ENODEV if the port is
> >      not valid) and returning ENOSYS.
> >
> > Regardless of the version, I can experiment with additional indirection.
> >
>
> Both can work, but each has some cons.
>
> I think "rte_flow_ops_get()" is the cleaner approach, and rte_flow APIs
> are already using it, so no change is required there. But using a
> separate struct for the fast path will create two different structs
> that drivers need to fill, and each is populated in a slightly
> different way, which is not nice.
>
> For this RFC I think we can go with option 1, and consider updating the
> rest if there is more motivation for it.
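
If I understand correctly, option 1 with the additional indirection
through "struct rte_eth_dev" would look roughly like the sketch below.
Again, this is illustrative only - the struct, field and function names
are placeholders, not the actual patch.

/* Illustrative sketch of option 1 - not the actual patch. */
#include <errno.h>
#include <stdint.h>

#define MAX_PORTS 32    /* stands in for RTE_MAX_ETHPORTS */

struct flow_fp_ops {    /* stands in for rte_flow_fp_ops */
        int (*async_create)(uint16_t port_id, uint32_t queue_id, void *rule);
        int (*async_destroy)(uint16_t port_id, uint32_t queue_id, void *rule);
};

struct eth_dev {        /* stands in for rte_eth_dev */
        /* Fast path flow callbacks only; slow path flow ops keep going
         * through rte_flow_ops_get(). */
        const struct flow_fp_ops *flow_ops;
};

/* Shared defaults: every port points here until its PMD registers real
 * callbacks, so the fast path never needs a NULL check. */
static int default_cb(uint16_t port_id, uint32_t queue_id, void *rule)
{
        (void)port_id; (void)queue_id; (void)rule;
        return -ENOSYS;
}

static const struct flow_fp_ops default_flow_ops = {
        .async_create = default_cb,
        .async_destroy = default_cb,
};

static struct eth_dev eth_devs[MAX_PORTS] = {
        [0 ... MAX_PORTS - 1] = { .flow_ops = &default_flow_ops },
};

/* Fast path API: one extra pointer dereference compared to a static
 * rte_flow_fp_ops array, but no rte_flow_ops_get() call and no
 * per-callback checks (port validation only under RTE_FLOW_DEBUG). */
int flow_async_destroy(uint16_t port_id, uint32_t queue_id, void *rule)
{
        struct eth_dev *dev;

#ifdef RTE_FLOW_DEBUG
        if (port_id >= MAX_PORTS)
                return -ENODEV;
#endif
        dev = &eth_devs[port_id];
        return dev->flow_ops->async_destroy(port_id, queue_id, rule);
}

The single pointer load per call is the only overhead this adds on top
of the static-array variant from the RFC.
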
Ok, sounds good to me.

I redid the benchmark with the "dev->flow_ops->callback()" method.
There is no statistically significant difference between this and the
static array of rte_flow_fp_ops, so I think we can go with option 1.

I'll send the patch for review as soon as possible.

Best regards,
Dariusz Sosnowski