On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote: > Hi all, > >> -----Original Message----- >> From: Dariusz Sosnowski <dsosnow...@nvidia.com> >> Sent: Tuesday, January 23, 2024 12:37 >> To: Stephen Hemminger <step...@networkplumber.org> >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>; >> Ferruh Yigit <ferruh.yi...@amd.com>; Andrew Rybchenko >> <andrew.rybche...@oktetlabs.ru>; Ori Kam <or...@nvidia.com>; >> dev@dpdk.org >> Subject: RE: [RFC] ethdev: fast path async flow API >> >> Hi Stephen, >> >> Sorry for such a late response. >> >> From: Stephen Hemminger <step...@networkplumber.org> >> Sent: Thursday, January 4, 2024 02:08 >>> On Wed, 3 Jan 2024 19:14:49 +0000 >>> Dariusz Sosnowski <dsosnow...@nvidia.com> wrote: >>>> In summary, in my opinion extending the async flow API with bulking >>> capabilities or exposing the queue directly to the application is not >>> desirable. >>>> This proposal aims to reduce the I-cache overhead in async flow API >>>> by >>> reusing the existing design pattern in DPDK - fast path functions are >>> inlined to the application code and they call cached PMD callbacks. >>> >>> Inline needs to more discouraged in DPDK, because it only works if >>> application ends up building with DPDK from source. >>> It doesn't work for the Linux distro packaging model and symbol >>> versioning, etc. >> >> I understand the problem. In that case, I have a proposal. >> I had a chat with Thomas regarding this RFC, and he noticed that there are 2 >> separate changes proposed here: >> >> 1. Per-port callbacks for async flow API. >> - Moves specified callbacks to rte_flow_fp_ops struct and allow PMDs to >> register them dynamically. >> - Removes indirection at API level - no need to call rte_flow_ops_get(). >> - Removes checking if callbacks are defined - either the default DPDK >> callback >> is used or the one provided by PMD. >> 2. Make async flow API functions inlineable. >> >> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be marked >> as deprecated for now and phased out later) and in my opinion is less >> controversial compared to change (2). >> >> I'll rerun the benchmarks for both changes separately to compare their >> benefits in isolation. >> This would allow us to decide if change (2) is worth the drawbacks it >> introduces. >> >> What do you think? > > I split the RFC into 2 parts: > > 1. Introduce per-port callbacks: > - Introduce rte_flow_fp_ops struct - holds callbacks for fast path calls, > for each port. PMD registers callbacks through rte_flow_fp_ops_register(). > - Relevant rte_flow_async_* functions just pass arguments to fast path > callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is defined. > - Biggest difference is the removal of callback access through > rte_flow_get_ops(). > 2. Inline async flow API functions. > - Relevant rte_flow_async_* functions definitions are moved to rte_flow.h > and made inlineable. > > Here are the results of the benchmark: > > | | Avg Insertion | Diff over baseline | Avg Deletion | > Diff over baseline | > |-----------------------|---------------|--------------------|--------------|--------------------| > | baseline (v24.03-rc0) | 5855.4 | | 9026.8 | > | > | applied (1) | 6384.2 | +528.8 (+9%) | 10054.2 | > +1027.4 (+11.4%) | > | applied (2) | 6434.6 | +579.2 (+9.9%) | 10011.4 | > +984.6 (+10.9%) | > > Results are in KFlows/sec. > The benchmark is continuously inserting and deleting 2M flow rules. > These rules match on IPv4 destination address and with a single action DROP. > Flow rules are inserted and deleted using a single flow queue. > > Change (2) improves insertion rate performance by ~1% compared to (1), but > decreases deletion rate by ~0.5%. > Based on these results, I think we can say that making rte_flow_async_*() > calls inlineable may not be worth it compared to the issues it causes. > > What do you all think about the results? >
Hi Dariusz, As discussed before, converting APIs to static inline functions or exposing 'rte_eth_dev' has cons but with only applying first part above seems get rid of them with reasonable performance gain, so I think we can continue with this approach and continue to reviews. Most of the 'rte_flow_async_*' are already missing the function validity check, so having a default (dummy?) rte_flow_fp_ops for them seems even an improvement there. Only previously 'struct rte_flow_ops' was coming from drivers, ethdev layer doesn't need to maintain anything. But with 'rte_flow_fp_ops' struct, ethdev needs to store another array with fixed ('RTE_MAX_ETHPORTS') size which will be another blocker in the future to convert this fixed arrays to dynamically allocated arrays. For this, does it still help we add an a new field to "struct rte_eth_dev", like "struct rte_flow_ops *flow_ops", similar to 'dev_ops'? The indirection still will be there, but eliminate 'rte_flow_get_ops()' call and checks comes with it. If makes sense, is there a chance to experiment this and get some performance numbers with it?