Hi Jerin,

> > > >
> > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > Right now it is not possible for the application to figure out
> > > > when it is safe to free removed callback handle and
> > > > associated with it resources(unless the queue is stopped).
> > > > That's probably not a big problem if all callbacks are static
> > > > hange through whole application lifetime)
> > > > and/or application doesn't allocate any resources for the callback 
> > > > handler.
> > > > Though if callbacks have to be added/removed dynamically and
> > > > callback handler would require more resources to operate properly -
> > > > then it might become an issue.
> > > > So patch #2 fixes that problem - now as soon as
> > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > can safely free all associated with the removed callback resources.
> > > >
> > > > Performance impact:
> > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > reveal any performance degradation.
> > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > some slowdown.
> > > >
> > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > All that in terms of packet throughput (Mpps).
> > >
> > > We tried on an arm64 machine; We got following result on host and guest
> > > using l3fwd.
> >
> > Thanks for testing it.
> > So these numbers below - are for l3fwd running with or without rx callback 
> > installed?
> > i.e. with or without '--parse-ptype' option (and on a NIC with/without 
> > ptype HW support)?
> 
> without '--parse-ptype' option(i.e without rx callback installed).
> That makes it sad, Even though application is not using the rx callback,
> We need to pay on a average 3% drop in performance(based on the
> use case/arch/SoC/topology etc)

Yes it is sad...
Do you have any idea why that happens?
Is that the same situation you described before: for arm compiler
often replaces small branches with conditional execution instructions
and miss-prediction for such instructions is not free?
Or is it something else?

Another thing - the series contains 2 main patches:
1) http://dpdk.org/dev/patchwork/patch/31866/
Introduces eth_queue_local and moves callback data into it.
That adds one extra memory reference for rx/tx function to read cb func and 
data.
2) http://dpdk.org/dev/patchwork/patch/31867/
Introduces some synchronization for cb data.

Is it possible to rerun your tests with just patch #1 installed?
Hopefully it would help us to figure out which of the patches (#1 or #2)  
is the main source of slowdown on arm machines.

> 
> >
> > >
> > >
> > > Note:
> > > +ve number means "Performance regression with patches"
> > > -ve number means "Performance improvement with patches"
> > >
> > > Relative performance difference in percentage:
> > >
> > > % difference on host (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64        1.41    0.18    0.51    0.29
> > > 72        0.50    0.53    0.09    0.19
> > > 128       0.31    0.31    0.42    0.00
> > > 256       -0.44   -0.44   0.00    0.00
> > > 512       1.94    1.94    0.01    0.01
> > > 1024      0.00    0.01    0.00    0.01
> > > 1518      0.02    0.01    0.02    0.02
> > >
> > > % difference on guest (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64      5.78      2.30    -2.45   -0.01
> > > 72        1.13    1.13    1.31    -4.29
> > > 128       1.36    1.36    1.54    -0.82
> > > 256       2.02    2.03    -0.01   -0.35
> > > 512       4.05    4.04    0.00    -0.46
> > > 1024      0.39    0.38    0.00    -0.74
> > > 1518      0.00    0.00    -0.05   -4.20
> > >
> > > I think, it is because we added more code under
> > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > default.
> > >
> > > I think, the impact will vary based on
> > > architecture and underneath i-cache, d-cache and IPC.
> > >
> > > I am just thinking, How we can avoid such impact?
> > > How about,
> > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > # ethdev layer gives helper function to invoke
> > > callbacks to driver.
> >
> > I suppose you suggest that user would have to decide at queue_setup()
> > stage does he wants a callback(s) for RX/TX or no?
> 
> Yes. queue_setup or configure() time.
> 
> > As I can see, the main drawback with that approach - it would introduce new 
> > limitation.
> > Right now it is possible to add/remove RX/TX callback at runtime
> > (without stopping and reconfiguring the queue), and I suppose we do want to 
> > keep that ability.
> 
> For those _application_ where it can not decide or it need to add without
> stopping and re configuring the queue still can configure/enable the ethdev
> device with RX/TX callback option(and pay price for it)
> 
> > Again in some cases user can't decide does he wants a callback installed or 
> > not,
> > before he actually calls dev_start().
> > l3fwd ptype-parse callback is an example of such situation:
> > right now user has to start device first to figure out which ptypes are 
> > supported by this PMD
> > with desired configuration, then based on this information he will (or will 
> > not) setup the callback.
> 
> Good point. Can we remove that limitation from PMD perspective? IMO, 
> ptype-parse is more of HW feature,
> All we need to do a lookup(single load) to convert to DPDK packet type.
> http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> 
> I see almost all the drivers has dummy check like this, I think, That makes it
> dependency with start().
> 
>         if (dev->rx_pkt_burst == i40e_recv_pkts ||
>             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
>             dev->rx_pkt_burst == i40e_recv_pkts_vec)
>                 return ptypes;
> 

We probably can, but it would require extra work and changes inside the PMDs.
i40e is a good one - all rx functions support the same ptypes.
But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
and generic scalar RX do.
For fm10k - vector RX and scalar RX also support different ptype sets.  

> 
> > What I thought in same direction - introduce a notion of 'static' and 
> > 'dynamic' callbacks.
> > Static ones can't be removed while queue is not stopped while dynamic can.
> > But that also requires user to select at queue_setup() stage does he wants 
> > dynamic
> > callbacks for that queue or not (new offload flag or extra field in 
> > rxconf/txconf structure).
> > So again - same limitation for the user.
> > Another possibility here - have a flag 'dynamic' or 'static' not per queue, 
> > but per individual callback.
> > Yes, that would mean API change for add/remove callback functions.
> 
> At least on my test, the regression was caused because adding the code in
> tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> see any regression. If so, Will such regression fixed by introducing
> 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.

Probably no then, but let's try the experiment I mentioned above first.

> 
> For me, I think, The better option is request from application for the
> need for Rx/TX callbacks and set the handler appropriate in the driver to
> have zero impact for the application which does not need callback at all.

It is possible, but:
- that would be quite significant change and would affect each RX and TX 
function   in each PMD.
- from other side - even now there are several libs inside DPDK that rely on 
ability to add/remove
  callbacks on the fly - pdump, latencystats.
  ethdev rx profiling  also relies on ability to install RX callback.

With all these things I presume most apps will configure their devices with 
rx/tx callback support enabled anyway.
Which makes me think - is it worth to introduce such changes at all.
Anyway, my first suggestion would be let's try to figure out what exactly 
causes such big slowdown on ARM.
Konstantin

> 
> at minimum, on those PMDs that support detecting the supported ptypes without
> device start()
> 
> 
> 
> 
> >
> > Konstantin
> >
> > > But, don't invoke from common code
> > > # When application needs the callback support, it
> > > configured the given RX/TX queue offloads
> > > # If the Rx/TX callback configure by the application
> > > then driver calls the helper functions exposed by
> > > the common code to invoke RX callbacks.
> >
> >
> > >
> > > Jerin
> > >
> > > >
> > > > Ability to safely remove callbacks at runtime implies
> > > > some sort of synchronization.
> > > > Even I tried to make it as light as possible,
> > > > probably some slowdown is unavoidable.
> > > > Of course instead of introducing these changes at rte_ethdev layer
> > > > similar technique could be applied on individual callback basis.
> > > > In that case it would be up to callback writer/installer to decide
> > > > does he/she need a removable at runtime callback or not.
> > > > Though in that case, each installed callback might introduce extra
> > > > synchronization overhead and slowdown.
> > > >
> > > > Konstantin Ananyev (3):
> > > >   ethdev: introduce eth_queue_local
> > > >   ethdev: make it safe to remove rx/tx callback at runtime
> > > >   doc: ethdev ABI change deprecation notice
> > > >
> > > >  doc/guides/rel_notes/deprecation.rst |   5 +
> > > >  lib/librte_ether/rte_ethdev.c        | 390 
> > > > ++++++++++++++++++++++-------------
> > > >  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
> > > >  3 files changed, 387 insertions(+), 182 deletions(-)
> > > >
> > > > --
> > > > 2.13.5
> > > >

Reply via email to