Hi Jerin, > > > > > > > > 2. Make it safe to remove rx/tx callback at runtime. > > > > Right now it is not possible for the application to figure out > > > > when it is safe to free removed callback handle and > > > > associated with it resources(unless the queue is stopped). > > > > That's probably not a big problem if all callbacks are static > > > > hange through whole application lifetime) > > > > and/or application doesn't allocate any resources for the callback > > > > handler. > > > > Though if callbacks have to be added/removed dynamically and > > > > callback handler would require more resources to operate properly - > > > > then it might become an issue. > > > > So patch #2 fixes that problem - now as soon as > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application > > > > can safely free all associated with the removed callback resources. > > > > > > > > Performance impact: > > > > If application doesn't use RX/TX callbacks, then the tests I run didn't > > > > reveal any performance degradation. > > > > Though if application do use RX/TX callbacks - patch #2 does introduce > > > > some slowdown. > > > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4) > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got: > > > > 1) testpmd ... --latencystats=1 - slowdown < 1% > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1% > > > > 3) examples/rxtx_callbacks - slowdown ~8% > > > > All that in terms of packet throughput (Mpps). > > > > > > We tried on an arm64 machine; We got following result on host and guest > > > using l3fwd. > > > > Thanks for testing it. > > So these numbers below - are for l3fwd running with or without rx callback > > installed? > > i.e. with or without '--parse-ptype' option (and on a NIC with/without > > ptype HW support)? > > without '--parse-ptype' option(i.e without rx callback installed). > That makes it sad, Even though application is not using the rx callback, > We need to pay on a average 3% drop in performance(based on the > use case/arch/SoC/topology etc)
Yes it is sad... Do you have any idea why that happens? Is that the same situation you described before: for arm compiler often replaces small branches with conditional execution instructions and miss-prediction for such instructions is not free? Or is it something else? Another thing - the series contains 2 main patches: 1) http://dpdk.org/dev/patchwork/patch/31866/ Introduces eth_queue_local and moves callback data into it. That adds one extra memory reference for rx/tx function to read cb func and data. 2) http://dpdk.org/dev/patchwork/patch/31867/ Introduces some synchronization for cb data. Is it possible to rerun your tests with just patch #1 installed? Hopefully it would help us to figure out which of the patches (#1 or #2) is the main source of slowdown on arm machines. > > > > > > > > > > > > Note: > > > +ve number means "Performance regression with patches" > > > -ve number means "Performance improvement with patches" > > > > > > Relative performance difference in percentage: > > > > > > % difference on host (2 x 40G) > > > pkt_sz 1c 2c 4c 8c > > > 64 1.41 0.18 0.51 0.29 > > > 72 0.50 0.53 0.09 0.19 > > > 128 0.31 0.31 0.42 0.00 > > > 256 -0.44 -0.44 0.00 0.00 > > > 512 1.94 1.94 0.01 0.01 > > > 1024 0.00 0.01 0.00 0.01 > > > 1518 0.02 0.01 0.02 0.02 > > > > > > % difference on guest (2 x 40G) > > > pkt_sz 1c 2c 4c 8c > > > 64 5.78 2.30 -2.45 -0.01 > > > 72 1.13 1.13 1.31 -4.29 > > > 128 1.36 1.36 1.54 -0.82 > > > 256 2.02 2.03 -0.01 -0.35 > > > 512 4.05 4.04 0.00 -0.46 > > > 1024 0.39 0.38 0.00 -0.74 > > > 1518 0.00 0.00 -0.05 -4.20 > > > > > > I think, it is because we added more code under > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by > > > default. > > > > > > I think, the impact will vary based on > > > architecture and underneath i-cache, d-cache and IPC. > > > > > > I am just thinking, How we can avoid such impact? > > > How about, > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option > > > # Hook Rx/TX callbacks under TX/RX offload flags > > > # ethdev layer gives helper function to invoke > > > callbacks to driver. > > > > I suppose you suggest that user would have to decide at queue_setup() > > stage does he wants a callback(s) for RX/TX or no? > > Yes. queue_setup or configure() time. > > > As I can see, the main drawback with that approach - it would introduce new > > limitation. > > Right now it is possible to add/remove RX/TX callback at runtime > > (without stopping and reconfiguring the queue), and I suppose we do want to > > keep that ability. > > For those _application_ where it can not decide or it need to add without > stopping and re configuring the queue still can configure/enable the ethdev > device with RX/TX callback option(and pay price for it) > > > Again in some cases user can't decide does he wants a callback installed or > > not, > > before he actually calls dev_start(). > > l3fwd ptype-parse callback is an example of such situation: > > right now user has to start device first to figure out which ptypes are > > supported by this PMD > > with desired configuration, then based on this information he will (or will > > not) setup the callback. > > Good point. Can we remove that limitation from PMD perspective? IMO, > ptype-parse is more of HW feature, > All we need to do a lookup(single load) to convert to DPDK packet type. > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258 > > I see almost all the drivers has dummy check like this, I think, That makes it > dependency with start(). > > if (dev->rx_pkt_burst == i40e_recv_pkts || > dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc || > dev->rx_pkt_burst == i40e_recv_scattered_pkts || > dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec || > dev->rx_pkt_burst == i40e_recv_pkts_vec) > return ptypes; > We probably can, but it would require extra work and changes inside the PMDs. i40e is a good one - all rx functions support the same ptypes. But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX and generic scalar RX do. For fm10k - vector RX and scalar RX also support different ptype sets. > > > What I thought in same direction - introduce a notion of 'static' and > > 'dynamic' callbacks. > > Static ones can't be removed while queue is not stopped while dynamic can. > > But that also requires user to select at queue_setup() stage does he wants > > dynamic > > callbacks for that queue or not (new offload flag or extra field in > > rxconf/txconf structure). > > So again - same limitation for the user. > > Another possibility here - have a flag 'dynamic' or 'static' not per queue, > > but per individual callback. > > Yes, that would mean API change for add/remove callback functions. > > At least on my test, the regression was caused because adding the code in > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't > see any regression. If so, Will such regression fixed by introducing > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback. Probably no then, but let's try the experiment I mentioned above first. > > For me, I think, The better option is request from application for the > need for Rx/TX callbacks and set the handler appropriate in the driver to > have zero impact for the application which does not need callback at all. It is possible, but: - that would be quite significant change and would affect each RX and TX function in each PMD. - from other side - even now there are several libs inside DPDK that rely on ability to add/remove callbacks on the fly - pdump, latencystats. ethdev rx profiling also relies on ability to install RX callback. With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway. Which makes me think - is it worth to introduce such changes at all. Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM. Konstantin > > at minimum, on those PMDs that support detecting the supported ptypes without > device start() > > > > > > > > Konstantin > > > > > But, don't invoke from common code > > > # When application needs the callback support, it > > > configured the given RX/TX queue offloads > > > # If the Rx/TX callback configure by the application > > > then driver calls the helper functions exposed by > > > the common code to invoke RX callbacks. > > > > > > > > > > Jerin > > > > > > > > > > > Ability to safely remove callbacks at runtime implies > > > > some sort of synchronization. > > > > Even I tried to make it as light as possible, > > > > probably some slowdown is unavoidable. > > > > Of course instead of introducing these changes at rte_ethdev layer > > > > similar technique could be applied on individual callback basis. > > > > In that case it would be up to callback writer/installer to decide > > > > does he/she need a removable at runtime callback or not. > > > > Though in that case, each installed callback might introduce extra > > > > synchronization overhead and slowdown. > > > > > > > > Konstantin Ananyev (3): > > > > ethdev: introduce eth_queue_local > > > > ethdev: make it safe to remove rx/tx callback at runtime > > > > doc: ethdev ABI change deprecation notice > > > > > > > > doc/guides/rel_notes/deprecation.rst | 5 + > > > > lib/librte_ether/rte_ethdev.c | 390 > > > > ++++++++++++++++++++++------------- > > > > lib/librte_ether/rte_ethdev.h | 174 ++++++++++++---- > > > > 3 files changed, 387 insertions(+), 182 deletions(-) > > > > > > > > -- > > > > 2.13.5 > > > >