Re: [PATCH] doc: announce marking device and driver objects as internal
On 7/10/22 09:17, David Marchand wrote:
> rte_driver and rte_device are unnecessarily exposed in the public API/ABI.
> Announce that they will be made opaque in the public API and mark the
> associated API as internal.
> This impacts all buses, as their driver registration mechanism will be
> made internal.
>
> Note: the PCI bus had a similar deprecation notice that we can remove, as
> the new one is more generic.
>
> Signed-off-by: David Marchand

Acked-by: Andrew Rybchenko
[PATCH v3] common/mlx5: update DevX error logging
Currently, the PMD logs all DevX errors at the error level.

The DevX interface can fail queue counter allocation on some hardware
types; that is a known issue. The PMD falls back to the Verbs API to
allocate queue counters when it detects the failure, so that DevX
failure should not be logged as a PMD error.

The patch provides DevX with a flexible API that selects the log level.

Signed-off-by: Gregory Etelson
Acked-by: Matan Azrad
---
v2: fix warnings in old gcc versions
v3: remove cc stable
---
 drivers/common/mlx5/mlx5_devx_cmds.c | 98 +++-
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index 8880a9f3b5..fb33023138 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -20,20 +20,29 @@
 #define MLX5_DEVX_ERR_RC(x) ((x) > 0 ? -(x) : ((x) < 0 ? (x) : -1))
 
-static void
-mlx5_devx_err_log(void *out, const char *reason,
-                  const char *param, uint32_t value)
-{
-        rte_errno = errno;
-        if (!param)
-                DRV_LOG(ERR, "DevX %s failed errno=%d status=%#x syndrome=%#x",
-                        reason, errno, MLX5_FW_STATUS(out),
-                        MLX5_FW_SYNDROME(out));
-        else
-                DRV_LOG(ERR, "DevX %s %s=%#X failed errno=%d status=%#x syndrome=%#x",
-                        reason, param, value, errno, MLX5_FW_STATUS(out),
-                        MLX5_FW_SYNDROME(out));
-}
+#define DEVX_DRV_LOG(level, out, reason, param, value) \
+do { \
+        /* \
+         * Some (old) GCC compilers like 7.5.0 and aarch64 GCC 7.1-2017.08 \
+         * do not expand the macro correctly when `param` \
+         * is `NULL`. \
+         * Use `local_param` to avoid direct `NULL` expansion. \
+         */ \
+        const char *local_param = (const char *)param; \
+        \
+        rte_errno = errno; \
+        if (!local_param) { \
+                DRV_LOG(level, \
+                        "DevX %s failed errno=%d status=%#x syndrome=%#x", \
+                        (reason), errno, MLX5_FW_STATUS((out)), \
+                        MLX5_FW_SYNDROME((out))); \
+        } else { \
+                DRV_LOG(level, \
+                        "DevX %s %s=%#X failed errno=%d status=%#x syndrome=%#x",\
+                        (reason), local_param, (value), errno, \
+                        MLX5_FW_STATUS((out)), MLX5_FW_SYNDROME((out))); \
+        } \
+} while (0)
 
 static void *
 mlx5_devx_get_hca_cap(void *ctx, uint32_t *in, uint32_t *out,
@@ -49,7 +58,7 @@ mlx5_devx_get_hca_cap(void *ctx, uint32_t *in, uint32_t *out,
         MLX5_SET(query_hca_cap_in, in, op_mod, flags);
         rc = mlx5_glue->devx_general_cmd(ctx, in, size_in, out, size_out);
         if (rc || MLX5_FW_STATUS(out)) {
-                mlx5_devx_err_log(out, "HCA capabilities", "func", flags >> 1);
+                DEVX_DRV_LOG(ERR, out, "HCA capabilities", "func", flags >> 1);
                 if (err)
                         *err = MLX5_DEVX_ERR_RC(rc);
                 return NULL;
@@ -102,7 +111,7 @@ mlx5_devx_cmd_register_read(void *ctx, uint16_t reg_id, uint32_t arg,
                                          MLX5_ST_SZ_BYTES(access_register_out) +
                                          sizeof(uint32_t) * dw_cnt);
         if (rc || MLX5_FW_STATUS(out)) {
-                mlx5_devx_err_log(out, "read access", "NIC register", reg_id);
+                DEVX_DRV_LOG(ERR, out, "read access", "NIC register", reg_id);
                 return MLX5_DEVX_ERR_RC(rc);
         }
         memcpy(data, &out[MLX5_ST_SZ_DW(access_register_out)],
@@ -153,7 +162,7 @@ mlx5_devx_cmd_register_write(void *ctx, uint16_t reg_id, uint32_t arg,
         memcpy(ptr, data, dw_cnt * sizeof(uint32_t));
         rc = mlx5_glue->devx_general_cmd(ctx, in, sizeof(in), out, sizeof(out));
         if (rc || MLX5_FW_STATUS(out)) {
-                mlx5_devx_err_log(out, "write access", "NIC register", reg_id);
+                DEVX_DRV_LOG(ERR, out, "write access", "NIC register", reg_id);
                 return MLX5_DEVX_ERR_RC(rc);
         }
         rc = mlx5_glue->devx_general_cmd(ct
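For illustration, the new macro lets a call site choose a lower severity for a failure that is expected and handled. A minimal, hypothetical sketch of such a call site (the queue-counter hunk itself is not part of the quoted diff; `rc` and `out` are assumed to come from the surrounding DevX command):

    if (rc || MLX5_FW_STATUS(out)) {
            /* Expected on some hardware types: log at DEBUG instead of ERR
             * and let the caller fall back to the Verbs queue counter
             * allocation.
             */
            DEVX_DRV_LOG(DEBUG, out, "create queue counter", NULL, 0);
            return MLX5_DEVX_ERR_RC(rc);
    }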
Re: [PATCH v2] ip_frag: add IPv4 fragment copy packet API
> Some NIC drivers support the MBUF_FAST_FREE offload (device supports an
> optimization for fast release of mbufs; when set, the application must
> guarantee that, per queue, all mbufs come from the same mempool and have
> refcnt = 1). In order to adapt to this offload function, add this API.
> Add some test data for this API.
>
> Signed-off-by: Huichao Cai
> ---
...
> --- a/lib/ip_frag/rte_ip_frag.h
> +++ b/lib/ip_frag/rte_ip_frag.h
> @@ -179,6 +179,32 @@ int32_t rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
>                 struct rte_mempool *pool_indirect);
>
>  /**
> + * IPv4 fragmentation by copy.
> + *
> + * This function implements the fragmentation of IPv4 packets by copy.
> + *
> + * @param pkt_in
> + *   The input packet.
> + * @param pkts_out
> + *   Array storing the output fragments.
> + * @param nb_pkts_out
> + *   The size of the array that stores the output fragments.
> + * @param mtu_size
> + *   Size in bytes of the Maximum Transfer Unit (MTU) for the outgoing IPv4
> + *   datagrams. This value includes the size of the IPv4 header.
> + * @return
> + *   Upon successful completion - number of output fragments placed
> + *   in the pkts_out array.
> + *   Otherwise - (-1) * errno.
> + */
> +__rte_experimental
> +int32_t
> +rte_ipv4_fragment_copy_packet(struct rte_mbuf *pkt_in,
> +        struct rte_mbuf **pkts_out,
> +        uint16_t nb_pkts_out,
> +        uint16_t mtu_size);
> +
> +/**
>  * This function implements reassembly of fragmented IPv4 packets.
>  * Incoming mbufs should have its l2_len/l3_len fields setup correctly.
>  *
> diff --git a/lib/ip_frag/rte_ipv4_fragmentation.c b/lib/ip_frag/rte_ipv4_fragmentation.c
> index a562424..9e050cc 100644
> --- a/lib/ip_frag/rte_ipv4_fragmentation.c
> +++ b/lib/ip_frag/rte_ipv4_fragmentation.c
> @@ -262,3 +262,168 @@ static inline uint16_t __create_ipopt_frag_hdr(uint8_t *iph,
>
>         return out_pkt_pos;
>  }
> +
> +/**
> + * IPv4 fragmentation by copy.
> + *
> + * This function implements the fragmentation of IPv4 packets by copy.
> + *
> + * @param pkt_in
> + *   The input packet.
> + * @param pkts_out
> + *   Array storing the output fragments.
> + * @param nb_pkts_out
> + *   The size of the array that stores the output fragments.
> + * @param mtu_size
> + *   Size in bytes of the Maximum Transfer Unit (MTU) for the outgoing IPv4
> + *   datagrams. This value includes the size of the IPv4 header.
> + * @return
> + *   Upon successful completion - number of output fragments placed
> + *   in the pkts_out array.
> + *   Otherwise - (-1) * errno.
> + */
> +int32_t
> +rte_ipv4_fragment_copy_packet(struct rte_mbuf *pkt_in,
> +        struct rte_mbuf **pkts_out,
> +        uint16_t nb_pkts_out,
> +        uint16_t mtu_size)
> +{
> +        struct rte_mbuf *in_seg = NULL;
> +        struct rte_ipv4_hdr *in_hdr;
> +        uint32_t out_pkt_pos, in_seg_data_pos;
> +        uint32_t more_in_segs;
> +        uint16_t fragment_offset, flag_offset, frag_size, header_len;
> +        uint16_t frag_bytes_remaining;
> +        uint8_t ipopt_frag_hdr[IPV4_HDR_MAX_LEN];
> +        uint16_t ipopt_len;
> +
> +        /*
> +         * Formal parameter checking.
> +         */
> +        if (unlikely(pkt_in == NULL) || unlikely(pkts_out == NULL) ||
> +            unlikely(nb_pkts_out == 0) ||
> +            unlikely(pkt_in->pool == NULL) ||
> +            unlikely(mtu_size < RTE_ETHER_MIN_MTU))
> +                return -EINVAL;
> +
> +        in_hdr = rte_pktmbuf_mtod(pkt_in, struct rte_ipv4_hdr *);
> +        header_len = (in_hdr->version_ihl & RTE_IPV4_HDR_IHL_MASK) *
> +            RTE_IPV4_IHL_MULTIPLIER;
> +
> +        /* Check IP header length */
> +        if (unlikely(pkt_in->data_len < header_len) ||
> +            unlikely(mtu_size < header_len))
> +                return -EINVAL;
> +
> +        /*
> +         * Ensure the IP payload length of all fragments is aligned to a
> +         * multiple of 8 bytes as per RFC791 section 2.3.
> +         */
> +        frag_size = RTE_ALIGN_FLOOR((mtu_size - header_len),
> +            IPV4_HDR_FO_ALIGN);
> +
> +        flag_offset = rte_cpu_to_be_16(in_hdr->fragment_offset);
> +
> +        /* If Don't Fragment flag is set */
> +        if (unlikely((flag_offset & IPV4_HDR_DF_MASK) != 0))
> +                return -ENOTSUP;
> +
> +        /* Check that pkts_out is big enough to hold all fragments */
> +        if (unlikely(frag_size * nb_pkts_out <
> +            (uint16_t)(pkt_in->pkt_len - header_len)))
> +                return -EINVAL;
> +
> +        in_seg = pkt_in;
> +        in_seg_data_pos = header_len;
> +        out_pkt_pos = 0;
> +        fragment_offset = 0;
> +
> +        ipopt_len = header_len - sizeof(struct rte_ipv4_hdr);
> +        if (unlikely(ipopt_len > RTE_IPV4_HDR_OPT_MAX_LEN))
> +                return -EINVAL;
> +
> +        more_in_segs = 1;
> +        while (likely(more_in_segs)) {
> +                struct rte_mbuf *out_pkt = NULL;
> +                uint32_t more_out_segs;
> +                struct rte_ipv4_hdr *out_hdr;
> +
> +                /* Allocate buffer from pkt_in->pool */
> +                out_pkt = rte_pktmbuf_alloc(pkt_in->pool);

Instead of implicitly assuming that output mbufs will be allocated from pkt_
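For reference, a minimal sketch of how an application might call the proposed API (assuming `pkt_in`, `mtu`, `port_id` and `queue_id` already exist, and that the input mbuf is left for the caller to free, as with rte_ipv4_fragment_packet()):

    struct rte_mbuf *frags[RTE_LIBRTE_IP_FRAG_MAX_FRAG];
    int32_t nb_frags;
    uint16_t nb_tx;

    /* Each fragment is a freshly allocated copy from pkt_in->pool, so the
     * MBUF_FAST_FREE rules (same mempool, refcnt == 1) still hold on TX.
     */
    nb_frags = rte_ipv4_fragment_copy_packet(pkt_in, frags, RTE_DIM(frags), mtu);
    if (nb_frags < 0) {
            /* (-1) * errno, e.g. -ENOTSUP when the DF bit is set */
            rte_pktmbuf_free(pkt_in);
            return nb_frags;
    }
    rte_pktmbuf_free(pkt_in);
    nb_tx = rte_eth_tx_burst(port_id, queue_id, frags, (uint16_t)nb_frags);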
RE: [PATCH] doc: announce marking device and driver objects as internal
Hi David,

> -----Original Message-----
> From: David Marchand
> Sent: Sunday, July 10, 2022 2:18 PM
> To: dev@dpdk.org; techbo...@dpdk.org
> Cc: Ray Kinsella
> Subject: [PATCH] doc: announce marking device and driver objects as
> internal
>
> rte_driver and rte_device are unnecessarily exposed in the public API/ABI.
> Announce that they will be made opaque in the public API and mark the
> associated API as internal.
> This impacts all buses, as their driver registration mechanism will be
> made internal.
>
> Note: the PCI bus had a similar deprecation notice that we can remove as
> the new one is more generic.
>
> Signed-off-by: David Marchand
> ---
>  doc/guides/rel_notes/deprecation.rst | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index a9fd6676be..b9cc267b30 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -38,6 +38,13 @@ Deprecation Notices
>    external users may still register their bus using a new driver header (see
>    ``enable_driver_sdk`` meson option).
>
> +* drivers: As a followup on the work on the ``rte_bus`` object, the
> +  ``rte_driver`` and ``rte_device`` objects (and as a domino effect, their
> +  bus-specific counterparts) will be made opaque in DPDK 22.11.
> +  Registering a driver on a bus will be marked as an internal API:
> +  external users may still register their drivers using the bus specific
> +  driver header (see ``enable_driver_sdk`` meson option).
> +

Cc SPDK folks

Thanks for your work! My only concern is that using enable_driver_sdk may not
be a good way for SPDK, based on the discussion:
http://patchwork.dpdk.org/project/dpdk/cover/20210918022443.12719-1-chenbo@intel.com/

But overall this idea makes sense, so:

Acked-by: Chenbo Xia

> * mempool: Helper macro ``MEMPOOL_HEADER_SIZE()`` is deprecated and will
>   be removed in DPDK 22.11. The replacement macro
>   ``RTE_MEMPOOL_HEADER_SIZE()`` is internal only.
> @@ -49,11 +56,6 @@ Deprecation Notices
> * mempool: The mempool API macros ``MEMPOOL_PG_*`` are deprecated and
>   will be removed in DPDK 22.11.
>
> -* pci: To reduce unnecessary ABIs exposed by DPDK bus driver, "rte_bus_pci.h"
> -  will be made internal in 21.11 and macros/data structures/functions defined
> -  in the header will not be considered as ABI anymore. This change is inspired
> -  by the RFC https://patchwork.dpdk.org/project/dpdk/list/?series=17176.
> -
> * lib: will fix extending some enum/define breaking the ABI. There are multiple
>   samples in DPDK that enum/define terminated with a ``.*MAX.*`` value which is
>   used by iterators, and arrays holding these values are sized with this
> --
> 2.36.1
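As a concrete illustration of what moves behind ``enable_driver_sdk``: registering a driver on a bus today looks roughly like the sketch below (an out-of-tree PCI driver; the ids and callbacks are placeholders, and the exact header set depends on the bus). This registration path is what the notice proposes to mark as internal API.

    #include <rte_bus_pci.h>    /* driver-only header, needs enable_driver_sdk */

    /* Placeholder probe/remove callbacks (definitions omitted). */
    static int my_probe(struct rte_pci_driver *drv, struct rte_pci_device *dev);
    static int my_remove(struct rte_pci_device *dev);

    static const struct rte_pci_id my_id_map[] = {
            { RTE_PCI_DEVICE(0x1234, 0xabcd) },     /* placeholder vendor/device */
            { .vendor_id = 0 },                     /* sentinel */
    };

    static struct rte_pci_driver my_pci_drv = {
            .id_table = my_id_map,
            .probe = my_probe,
            .remove = my_remove,
    };

    RTE_PMD_REGISTER_PCI(net_my_pmd, my_pci_drv);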
Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> -----Original Message-----
> From: Feifei Wang
> Sent: Wednesday, July 6, 2022 7:36 PM
> To: 'Konstantin Ananyev'; 'Yuying Zhang'; 'Beilei Xing'; Ruifeng Wang
> Cc: 'dev@dpdk.org'; nd; Honnappa Nagarahalli; nd; nd
> Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
>
> > -----Original Message-----
> > From: Feifei Wang
> > Sent: Wednesday, July 6, 2022 4:53 PM
> > To: Konstantin Ananyev; Yuying Zhang; Beilei Xing; Ruifeng Wang
> > Cc: dev@dpdk.org; nd; Honnappa Nagarahalli; nd
> > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> > > -----Original Message-----
> > > From: Konstantin Ananyev
> > > Sent: Sunday, July 3, 2022 8:20 PM
> > > To: Feifei Wang; Yuying Zhang; Beilei Xing; Ruifeng Wang
> > > Cc: dev@dpdk.org; nd; Honnappa Nagarahalli
> > > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> > >
> > > > Refer to "i40e_tx_free_bufs_avx512", this patch puts the mempool cache
> > > > out of the API to free buffers directly. There are two changes
> > > > compared with the previous version:
> > > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > > 2. put the cache out of the "mempool_bulk" API to copy buffers into it
> > > > directly
> > > >
> > > > Performance test with the l3fwd neon path, with this patch:
> > > > n1sdp: no performance change
> > > > ampere-altra: +4.0%
> > > >
> > Thanks for your detailed comments.
> >
> > > Thanks for RFC, appreciate your effort.
> > > So, as I understand - bypassing mempool put/get itself gives about
> > > 7-10% speedup for RX/TX on ARM platforms, correct?
> > [Feifei] Yes.

[Feifei] Sorry, I need to correct this. Actually, according to our tests in
the direct-rearm cover letter, the improvement is 7% to 14% on N1SDP and
14% to 17% on Ampere Altra.

> > >
> > > About the direct-rearm RX approach you propose:
> > > After another thought, probably it is possible to re-arrange it in a
> > > way that would help avoid related negatives.
> > > The basic idea is as follows:
> > >
> > > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> > > Also make sw_ring de-coupled from RXQ itself, i.e:
> > > when RXQ is stopped or even destroyed, related sw_ring may still
> > > exist (probably ref-counter or RCU would be sufficient here).
> > > All that means we need a common layout/api for rxq_sw_ring
> > > and PMDs that would like to support direct-rearming will have to
> > > use/obey it.
> > [Feifei] De-coupling the sw-ring and RXQ may cause a dangerous case: the
> > RXQ is stopped but its elements (the sw-ring) are still kept, and we may
> > forget to free this sw-ring in the end.
> > Furthermore, if we apply this, we need to separate the operations when
> > closing the RXQ and add an Rx sw-ring free operation when closing the
> > TXQ. This will be complex, and it is not conducive to subsequent
> > maintenance if a maintainer does not understand direct-rearm mode very
> > well.
> >
> > > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> > > at txq_free_bufs() try to store released mbufs inside attached
> > > sw_ring directly. If there is no attached sw_ring, or not enough
> > > free space in it - continue with mempool_put() as usual.
> > > Note that actual arming of HW RXDs still remains responsibility
> > > of RX code-path:
> > > rxq_rearm(rxq) {
> > >   ...
> > >   - check whether there are N already filled entries inside rxq_sw_ring.
> > >     if not, populate them from mempool (usual mempool_get()).
> > >   - arm related RXDs and mark these sw_ring entries as managed by HW.
> > >   ...
> > > }
> > >
> > [Feifei] We tried to create two modes: one is direct-rearm and the other
> > is direct-free, as above.
> > By performance comparison, we selected direct-rearm, which improves
> > performance by 7% - 14%, compared with 3.6% - 7% for direct-free, on
> > N1SDP.
> > Furthermore, I think putting the direct mode in Tx or Rx is equivalent.
> > For direct-rearm, if there is no Tx sw-ring, Rx will get mbufs from the
> > mempool. For direct-free, if there is no Rx sw-ring, Tx will put mbufs
> > into the mempool. In the end, what affects our decision-making is the
> > improvement in performance.
> >
> > > So rxq_sw_ring will serve two purposes:
> > > - track mbufs that are managed by HW (that's what it does now)
> > > - private (per RXQ) mbuf cache
> > >
> > > Now, if TXQ is stopped while RXQ is running - no extra
> > > synchronization is required, RXQ would just use
> > > mempool_get() to rearm its sw_ring itself.
> > >
> > > If RXQ is stopped while TXQ is still running - TXQ can still
> > > continue to populate related sw_ring till it gets full.
> > > Then it will continue with mempool_put() as usual.
> > > Of course it means that a user who wants to use this feature should
> > > probably account some extra mbufs for such a case, or maybe
> > > rxq_sw_ring can have an enable/disable flag to mitigate such a
> > > situation.
> >
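To make the trade-off discussed above concrete, here is a rough sketch of the TX-side store described in point 2, using hypothetical names for the shared ring layout and its helpers (none of these names exist in the RFC; the fallback is the normal mempool put):

    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    struct rxq_sw_ring;     /* hypothetical common layout shared with the RX queue */
    struct txq { struct rxq_sw_ring *attached; };

    /* Hypothetical helpers over the shared ring layout. */
    unsigned int rxq_sw_ring_free_count(struct rxq_sw_ring *r);
    void rxq_sw_ring_put(struct rxq_sw_ring *r, struct rte_mbuf **mbufs, unsigned int n);

    static inline void
    txq_free_bufs(struct txq *txq, struct rte_mbuf **mbufs, unsigned int n)
    {
            struct rxq_sw_ring *r = txq->attached;

            if (r != NULL && rxq_sw_ring_free_count(r) >= n) {
                    /* Park freed mbufs for the RX path; the HW RXDs are
                     * still armed later by rxq_rearm() on the RX side.
                     */
                    rxq_sw_ring_put(r, mbufs, n);
            } else {
                    /* No attached ring, or not enough room: usual path. */
                    rte_mempool_put_bulk(mbufs[0]->pool, (void **)mbufs, n);
            }
    }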
RE: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> > Refer to "i40e_tx_free_bufs_avx512", this patch puts the mempool cache
> > out of the API to free buffers directly. There are two changes compared
> > with the previous version:
> > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > 2. put the cache out of the "mempool_bulk" API to copy buffers into it
> > directly
> >
> > Performance test with the l3fwd neon path, with this patch:
> > n1sdp: no performance change
> > ampere-altra: +4.0%
> >
>
> Thanks for RFC, appreciate your effort.
> So, as I understand - bypassing mempool put/get itself gives about 7-10%
> speedup for RX/TX on ARM platforms, correct?
It is 7% to 14% on N1SDP and 14% to 17% on Ampere machines.
The current RFC needs to be converted to a patch (albeit with zero-copy
mempool APIs) as it applies to a more generic use case.

>
> About the direct-rearm RX approach you propose:
> After another thought, probably it is possible to re-arrange it in a way
> that would help avoid related negatives.
Thanks for the idea.

> The basic idea is as follows:
>
> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> Also make sw_ring de-coupled from RXQ itself, i.e:
> when RXQ is stopped or even destroyed, related sw_ring may still
> exist (probably ref-counter or RCU would be sufficient here).
> All that means we need a common layout/api for rxq_sw_ring
> and PMDs that would like to support direct-rearming will have to
> use/obey it.
This would mean we will have an additional for loop to fill the descriptors
on the RX side.
Maybe we could keep the queue when the RXQ is stopped, and free it only when
the RXQ is destroyed; then we don't have to allocate one more if the RXQ is
started back again. This means the mbufs will be in the sw_ring if the RXQ is
stopped but not destroyed (same as what you have mentioned below).

>
> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> at txq_free_bufs() try to store released mbufs inside attached
> sw_ring directly. If there is no attached sw_ring, or not enough
> free space in it - continue with mempool_put() as usual.
> Note that actual arming of HW RXDs still remains responsibility
What problems do you see if we arm the HW RXDs (assuming that we do not have
to support this across multiple devices)?

> of RX code-path:
> rxq_rearm(rxq) {
>   ...
>   - check whether there are N already filled entries inside rxq_sw_ring.
>     if not, populate them from mempool (usual mempool_get()).
>   - arm related RXDs and mark these sw_ring entries as managed by HW.
>   ...
> }
>
>
> So rxq_sw_ring will serve two purposes:
> - track mbufs that are managed by HW (that's what it does now)
> - private (per RXQ) mbuf cache
>
> Now, if TXQ is stopped while RXQ is running - no extra synchronization is
> required, RXQ would just use mempool_get() to rearm its sw_ring itself.
There would be some synchronization required which tells the data plane
threads not to access the TXQ before the TXQ is stopped. Other than this, no
extra synchronization is required.

For the current patch, we could use a similar approach, i.e. when the TXQ is
stopped, it does not free the mbufs from the sw_ring; we just let the RXQ
consume all the mbufs from the TX side sw_ring till it is empty.

>
> If RXQ is stopped while TXQ is still running - TXQ can still continue to
> populate related sw_ring till it gets full.
> Then it will continue with mempool_put() as usual.
> Of course it means that a user who wants to use this feature should
> probably account some extra mbufs for such a case, or maybe rxq_sw_ring
> can have an enable/disable flag to mitigate such a situation.
>
> As another benefit here - such an approach makes it possible to use several
> TXQs (even from different devices) to rearm the same RXQ.
Being able to use several TXQs is a practical use case. But I am not sure
that different devices in the same server are a practical scenario that we
need to address.

>
> Have to say, I am still not sure that a 10% RX/TX improvement is worth
> bypassing the mempool completely and introducing all this extra complexity
> in the RX/TX path.
It is more, as clarified above.

> But, if we'll still decide to go ahead with direct-rearming, this
> re-arrangement, I think, should help to keep things clear and avoid
> introducing new limitations in existing functionality.
>
> WDYT?
I had another idea. This requires the application to change, which might be
OK given that it is a new feature/optimization.

We could expose a new API to the application which allows the RX side to take
buffers from the TX side. The application would call this additional API just
before calling the eth_rx_burst API. The advantages I see with this approach
are:
1) The static mapping of the ports is not required. The mapping is dynamic.
This allows the application to make decisions based on current conditions in
the data plane.
2) Code is simple, because it does not have to check for many condi
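To illustrate the proposed application-level call, a sketch of the polling loop with such an API, using a made-up name and signature (nothing like this exists yet; `rx_port`, `rx_queue`, `tx_port`, `tx_queue`, `pkts`, `nb_rx` and `MAX_PKT_BURST` are the application's own):

    /* Hypothetical call: move free mbufs from the TX queue's sw_ring into
     * the RX queue before polling; internally it would fall back to the
     * mempool when the TX side has nothing to give back.
     */
    rte_eth_rx_direct_rearm(rx_port, rx_queue, tx_port, tx_queue);
    nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, MAX_PKT_BURST);

Since the mapping is passed on every call, the application can pick a different TX queue each time, which matches advantage 1) above.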