DPDK - AF_XDP mode running with non-root privileges
Hi Team,

We are currently exploring running our DPDK application with AF_XDP in non-root mode. When the DPDK application brings up AF_XDP, it calls xsk_umem__create(), which requires the CAP_IPC_LOCK capability. If the application does not have CAP_IPC_LOCK, buffer allocation fails; the kernel reports ENOBUFS when xsk_umem__create() is invoked without CAP_IPC_LOCK.

Can we use AF_XDP in our DPDK application without any capability, i.e. in fully non-root mode? Please advise.

Code pointers (the DPDK logs show eth_rx_queue_setup in rte_eth_af_xdp.c):

https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1346
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1380
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1257
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1065

Please see the DPDK logs below:

12-23-2022 08:40:42.401426 EAL: Restoring previous memory policy: 0
12-23-2022 08:40:42.401443 EAL: request: mp_malloc_sync
12-23-2022 08:40:42.401447 EAL: No shared files mode enabled, IPC is disabled
12-23-2022 08:40:42.401451 EAL: Heap on socket 0 was expanded by 32MB
12-23-2022 08:40:42.402929 eth_rx_queue_setup(): Set up rx queue, rx queue id: 0, xsk queue id: 0
12-23-2022 08:40:42.402973 xdp_umem_configure(): Failed to create umem
12-23-2022 08:40:42.402980 xdp_umem_configure(): ret, error number and string -105, 105, No buffer space available
12-23-2022 08:40:42.402986 eth_rx_queue_setup(): Failed to configure xdp socket
12-23-2022 08:40:42.403013 eth_rx_queue_setup(): Set up rx queue, rx queue id: 0, xsk queue id: 0
12-23-2022 08:40:42.403030 xdp_umem_configure(): Failed to create umem
12-23-2022 08:40:42.403037 xdp_umem_configure(): ret, error number and string -105, 105, No buffer space available
12-23-2022 08:40:42.403042 eth_rx_queue_setup(): Failed to configure xdp socket
12-23-2022 08:40:42.403065 eth_rx_queue_setup(): Set up rx queue, rx queue id: 0, xsk queue id: 0
12-23-2022 08:40:42.403080 xdp_umem_configure(): Failed to create umem
12-23-2022 08:40:42.403085 xdp_umem_configure(): ret, error number and string -105, 105, No buffer space available

Thanks,
Kamaraj
Re: [PATCH v1 1/2] net/axgbe: add multi-process support
On 12/21/2022 2:52 AM, Jesna K E wrote:
> +/* Takes ethdev as parameter
> + * Used in dev_start by primary process and then
> + * in dev_init by secondary process when attaching to an existing ethdev.
> + */
> +void
> +axgbe_set_tx_function(struct rte_eth_dev *dev)
> +{
> + struct axgbe_port *pdata = dev->data->dev_private;
> + struct axgbe_tx_queue *txq = dev->data->tx_queues[0];
> +
> + if (pdata->multi_segs_tx)
> + dev->tx_pkt_burst = &axgbe_xmit_pkts_seg;
> +#ifdef RTE_ARCH_X86
> + if (!txq->vector_disable &&
> + rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
> + dev->tx_pkt_burst = &axgbe_xmit_pkts_vec;
> +#else
> + dev->tx_pkt_burst = &axgbe_xmit_pkts;
> +#endif
> +}

'txq' is used only for 'RTE_ARCH_X86', which is why it gives an "unused variable" warning on other architectures. Can you please declare 'txq' within the 'RTE_ARCH_X86' macro?
Re: [PATCH v1 1/2] net/axgbe: add multi-process support
On 12/23/2022 10:44 AM, Ferruh Yigit wrote:
> On 12/21/2022 2:52 AM, Jesna K E wrote:
>> +/* Takes ethdev as parameter
>> + * Used in dev_start by primary process and then
>> + * in dev_init by secondary process when attaching to an existing ethdev.
>> + */
>> +void
>> +axgbe_set_tx_function(struct rte_eth_dev *dev)
>> +{
>> +struct axgbe_port *pdata = dev->data->dev_private;
>> +struct axgbe_tx_queue *txq = dev->data->tx_queues[0];
>> +
>> +if (pdata->multi_segs_tx)
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts_seg;
>> +#ifdef RTE_ARCH_X86
>> +if (!txq->vector_disable &&
>> +rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts_vec;
>> +#else
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts;

btw, indentation of this line looks wrong

>> +#endif
>> +}
>
> 'txq' is used only for 'RTE_ARCH_X86', that is why it gives "unused
> variable" warning for it.
> Can you please declare 'txq' within 'RTE_ARCH_X86' macro?
[RFC PATCH] graph: add support for pcap trace for graph
Packets are captured at each graph node with the node and mbuf metadata as part of the pcap. This is inspired by VPP.

Signed-off-by: Amit Prakash Shukla
---
 doc/guides/sample_app_ug/l3_forward_graph.rst |   3 +
 examples/l3fwd-graph/main.c                   |  12 +
 lib/graph/graph_pcap_trace.c                  | 337 ++
 lib/graph/graph_populate.c                    |   6 +-
 lib/graph/meson.build                         |   5 +-
 lib/graph/rte_graph_pcap_trace.h              | 149
 lib/graph/rte_graph_worker.h                  |   6 +
 7 files changed, 515 insertions(+), 3 deletions(-)
 create mode 100644 lib/graph/graph_pcap_trace.c
 create mode 100644 lib/graph/rte_graph_pcap_trace.h

diff --git a/doc/guides/sample_app_ug/l3_forward_graph.rst b/doc/guides/sample_app_ug/l3_forward_graph.rst
index 0a3e0d44ec..cf199bcf81 100644
--- a/doc/guides/sample_app_ug/l3_forward_graph.rst
+++ b/doc/guides/sample_app_ug/l3_forward_graph.rst
@@ -51,6 +51,7 @@ The application has a number of command line options similar to l3fwd::
     [--max-pkt-len PKTLEN]
     [--no-numa]
     [--per-port-pool]
+    [--pcap-enable]

 Where,

@@ -69,6 +70,8 @@ Where,
 * ``--per-port-pool:`` Optional, set to use independent buffer pools per port. Without this option, single buffer pool is used for all ports.

+* ``--pcap-enable:`` Optional, enables packet capture in pcap format on each node with mbuf and node metadata.
+
 For example, consider a dual processor socket platform with 8 physical cores, where cores 0-7 and 16-23 appear on socket 0, while cores 8-15 and 24-31 appear on socket 1.
diff --git a/examples/l3fwd-graph/main.c b/examples/l3fwd-graph/main.c
index 6dcb6ee92b..b6408310aa 100644
--- a/examples/l3fwd-graph/main.c
+++ b/examples/l3fwd-graph/main.c
@@ -404,6 +404,7 @@ static const char short_options[] = "p:" /* portmask */
 #define CMD_LINE_OPT_NO_NUMA "no-numa"
 #define CMD_LINE_OPT_MAX_PKT_LEN "max-pkt-len"
 #define CMD_LINE_OPT_PER_PORT_POOL "per-port-pool"
+#define CMD_LINE_OPT_PCAP_ENABLE "pcap-enable"

 enum {
 	/* Long options mapped to a short option */
@@ -416,6 +417,7 @@ enum {
 	CMD_LINE_OPT_NO_NUMA_NUM,
 	CMD_LINE_OPT_MAX_PKT_LEN_NUM,
 	CMD_LINE_OPT_PARSE_PER_PORT_POOL,
+	CMD_LINE_OPT_PARSE_PCAP_ENABLE,
 };

 static const struct option lgopts[] = {
@@ -424,6 +426,7 @@ static const struct option lgopts[] = {
 	{CMD_LINE_OPT_NO_NUMA, 0, 0, CMD_LINE_OPT_NO_NUMA_NUM},
 	{CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, CMD_LINE_OPT_MAX_PKT_LEN_NUM},
 	{CMD_LINE_OPT_PER_PORT_POOL, 0, 0, CMD_LINE_OPT_PARSE_PER_PORT_POOL},
+	{CMD_LINE_OPT_PCAP_ENABLE, 0, 0, CMD_LINE_OPT_PARSE_PCAP_ENABLE},
 	{NULL, 0, 0, 0},
 };

@@ -498,6 +501,11 @@ parse_args(int argc, char **argv)
 			per_port_pool = 1;
 			break;

+		case CMD_LINE_OPT_PARSE_PCAP_ENABLE:
+			printf("Packet capture enabled\n");
+			set_pcap_trace(1);
+			break;
+
 		default:
 			print_usage(prgname);
 			return -1;

@@ -831,6 +839,7 @@ main(int argc, char **argv)
 			local_port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

+		local_port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_RSS_HASH;
 		local_port_conf.rx_adv_conf.rss_conf.rss_hf &= dev_info.flow_type_rss_offloads;
 		if (local_port_conf.rx_adv_conf.rss_conf.rss_hf !=

@@ -1116,6 +1125,9 @@ main(int argc, char **argv)
 	}
 	/* >8 End of adding route to ip4 graph infa.
 	*/
+	if (is_pcap_trace_enable())
+		rte_graph_pcap_trace_init();
+
 	/* Launch per-lcore init on every worker lcore */
 	rte_eal_mp_remote_launch(graph_main_loop, NULL, SKIP_MAIN);

diff --git a/lib/graph/graph_pcap_trace.c b/lib/graph/graph_pcap_trace.c
new file mode 100644
index 00..c5be07de6d
--- /dev/null
+++ b/lib/graph/graph_pcap_trace.c
@@ -0,0 +1,337 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+#include "rte_graph_worker.h"
+
+#define RTE_PCAP_MAJOR_VERSION 1
+#define RTE_PCAP_MINOR_VERSION 0
+#define MAX_PKT_TO_CAPTURE 200
+#define MAX_PCAP_BUF_SZ 2048
+
+#define PCAP_DUMP_STR(buf, buf_size, cur_len, ...) \
+do { \
+	if (cur_len >= buf_size) \
+		break; \
+	cur_len += snprintf(buf + cur_len, buf_size - cur_len, __VA_ARGS__);
RE: Strange behavior with rte_pktmbuf_clone call
That's expected behavior. rte_pktmbuf_clone() doesn't duplicate your packet data - instead, the new mbuf just attaches to the existing packet data. If you need to modify one (or both) packets' data after cloning, you can use rte_pktmbuf_copy() instead. As another alternative, have a look at examples/ipv4_multicast - it demonstrates how to send the same packet to multiple IP destinations over multiple ports without modifying the entire packet. It looks like you are trying to do something quite similar.

Konstantin

From: NAGENDRA BALAGANI
Sent: Friday, December 23, 2022 7:01 AM
To: dev@dpdk.org
Cc: Kapil Kumar Jain
Subject: Strange behavior with rte_pktmbuf_clone call

Hi,

I am seeing strange behavior where rte_pktmbuf_clone is not giving the desired result.

Here is the detailed info: in my DPDK application, once I receive the packet in an mbuf, I need to send the same packet to two destinations. The sequence I should follow is:

(i) First, tunnel the packet to one of the desired destinations, so I created a shallow copy using rte_pktmbuf_clone, used another mbuf for the outer IP header of the IPinIP tunnel, and sent it to the NIC.

(ii) Second, I need to modify the source and destination IP addresses of the packet and send it out.

The issue I am seeing is that the tunneled packet (clone) has the modified IP addresses from (ii).

Code flow:

main()
{
	struct rte_mbuf *org_mbuf; /* let's assume org_mbuf is holding the packet info */

	/* (i) Towards the first destination. */
	build_tunnel_packet(org_mbuf)
	{
		struct rte_mbuf *clone_buffer;

		/* Allocate a clone buffer. */
		clone_buffer = rte_pktmbuf_clone(org_mbuf, clone_pool);

		/* Construct IPinIP info in another mbuf and prepend it to clone_buffer. */
		/* Call rte_pktmbuf_tx_burst(). */
	}
	/* (ii) Towards another destination. */
	modify_l3_and_route(org_mbuf)
	{
		/* Modify L3 information of 'org_mbuf' */
		/* and call rte_pkt_mbuf_tx_burst(). */
	}
}

(screenshot of the capture omitted)

In the above screenshot, packet 37 should be tunneled as-is by adding the outer IP layer (i.e. 182.16.146.*), but the inner L3 information is also getting changed (which I am modifying in the second step) for some packets.

Using rte_pktmbuf_copy() solves the issue, but at the expense of an extra mbuf. Please help me understand what is wrong in the rte_pktmbuf_clone() case.

Regards,
Nagendra
RE: [PATCH v2] mempool: micro-optimize put function
> > Hi Morten,
> >
> > > PING mempool maintainers.
> > >
> > > If you don't provide ACK or Review to patches, how should Thomas know
> > > that it is ready for merging?
> > >
> > > -Morten
> >
> > The code changes itself look ok to me.
> > Though I don't think it would really bring any noticeable speedup.
> > But yes, it looks a bit nicer this way, especially with extra comments.
>
> Agree, removing the compare and branch instructions from the likely code path provides no noticeable speedup, but makes the code more clean.
>
> > One question though, why do you feel this assert:
> > RTE_ASSERT(cache->len <= cache->flushthresh);
> > is necessary?
>
> I could have written it into the comment above it, but instead chose to add the assertion as an improved comment.
>
> > I can't see any way it could happen, unless something is totally broken
> > within the app.
>
> Agree. These are the circumstances where assertions can come into action. With more assertions in the code, it is likely that less code executes before hitting an assertion, making it easier to find the root cause of such errors.
>
> Please note that RTE_ASSERTs are omitted unless built with RTE_ENABLE_ASSERT, so this assertion is omitted and thus has no performance impact for a normal build.

I am aware that RTE_ASSERT is not enabled by default. My question was more about why you feel the assert() is necessary in this case in particular - does it guard against some specific scenario? Anyway, I am ok either way, with or without the assert() here.

Acked-by: Konstantin Ananyev
Re: [RFC PATCH 1/7] lib: add helper to read strings from sysfs files
On Fri, 23 Dec 2022 00:24:29 +0100
Tomasz Duszynski wrote:

> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..2dbc10e6be 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,9 @@ EXPERIMENTAL {
>  	rte_thread_detach;
>  	rte_thread_equal;
>  	rte_thread_join;
> +
> +	# added in 23.03
> +	eal_parse_sysfs_string; # WINDOWS_NO_EXPORT
>  };

This should be internal, not exported as user API.
Re: Strange behavior with rte_pktmbuf_clone call
On Fri, 23 Dec 2022 07:00:40 +
NAGENDRA BALAGANI wrote:

> Hi,
>
> I am seeing strange behavior where rte_pktmbuf_clone is not giving desired result.
> Here is the detailed info, in my dpdk application, once I received the packet info in mbuf, I need to send the same packet to two destinations, the sequence I should follow is,
>
> (i) First, Tunnel the packet to one of desired destination, so I created the shallow copy using rte_pktmbuf_clone, had another mbuf for Outer IP Header for IPinIP tunnel and sent to NIC.
>
> (ii) Second, I need to modify the source and destination ip addresses of the packet and send out.

As you showed, that won't work the way you think. A shallow clone means that both versions share the same data area; if you modify one instance, both change. The best practice is to treat a clone (or any mbuf with refcount > 1) as read only.
RE: [PATCH v2] mempool cache: add zero-copy get and put functions
> > From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> > Sent: Thursday, 22 December 2022 16.57
> >
> > > Zero-copy access to mempool caches is beneficial for PMD performance, and
> > > must be provided by the mempool library to fix [Bug 1052] without a
> > > performance regression.
> >
> > LGTM in general, thank you for working on it.
> > Few comments below.
> >
> > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > >
> > > v2:
> > > * Fix checkpatch warnings.
> > > * Fix missing registration of trace points.
> > > * The functions are inline, so they don't go into the map file.
> > > v1 changes from the RFC:
> > > * Removed run-time parameter checks. (Honnappa)
> > >   This is a hot fast path function; requiring correct application
> > >   behaviour, i.e. function parameters must be valid.
> > > * Added RTE_ASSERT for parameters instead.
> >
> > RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > I think it is too excessive.
> > Just:
> > if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
> > seems much more convenient for the users here and
> > closer to other mempool/ring API behavior.
> > In terms of performance - I don't think one extra comparison here
> > would really count.
>
> The insignificant performance degradation seems like a good tradeoff for making the function more generic.
> I will update the function documentation and place the run-time check here, so both trace and stats reflect what happened:
>
> 	RTE_ASSERT(cache != NULL);
> 	RTE_ASSERT(mp != NULL);
> -	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>
> 	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> +
> +	if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) {
> +		rte_errno = -ENOSPC; // Or EINVAL?
> +		return NULL;
> +	}
>
> 	/* Increment stats now, adding in mempool always succeeds. */
>
> I will probably also be able to come up with a solution for zc_get_bulk(), so both trace and stats make sense if called with n > RTE_MEMPOOL_CACHE_MAX_SIZE.
> > I also think it would be really good to add:
> > zc_(get|put)_bulk_start(), zc_(get|put)_bulk_finish().
> > Where _start would check/fill the cache and return the pointer,
> > while _finish would update cache->len.
> > Similar to what we have for the rte_ring _peek_ API.
> > That would allow to extend this API usage - let's say inside PMDs
> > it could be used not only for the MBUF_FAST_FREE case, but for the generic
> > TX code path (one that has to call rte_mbuf_prefree()) also.
>
> I don't see a use case for zc_get_start()/_finish().
>
> And since the mempool cache is a stack, it would *require* that the application reads the array in reverse order. In such case, the function should not return a pointer to the array of objects, but a pointer to the top of the stack.
>
> So I prefer to stick with the single-function zero-copy get, i.e. without start/finish.

Yes, it would be more complicated than just updating cache->len.
I don't have any real use-case for _get_ either - it was mostly just for symmetry with put.

> I do agree with you about the use case for zc_put_start()/_finish().
>
> Unlike the ring, there is no need for locking with the mempool cache, so we can implement something much simpler:
>
> Instead of requiring calling both zc_put_start() and _finish() for every zero-copy burst, we could add a zc_put_rewind() function, only to be called if some number of objects were not added anyway:
>
> /* FIXME: Function documentation here. */
> __rte_experimental
> static __rte_always_inline void
> rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache,
> 		unsigned int n)
> {
> 	RTE_ASSERT(cache != NULL);
> 	RTE_ASSERT(n <= cache->len);
>
> 	rte_mempool_trace_cache_zc_put_rewind(cache, n);
>
> 	/* Rewind stats.
> 	 */
> 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n);
>
> 	cache->len -= n;
> }
>
> I have a strong preference for _rewind() over _start() and _finish(), because in the full burst case, it only touches the rte_mempool_cache structure once, whereas splitting it up into _start() and _finish() touches the rte_mempool_cache structure both before and after copying the array of objects.
>
> What do you think?

And your concern is that between _get_start(_C_) and _get_finish(_C_) the _C_ cache line can be bumped out of the CPU D-cache, right? I don't think such a situation would be a common one. But if you think _rewind_ is a better approach, I am ok with it.

> I am open to other names than _rewind(), so feel free to speak up if you have a better name.
>
> > > Code for this is only generated if built with RTE_ENABLE_ASSERT.
> > > * Removed fallback when 'cache' parameter is not set. (Honnappa)
> > > * Chose the simple get function; i.e. do not move the existing objects in
> > >   the cache to the top of the new stack, just leave them at the bottom.
> > > * Renamed the functions. Other