DPDK - AF_XDP mode running with non-root privileges

2022-12-23 Thread Kamaraj P
Hi Team,

Currently we are exploring running our DPDK application with af_xdp in
non-root mode.
When the DPDK application invokes AF_XDP, it calls xsk_umem__create(),
which requires the CAP_IPC_LOCK capability to be enabled. If the
application does not have the CAP_IPC_LOCK capability, it fails in the
buffer allocation.

It looks like the kernel reports ENOBUFS when xsk_umem__create() is
invoked without CAP_IPC_LOCK.
Can we use AF_XDP without any capability (i.e. run in non-root mode) in
our DPDK application?
Please advise us.
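
One workaround we are considering (our assumption, not a confirmed fix:
on kernels that account umem pages against the locked-memory limit, a
sufficient RLIMIT_MEMLOCK may stand in for CAP_IPC_LOCK):

#include <sys/resource.h>

/* Hedged sketch: raise the locked-memory limit before the AF_XDP PMD
 * calls xsk_umem__create(). Without CAP_SYS_RESOURCE a process can
 * only raise its soft limit up to the existing hard limit, so the hard
 * limit must already be set high enough (e.g. via ulimit -l or
 * systemd's LimitMEMLOCK). */
static int
raise_memlock_limit(void)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	return setrlimit(RLIMIT_MEMLOCK, &r);
}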

Please see the code pointers.
The DPDK logs show the failure in eth_rx_queue_setup() in rte_eth_af_xdp.c:

https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1346
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1380
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1257
https://elixir.bootlin.com/dpdk/v21.11/source/drivers/net/af_xdp/rte_eth_af_xdp.c#L1065

Please see the DPDK logs below:
12-23-2022  08:40:42.401426 EAL: Restoring previous memory policy: 0
12-23-2022  08:40:42.401443 EAL: request: mp_malloc_sync
12-23-2022  08:40:42.401447 EAL: No shared files mode enabled, IPC is
disabled
12-23-2022  08:40:42.401451 EAL: Heap on socket 0 was expanded by 32MB
12-23-2022  08:40:42.402929 eth_rx_queue_setup(): Set up rx queue, rx queue
id: 0, xsk queue id: 0
12-23-2022  08:40:42.402973 xdp_umem_configure(): Failed to create umem
12-23-2022  08:40:42.402980 xdp_umem_configure(): ret, error number and
string -105, 105, No buffer space available
12-23-2022  08:40:42.402986 eth_rx_queue_setup(): Failed to configure xdp
socket
12-23-2022  08:40:42.403013 eth_rx_queue_setup(): Set up rx queue, rx queue
id: 0, xsk queue id: 0
12-23-2022  08:40:42.403030 xdp_umem_configure(): Failed to create umem
12-23-2022  08:40:42.403037 xdp_umem_configure(): ret, error number and
string -105, 105, No buffer space available
12-23-2022  08:40:42.403042 eth_rx_queue_setup(): Failed to configure xdp
socket
12-23-2022  08:40:42.403065 eth_rx_queue_setup(): Set up rx queue, rx queue
id: 0, xsk queue id: 0
12-23-2022  08:40:42.403080 xdp_umem_configure(): Failed to create umem
12-23-2022  08:40:42.403085 xdp_umem_configure(): ret, error number and
string -105, 105, No buffer space available

Thanks,
Kamaraj


Re: [PATCH v1 1/2] net/axgbe: add multi-process support

2022-12-23 Thread Ferruh Yigit
On 12/21/2022 2:52 AM, Jesna K E wrote:
> +/* Takes  ethdev as parameter
> + *  Used in dev_start by primary process and then
> + * in dev_init by secondary process when attaching to an existing ethdev.
> + */
> +void
> +axgbe_set_tx_function(struct rte_eth_dev *dev)
> +{
> + struct axgbe_port *pdata = dev->data->dev_private;
> + struct axgbe_tx_queue *txq = dev->data->tx_queues[0];
> +
> + if (pdata->multi_segs_tx)
> + dev->tx_pkt_burst = &axgbe_xmit_pkts_seg;
> +#ifdef RTE_ARCH_X86
> + if (!txq->vector_disable &&
> + rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
> + dev->tx_pkt_burst = &axgbe_xmit_pkts_vec;
> +#else
> + dev->tx_pkt_burst = &axgbe_xmit_pkts;
> +#endif
> +}

'txq' is used only for 'RTE_ARCH_X86', which is why it gives an "unused
variable" warning.
Can you please declare 'txq' within the 'RTE_ARCH_X86' macro?
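
Something like the following (a sketch of the suggested change, untested):

void
axgbe_set_tx_function(struct rte_eth_dev *dev)
{
	struct axgbe_port *pdata = dev->data->dev_private;
#ifdef RTE_ARCH_X86
	/* Only referenced on x86; declaring it here avoids the
	 * unused-variable warning on other architectures. */
	struct axgbe_tx_queue *txq = dev->data->tx_queues[0];
#endif

	if (pdata->multi_segs_tx)
		dev->tx_pkt_burst = &axgbe_xmit_pkts_seg;
#ifdef RTE_ARCH_X86
	if (!txq->vector_disable &&
	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
		dev->tx_pkt_burst = &axgbe_xmit_pkts_vec;
#else
	dev->tx_pkt_burst = &axgbe_xmit_pkts;
#endif
}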


Re: [PATCH v1 1/2] net/axgbe: add multi-process support

2022-12-23 Thread Ferruh Yigit
On 12/23/2022 10:44 AM, Ferruh Yigit wrote:
> On 12/21/2022 2:52 AM, Jesna K E wrote:
>> +/* Takes  ethdev as parameter
>> + *  Used in dev_start by primary process and then
>> + * in dev_init by secondary process when attaching to an existing ethdev.
>> + */
>> +void
>> +axgbe_set_tx_function(struct rte_eth_dev *dev)
>> +{
>> +struct axgbe_port *pdata = dev->data->dev_private;
>> +struct axgbe_tx_queue *txq = dev->data->tx_queues[0];
>> +
>> +if (pdata->multi_segs_tx)
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts_seg;
>> +#ifdef RTE_ARCH_X86
>> +if (!txq->vector_disable &&
>> +rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts_vec;
>> +#else
>> +dev->tx_pkt_burst = &axgbe_xmit_pkts;

btw, indentation of this line looks wrong

>> +#endif
>> +}
> 
> 'txq' is used only for 'RTE_ARCH_X86', which is why it gives an "unused
> variable" warning.
> Can you please declare 'txq' within the 'RTE_ARCH_X86' macro?



[RFC PATCH] graph: add support for pcap trace for graph

2022-12-23 Thread Amit Prakash Shukla
Packets are captured at each graph node, with the node and mbuf metadata
included in the pcap.

This is inspired by VPP.
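
Example usage (illustrative command line following the existing
l3fwd-graph documentation; only the --pcap-enable option is added by
this patch):

	./dpdk-l3fwd-graph -l 1,2 -n 4 -- -p 0x3 --config="(0,0,1),(1,0,2)" --pcap-enable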

Signed-off-by: Amit Prakash Shukla 
---
 doc/guides/sample_app_ug/l3_forward_graph.rst |   3 +
 examples/l3fwd-graph/main.c   |  12 +
 lib/graph/graph_pcap_trace.c  | 337 ++
 lib/graph/graph_populate.c|   6 +-
 lib/graph/meson.build |   5 +-
 lib/graph/rte_graph_pcap_trace.h  | 149 
 lib/graph/rte_graph_worker.h  |   6 +
 7 files changed, 515 insertions(+), 3 deletions(-)
 create mode 100644 lib/graph/graph_pcap_trace.c
 create mode 100644 lib/graph/rte_graph_pcap_trace.h

diff --git a/doc/guides/sample_app_ug/l3_forward_graph.rst 
b/doc/guides/sample_app_ug/l3_forward_graph.rst
index 0a3e0d44ec..cf199bcf81 100644
--- a/doc/guides/sample_app_ug/l3_forward_graph.rst
+++ b/doc/guides/sample_app_ug/l3_forward_graph.rst
@@ -51,6 +51,7 @@ The application has a number of command line options similar 
to l3fwd::
[--max-pkt-len PKTLEN]
[--no-numa]
[--per-port-pool]
+   [--pcap-enable]
 
 Where,
 
@@ -69,6 +70,8 @@ Where,
 
 * ``--per-port-pool:`` Optional, set to use independent buffer pools per port. 
Without this option, single buffer pool is used for all ports.
 
+* ``--pcap-enable:`` Optional, enables packet capture in pcap format on each 
node with mbuf and node metadata.
+
 For example, consider a dual processor socket platform with 8 physical cores, 
where cores 0-7 and 16-23 appear on socket 0,
 while cores 8-15 and 24-31 appear on socket 1.
 
diff --git a/examples/l3fwd-graph/main.c b/examples/l3fwd-graph/main.c
index 6dcb6ee92b..b6408310aa 100644
--- a/examples/l3fwd-graph/main.c
+++ b/examples/l3fwd-graph/main.c
@@ -404,6 +404,7 @@ static const char short_options[] = "p:" /* portmask */
 #define CMD_LINE_OPT_NO_NUMA  "no-numa"
 #define CMD_LINE_OPT_MAX_PKT_LEN   "max-pkt-len"
 #define CMD_LINE_OPT_PER_PORT_POOL "per-port-pool"
+#define CMD_LINE_OPT_PCAP_ENABLE   "pcap-enable"
 enum {
/* Long options mapped to a short option */
 
@@ -416,6 +417,7 @@ enum {
CMD_LINE_OPT_NO_NUMA_NUM,
CMD_LINE_OPT_MAX_PKT_LEN_NUM,
CMD_LINE_OPT_PARSE_PER_PORT_POOL,
+   CMD_LINE_OPT_PARSE_PCAP_ENABLE,
 };
 
 static const struct option lgopts[] = {
@@ -424,6 +426,7 @@ static const struct option lgopts[] = {
{CMD_LINE_OPT_NO_NUMA, 0, 0, CMD_LINE_OPT_NO_NUMA_NUM},
{CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, CMD_LINE_OPT_MAX_PKT_LEN_NUM},
{CMD_LINE_OPT_PER_PORT_POOL, 0, 0, CMD_LINE_OPT_PARSE_PER_PORT_POOL},
+   {CMD_LINE_OPT_PCAP_ENABLE, 0, 0, CMD_LINE_OPT_PARSE_PCAP_ENABLE},
{NULL, 0, 0, 0},
 };
 
@@ -498,6 +501,11 @@ parse_args(int argc, char **argv)
per_port_pool = 1;
break;
 
+   case CMD_LINE_OPT_PARSE_PCAP_ENABLE:
+   printf("Packet capture enabled\n");
+   set_pcap_trace(1);
+   break;
+
default:
print_usage(prgname);
return -1;
@@ -831,6 +839,7 @@ main(int argc, char **argv)
local_port_conf.txmode.offloads |=
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
 
+   local_port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_RSS_HASH;
local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
dev_info.flow_type_rss_offloads;
if (local_port_conf.rx_adv_conf.rss_conf.rss_hf !=
@@ -1116,6 +1125,9 @@ main(int argc, char **argv)
}
/* >8 End of adding route to ip4 graph infa. */
 
+   if (is_pcap_trace_enable())
+   rte_graph_pcap_trace_init();
+
/* Launch per-lcore init on every worker lcore */
rte_eal_mp_remote_launch(graph_main_loop, NULL, SKIP_MAIN);
 
diff --git a/lib/graph/graph_pcap_trace.c b/lib/graph/graph_pcap_trace.c
new file mode 100644
index 00..c5be07de6d
--- /dev/null
+++ b/lib/graph/graph_pcap_trace.c
@@ -0,0 +1,337 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "rte_graph_worker.h"
+
+#define RTE_PCAP_MAJOR_VERSION 1
+#define RTE_PCAP_MINOR_VERSION 0
+#define MAX_PKT_TO_CAPTURE 200
+#define MAX_PCAP_BUF_SZ 2048
+
+#define PCAP_DUMP_STR(buf, buf_size, cur_len, ...)   \
+do { \
+   if (cur_len >= buf_size) \
+   break;   \
+   cur_len += snprintf(buf + cur_len, buf_size - cur_len, __VA_ARGS__);

RE: Strange behavior with rte_pktmbuf_clone call

2022-12-23 Thread Konstantin Ananyev
That's expected behavior.
rte_pktmbuf_clone() doesn't duplicate your packet data - instead the
new mbuf just attaches to the existing packet data.
If you need to modify one (or both) packets' data after cloning,
you can use rte_pktmbuf_copy() instead.
As another alternative, have a look at examples/ipv4_multicast -
it demonstrates how to send the same packet to multiple IP destinations
over multiple ports without copying the entire packet.
It looks like you are trying to do something quite similar.
Konstantin
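
A minimal sketch of the copy-based approach (the 'copy_pool' name is
just a placeholder, not from this thread):

#include <stdint.h>
#include <rte_mbuf.h>

/* Deep copy: duplicates the packet data, so the clone already queued
 * for the tunnel keeps the original bytes. UINT32_MAX means "copy from
 * 'offset' to the end of the packet". */
static struct rte_mbuf *
copy_for_rewrite(struct rte_mbuf *org_mbuf, struct rte_mempool *copy_pool)
{
	return rte_pktmbuf_copy(org_mbuf, copy_pool, 0, UINT32_MAX);
}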


From: NAGENDRA BALAGANI 
Sent: Friday, December 23, 2022 7:01 AM
To: dev@dpdk.org
Cc: Kapil Kumar Jain 
Subject: Strange behavior with rte_pktmbuf_clone call

Hi,

I am seeing strange behavior where rte_pktmbuf_clone() is not giving the
desired result.
Here is the detailed info: in my DPDK application, once I receive the
packet in an mbuf, I need to send the same packet to two destinations.
The sequence I should follow is:

(i) First, tunnel the packet to one of the desired destinations: I
created a shallow copy using rte_pktmbuf_clone(), used another mbuf for
the outer IP header of the IPinIP tunnel, and sent it to the NIC.

(ii) Second, modify the source and destination IP addresses of the
packet and send it out.

The issue I am seeing is that the tunneled packet (the clone) has the
modified IP addresses from (ii).

Code flow:

main()
{
	struct rte_mbuf *org_mbuf; /* org_mbuf holds the received packet */

	/* (i) Towards the first destination. */
	build_tunnel_packet(org_mbuf)
	{
		struct rte_mbuf *clone_buffer;

		/* Allocate a clone buffer (shallow copy of org_mbuf). */
		clone_buffer = rte_pktmbuf_clone(org_mbuf, clone_pool);

		/* Construct the IPinIP info in another mbuf, prepend it
		 * to clone_buffer, and transmit. */
		rte_eth_tx_burst(...);
	}

	/* (ii) Towards another destination. */
	modify_l3_and_route(org_mbuf)
	{
		/* Modify the L3 information of org_mbuf and transmit. */
		rte_eth_tx_burst(...);
	}
}

[Attached screenshot: packet capture showing tunneled packet 37, whose inner L3 addresses were unexpectedly modified.]

In the above screenshot, packet 37 should be tunneled as-is by adding the 
outer IP layer (i.e. 182.16.146.*), but the inner L3 information is also 
getting changed (which I am modifying in the second step) for some packets.
Using rte_pktmbuf_copy() solves the issue, but at the expense of an extra mbuf.


Please help me understand what is wrong in the case of 
rte_pktmbuf_clone().


Regards,
Nagendra




RE: [PATCH v2] mempool: micro-optimize put function

2022-12-23 Thread Konstantin Ananyev


> > Hi Morten,
> >
> > > PING mempool maintainers.
> > >
> > > If you don't provide ACK or Review to patches, how should Thomas know
> > that it is ready for merging?
> > >
> > > -Morten
> >
> > The code change itself looks ok to me.
> > Though I don't think it would really bring any noticeable speedup.
> > But yes, it looks a bit nicer this way, especially with extra comments.
> 
> Agree, removing the compare and branch instructions from the likely code path 
> provides no noticeable speedup, but makes the code
> cleaner.
> 
> > One question though, why do you feel this assert:
> > RTE_ASSERT(cache->len <= cache->flushthresh);
> > is necessary?
> 
> I could have written it into the comment above it, but instead chose to add 
> the assertion as an improved comment.
> 
> > I can't see any way it could happen, unless something is totally broken
> > within the app.
> 
> Agree. These are the circumstances where assertions can come into action. 
> With more assertions in the code, it is likely that less code
> executes before hitting an assertion, making it easier to find the root cause 
> of such errors.
> 
> Please note that RTE_ASSERTs are omitted unless built with RTE_ENABLE_ASSERT, 
> so this assertion is omitted and thus has no
> performance impact for a normal build.

I am aware that RTE_ASSERT is not enabled by default.
My question was more about why you feel assert() is necessary in that case 
in particular.
Does it guard against some specific scenario?
Anyway, I am ok either way: with and without assert() here.
Acked-by: Konstantin Ananyev  
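
For context, a sketch of the invariant being discussed (paraphrased
from the patch; 'len' and 'flushthresh' are the real rte_mempool_cache
fields):

#include <rte_debug.h>
#include <rte_mempool.h>

/* RTE_ASSERT() compiles to nothing unless the build defines
 * RTE_ENABLE_ASSERT, so the check is free in a normal build. */
static inline void
mempool_cache_check(const struct rte_mempool_cache *cache)
{
	/* After a flush, the cache must not hold more objects than its
	 * flush threshold. */
	RTE_ASSERT(cache->len <= cache->flushthresh);
}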



Re: [RFC PATCH 1/7] lib: add helper to read strings from sysfs files

2022-12-23 Thread Stephen Hemminger
On Fri, 23 Dec 2022 00:24:29 +0100
Tomasz Duszynski  wrote:

> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..2dbc10e6be 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,9 @@ EXPERIMENTAL {
>   rte_thread_detach;
>   rte_thread_equal;
>   rte_thread_join;
> +
> + # added in 23.03
> + eal_parse_sysfs_string; # WINDOWS_NO_EXPORT
>  };

This should be internal, not exported as a user API.
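
i.e. something along these lines (a sketch, assuming the symbol simply
moves to the INTERNAL section of lib/eal/version.map):

INTERNAL {
	global:

	eal_parse_sysfs_string; # WINDOWS_NO_EXPORT
};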



Re: Strange behavior with rte_pktmbuf_clone call

2022-12-23 Thread Stephen Hemminger
On Fri, 23 Dec 2022 07:00:40 +
NAGENDRA BALAGANI  wrote:

> Hi,
> 
> I am seeing strange behavior where rte_pktmbuf_clone() is not giving the 
> desired result.
> Here is the detailed info: in my DPDK application, once I receive the 
> packet in an mbuf, I need to send the same packet to two destinations. 
> The sequence I should follow is:
> 
> (i) First, tunnel the packet to one of the desired destinations: 
> I created a shallow copy using rte_pktmbuf_clone(), used another mbuf for 
> the outer IP header of the IPinIP tunnel, and sent it to the NIC.
> 
> (ii) Second, modify the source and destination IP addresses of the 
> packet and send it out.

As you showed, that won't work the way you think.
A shallow clone means that both versions share the same data area.
If you modify one instance, both change.

The best practice is to treat a clone (or any mbuf with refcount > 1) as 
read-only.
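
A sketch of that rule in code (illustrative only; the check is
approximate for indirect mbufs, whose shared data lives in the
underlying direct mbuf):

#include <stdint.h>
#include <rte_mbuf.h>

/* Return an mbuf that is safe to modify: take a private deep copy when
 * the mbuf may be shared, otherwise reuse it as-is. */
static struct rte_mbuf *
writable_mbuf(struct rte_mbuf *m, struct rte_mempool *pool)
{
	if (RTE_MBUF_CLONED(m) || rte_mbuf_refcnt_read(m) > 1)
		return rte_pktmbuf_copy(m, pool, 0, UINT32_MAX);
	return m;
}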



RE: [PATCH v2] mempool cache: add zero-copy get and put functions

2022-12-23 Thread Konstantin Ananyev

> > From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> > Sent: Thursday, 22 December 2022 16.57
> >
> > > Zero-copy access to mempool caches is beneficial for PMD performance,
> > and
> > > must be provided by the mempool library to fix [Bug 1052] without a
> > > performance regression.
> >
> > LGTM in general, thank you for working on it.
> > Few comments below.
> >
> > >
> > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > >
> > > v2:
> > > * Fix checkpatch warnings.
> > > * Fix missing registration of trace points.
> > > * The functions are inline, so they don't go into the map file.
> > > v1 changes from the RFC:
> > > * Removed run-time parameter checks. (Honnappa)
> > >   This is a hot fast path function; requiring correct application
> > >   behaviour, i.e. function parameters must be valid.
> > > * Added RTE_ASSERT for parameters instead.
> >
> > RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > I think it is too excessive.
> > Just:
> > if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
> > seems much more convenient for the users here and
> > closer to other mempool/ring API behavior.
> > In terms of performance - I don’t think one extra comparison here
> > would really count.
> 
> The insignificant performance degradation seems like a good tradeoff for 
> making the function more generic.
> I will update the function documentation and place the run-time check here, 
> so both trace and stats reflect what happened:
> 
>   RTE_ASSERT(cache != NULL);
>   RTE_ASSERT(mp != NULL);
> - RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> 
>   rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> +
> + if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) {
> + rte_errno = ENOSPC; /* or EINVAL? */
> + return NULL;
> + }
> 
>   /* Increment stats now, adding in mempool always succeeds. */
> 
> I will probably also be able to come up with a solution for zc_get_bulk(), so 
> both trace and stats make sense if called with n >
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> 
> >
> > I also think it would be really good to add
> > zc_(get|put)_bulk_start() and zc_(get|put)_bulk_finish(),
> > where _start would check/fill the cache and return the pointer,
> > while _finish would update cache->len.
> > Similar to what we have for the rte_ring _peek_ API.
> > That would allow extending this API usage - let's say inside PMDs
> > it could be used not only for the MBUF_FAST_FREE case, but for the generic
> > TX code path (one that has to call rte_pktmbuf_prefree_seg()) also.
> 
> I don't see a use case for zc_get_start()/_finish().
> 
> And since the mempool cache is a stack, it would *require* that the 
> application reads the array in reverse order. In such case, the
> function should not return a pointer to the array of objects, but a pointer 
> to the top of the stack.
> 
> So I prefer to stick with the single-function zero-copy get, i.e. without 
> start/finish.

Yes, it would be more complicated than just updating cache->len.
I don't have any real use-case for _get_ either - mostly just for symmetry with 
put.
 
> 
> 
> I do agree with you about the use case for zc_put_start()/_finish().
> 
> Unlike the ring, there is no need for locking with the mempool cache, so we 
> can implement something much simpler:
> 
> Instead of requiring calling both zc_put_start() and _finish() for every 
> zero-copy burst, we could add a zc_put_rewind() function, only
> to be called if some number of objects were not added anyway:
> 
> /* FIXME: Function documentation here. */
> __rte_experimental
> static __rte_always_inline void
> rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache,
>   unsigned int n)
> {
>   RTE_ASSERT(cache != NULL);
>   RTE_ASSERT(n <= cache->len);
> 
>   rte_mempool_trace_cache_zc_put_rewind(cache, n);
> 
>   /* Rewind stats. */
>   RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n);
> 
>   cache->len -= n;
> }
> 
> I have a strong preference for _rewind() over _start() and _finish(), because 
> in the full burst case, it only touches the
> rte_mempool_cache structure once, whereas splitting it up into _start() and 
> _finish() touches the rte_mempool_cache structure both
> before and after copying the array of objects.
> 
> What do you think?

And your concern is that between _get_start(_C_) and _get_finish(_C_) the _C_
cache line can be bumped out of the CPU Dcache, right?
I don't think such a situation would be common.
But, if you think _rewind_ is a better approach - I am ok with it. 
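
For readers following along, a rough sketch of how the proposed
put + rewind pair might be used (the function names exist only in this
discussion - the API was a proposal at this point - and keep_object()
is a hypothetical filter):

#include <rte_mempool.h>

/* Reserve n slots directly in the cache, fill only the objects that
 * really go back to this pool, then hand back the unused slots. */
static void
zc_put_example(struct rte_mempool_cache *cache, struct rte_mempool *mp,
	       void **objs, unsigned int n)
{
	void **slots = rte_mempool_cache_zc_put_bulk(cache, mp, n);
	unsigned int i, filled = 0;

	if (slots == NULL)
		return;

	for (i = 0; i < n; i++)
		if (keep_object(objs[i]))
			slots[filled++] = objs[i];

	if (filled < n)
		rte_mempool_cache_zc_put_rewind(cache, n - filled);
}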
 

> I am open for other names than _rewind(), so feel free to speak up if you 
> have a better name.
> 
> 
> >
> > >   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> > > * Removed fallback when 'cache' parameter is not set. (Honnappa)
> > > * Chose the simple get function; i.e. do not move the existing
> > objects in
> > >   the cache to the top of the new stack, just leave them at the
> > bottom.
> > > * Renamed the functions. Other