Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

2019-12-27 Thread Olivier Matz
On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> Hi Bao-Long,
> 
> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > Hi,
> > 
> > I'm not sure if this is a bug, but I've seen an inconsistency in the
> > behavior of DPDK with regards to hugepage allocation for rte_mempool.
> > Basically, for the same mempool size, the number of hugepages allocated
> > changes from run to run.
> > Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> > 
> > 1. Reserve 16x1G hugepages on socket 0 
> > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> > make && ./build/basicfwd 
> > 3. At the same time, watch the number of hugepages allocated 
> > "watch -n.1 ls /dev/hugepages"
> > 4. Repeat step 2
> > 
> > If you can reproduce, you should see that for some runs, DPDK allocates 5
> > hugepages, other times it allocates 6. When it allocates 6, if you watch
> > the output from step 3, you will see that DPDK first tries to allocate 5
> > hugepages, then unmaps all 5, retries, and gets 6.
> 
> I cannot reproduce under the same conditions as yours (with 16 hugepages
> on socket 0), but I think I can see a similar issue:
> 
> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> are used). If I reserve 5 hugepages, it takes more time,
> taking/releasing hugepages several times, and it finally succeeds with 5
> hugepages.
> 
> > For our use case, it's important that DPDK allocates the same number of
> > hugepages on every run so we can get reproducible results.
> 
> One possibility is to use the --legacy-mem EAL option. It will try to
> reserve all hugepages first.

Passing --socket-mem=5120,0 also does the job.

> > Studying the code, this seems to be the behavior of
> > rte_mempool_populate_default(). If I understand correctly, if the first
> > try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
> > IOVA-contiguous condition, and eventually winds up with 6 hugepages.
> 
> No, I think you don't have the IOVA-contiguous constraint in your
> case. This is what I see:
> 
> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> c- the total element size (with header) is 2304 + 64 = 2368
> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>    obj_per_page = 453438 (453438 * 2368 = 1073741184)
>    mem_size = 4966058495
> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>  mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>  align=64)
>    For some reason, it fails: we can see that the number of mapped hugepages
>    increases in /dev/hugepages, then returns to its original value.
>    I don't think it should fail here.
> f- then, it will try to allocate the biggest available contiguous zone. In
>    my case, it is 1055291776 bytes (almost all of the single mapped
>    hugepage). This is a second problem: if we call it again, it returns
>    NULL, because it won't map another hugepage.
> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>    (the mempool header), and it triggers the mapping of a new hugepage,
>    which will be used in the next loop, back at step d with a smaller mem_size.
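The sizing arithmetic in steps b-d can be double-checked with a quick sketch. The constants are taken from the trace above; the exact value DPDK's helper computes differs by a few bytes due to alignment, so only the per-page count and the page-count conclusion are asserted:

```python
# Constants from the trace above (DPDK 19.11, 1G hugepages).
PG_SIZE = 1073741824      # one 1G hugepage
ELT_SIZE = 2304 + 64      # element size + header = 2368
N_OBJS = 2097151          # requested mempool size

# Objects that fit in a single page, as computed in
# rte_mempool_op_calc_mem_size_helper().
obj_per_page = PG_SIZE // ELT_SIZE

# Rough required size: full pages plus the tail objects.
full_pages, tail = divmod(N_OBJS, obj_per_page)
mem_size = full_pages * PG_SIZE + tail * ELT_SIZE

print(obj_per_page)             # 453438
print(mem_size < 5 * PG_SIZE)   # True: the request fits in 5 x 1G
```

So the requested memzone is just under 5 GiB, which is why 5 hugepages should be enough, and why the observed sixth page is surprising.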
> 
> > Questions: 
> > 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage 
> > memory 
> > is abundant? 
> 
> In my case, it looks like we have a bit less than 1G free at the end of
> the heap when we call rte_memzone_reserve_aligned(size=5G).
> The allocator ends up mapping 5 pages (and failing), while only 4 are
> needed.
> 
> Anatoly, do you have any idea? Shouldn't we take into account the amount
> of free space at the end of the heap when expanding?
> 
> > 2. Why does the 2nd retry need N+1 hugepages?
> 
> When the first alloc fails, the mempool code tries to allocate in
> several chunks which are not virtually contiguous. This is needed in
> case the memory is fragmented.
> 
> > Some insights for Q1: From my experiments, it seems like the IOVA of the
> > first hugepage is not guaranteed to be at the start of the IOVA space
> > (understandably). It could explain the retry when the IOVA of the first
> > hugepage is near the end of the IOVA space. But I have also seen
> > situations where the 1st hugepage is near the beginning of the IOVA
> > space and it still failed the 1st time.
> > 
> > Here's the code:
> > #include 
> > #include 
> > 
> > int
> > main(int argc, char *argv[])
> > {
> > struct rte_mempool *mbuf_pool;
> > unsigned mbuf_pool_size = 2097151;
> > 
> > int ret = rte_eal_init(argc, argv);
> > if (ret < 0)
> > rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> > 
> > printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > 256, 0, RTE_MBUF_

Re: [dpdk-dev] [PATCH v5 0/4] drivers/net: cleanup Tx buffers

2019-12-27 Thread Zhang, Xiao


> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Chenxu Di
> Sent: Friday, December 27, 2019 11:45 AM
> To: dev@dpdk.org
> Cc: Yang, Qiming ; Xing, Beilei
> ; Di, ChenxuX 
> Subject: [dpdk-dev] [PATCH v5 0/4] drivers/net: cleanup Tx buffers
> 
> Add support to the drivers including i40e, ice, ixgbe and igb VF for the API
> rte_eth_tx_done_cleanup to force-free consumed buffers on the Tx ring.
> 
> ---
> v2:
> added code about igb vf.
> v3:
> changed information of author
> v4:
> changed code.
> v5:
> fixed code and notes.
> removed code for fm10k.
> 
> Chenxu Di (4):
>   net/i40e: cleanup Tx buffers
>   net/ice: cleanup Tx buffers
>   net/ixgbe: cleanup Tx buffers
>   net/e1000: cleanup Tx buffers
> 
>  drivers/net/e1000/igb_ethdev.c|   1 +
>  drivers/net/i40e/i40e_ethdev.c|   1 +
>  drivers/net/i40e/i40e_ethdev_vf.c |   1 +
>  drivers/net/i40e/i40e_rxtx.c  | 122 +
>  drivers/net/i40e/i40e_rxtx.h  |   1 +
>  drivers/net/ice/ice_ethdev.c  |   1 +
>  drivers/net/ice/ice_rxtx.c| 123 ++
>  drivers/net/ice/ice_rxtx.h|   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.c  |   2 +
>  drivers/net/ixgbe/ixgbe_rxtx.c| 121 +
>  drivers/net/ixgbe/ixgbe_rxtx.h|   2 +
>  11 files changed, 376 insertions(+)
> 
> --
> 2.17.1

Reviewed-by: Xiao Zhang 


Re: [dpdk-dev] [RFC v2] mlx5/net: hint PMD not to inline packet

2019-12-27 Thread Olivier Matz
Hi Viacheslav,

On Wed, Dec 11, 2019 at 05:01:33PM +, Viacheslav Ovsiienko wrote:
> Some PMDs inline the mbuf data buffer directly to device transmit descriptor.
> This is in order to save the overhead of the PCI headers imposed when the
> device DMA reads the data by buffer pointer. For some devices it is essential
> in order to provide the full bandwidth.
> 
> However, there are cases where such inlining is inefficient. For example,
> when the data buffer resides in another device's memory (like a GPU or
> storage device). Attempting to inline such a buffer will result in high PCI
> overhead for reading and copying the data from the remote device to the
> host memory.
> 
> To support a mixed traffic pattern (some buffers from local host memory, some
> buffers from other devices) with high bandwidth, a hint flag is introduced in
> the mbuf.
> 
> The application will hint the PMD whether or not it should try to inline the
> given mbuf data buffer. The PMD should make a best effort to act upon this
> request.
> 
> The hint flag RTE_NET_MLX5_DYNFLAG_NO_INLINE_NAME is supposed to be dynamic,
> registered by the application with rte_mbuf_dynflag_register(). This flag is
> purely vendor-specific and declared in the PMD-specific header rte_pmd_mlx5.h,
> which is intended to be used by specific applications.
> 
> To query the supported specific flags in runtime the private routine is
> introduced:
> 
> int rte_pmd_mlx5_get_dyn_flag_names(uint16_t port,
>                                     char *names[],
>                                     uint16_t n)
> 
> It returns the array of specific flags currently supported (given the
> present hardware and configuration).
> 
> The "not inline hint" feature operating flow is the following one:
> - application start
> - probe the devices, ports are created
> - query the port capabilities
> - if port supporting the feature is found
>   - register dynamic flag RTE_NET_MLX5_DYNFLAG_NO_INLINE_NAME
> - application starts the ports
> - on dev_start() PMD checks whether the feature flag is registered and
>   enables the feature support in datapath
> - application might set this flag in the ol_flags field of the mbuf in the
>   packets being sent and the PMD will handle them appropriately.
> 
> Signed-off-by: Shahaf Shuler 
> Signed-off-by: Viacheslav Ovsiienko 
> 
> ---
> v1: https://patches.dpdk.org/patch/61348/
> 

It looks like the patch is missing.

I think a dynamic flag is a good solution for this problem: the application
can send a PMD-specific hint to the PMD, without impacting the way it works
today.


Olivier


Re: [dpdk-dev] [PATCH v2] mbuf: display more fields in dump

2019-12-27 Thread Olivier Matz
Hi,

On Thu, Dec 26, 2019 at 08:58:51AM -0800, Stephen Hemminger wrote:
> On Thu, 26 Dec 2019 17:15:39 +0100
> Olivier Matz  wrote:
> 
> > Hi,
> > 
> > On Thu, Nov 21, 2019 at 10:30:55AM -0800, Stephen Hemminger wrote:
> > > The rte_pktmbuf_dump should display offset, refcount, and vlan
> > > info since these are often useful during debugging.
> > > 
> > > Signed-off-by: Stephen Hemminger 
> > > ---
> > > v2 - remove casts, change in_port to port
> > >  the refcount and offset are property of per-segment
> > > 
> > >  lib/librte_mbuf/rte_mbuf.c | 17 ++---
> > >  1 file changed, 10 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> > > index 35df1c4c38a5..4894d46628e3 100644
> > > --- a/lib/librte_mbuf/rte_mbuf.c
> > > +++ b/lib/librte_mbuf/rte_mbuf.c
> > > @@ -473,18 +473,21 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, 
> > > unsigned dump_len)
> > >  
> > >   __rte_mbuf_sanity_check(m, 1);
> > >  
> > > - fprintf(f, "dump mbuf at %p, iova=%"PRIx64", buf_len=%u\n",
> > > -m, (uint64_t)m->buf_iova, (unsigned)m->buf_len);
> > > - fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx64", nb_segs=%u, "
> > > -"in_port=%u\n", m->pkt_len, m->ol_flags,
> > > -(unsigned)m->nb_segs, (unsigned)m->port);
> > > + fprintf(f, "dump mbuf at %p, iova=%#"PRIx64", buf_len=%u\n",
> > > + m, m->buf_iova, m->buf_len);
> > > + fprintf(f,
> > > + "  pkt_len=%u, ol_flags=%#"PRIx64", nb_segs=%u, port=%u, 
> > > vlan_tci=%#x\n",
> > > + m->pkt_len, m->ol_flags, m->nb_segs, m->port, m->vlan_tci);
> > > +
> > >   nb_segs = m->nb_segs;
> > >  
> > >   while (m && nb_segs != 0) {
> > >   __rte_mbuf_sanity_check(m, 0);
> > >  
> > > - fprintf(f, "  segment at %p, data=%p, data_len=%u\n",
> > > - m, rte_pktmbuf_mtod(m, void *), (unsigned)m->data_len);
> > > + fprintf(f, "  segment at %p, data=%p, len=%u, off=%u, 
> > > refcnt=%u\n",
> > > + m, rte_pktmbuf_mtod(m, void *),
> > > + m->data_len, m->data_off, rte_mbuf_refcnt_read(m));
> > > +
> > >   len = dump_len;
> > >   if (len > m->data_len)
> > >   len = m->data_len;
> > > -- 
> > > 2.20.1
> > >   
> > 
> > Thanks for this enhancement.
> > 
> > One comment however: I don't see why vlan_tci would be more important
> > than packet type or RSS tag. It is also a bit confusing to display a field
> > which can have a random value (if the vlan flag is not set in ol_flags).
> > 
> > I'd prefer to dump vlan_tci only if the flag is present. Later we could
> > add more fields with the same logic. What do you think?
> 
> Because this is a debugging hook, it is easier to always display the value.
> That's all.

This is precisely because it is a debug hook that I find it dangerous to
display a wrong value. As mbufs are recycled, an mbuf that does not
have a vlan_tci can display its previous value, which may look valid.

What about simply doing this?

	fprintf(f,
		"  pkt_len=%u, ol_flags=%#"PRIx64", nb_segs=%u, port=%u",
		m->pkt_len, m->ol_flags, m->nb_segs, m->port);
	if (m->ol_flags & PKT_RX_VLAN)
		fprintf(f, ", vlan_tci=%#x", m->vlan_tci);
	fprintf(f, "\n");

If you want I can submit an updated version of your patch.
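The flag-gated dump can be sketched outside DPDK. The snippet below uses a hypothetical stand-in for the PKT_RX_VLAN bit and plain function arguments rather than the real rte_mbuf types; it only illustrates why gating on the flag matters for recycled mbufs:

```python
# Hypothetical stand-in for illustration -- not the real DPDK definition.
PKT_RX_VLAN = 1 << 0

def dump_hdr(pkt_len, ol_flags, nb_segs, port, vlan_tci):
    """Format the mbuf header line, printing vlan_tci only when the
    flag says the field holds a meaningful value."""
    line = "  pkt_len=%u, ol_flags=%#x, nb_segs=%u, port=%u" % (
        pkt_len, ol_flags, nb_segs, port)
    if ol_flags & PKT_RX_VLAN:
        line += ", vlan_tci=%#x" % vlan_tci
    return line

# A recycled mbuf may carry a stale vlan_tci; without the flag check,
# the 0xdead below would be printed and could look like a valid tag.
with_vlan = dump_hdr(60, PKT_RX_VLAN, 1, 0, 0x64)
stale = dump_hdr(60, 0, 1, 0, 0xdead)
print(with_vlan)   # ... vlan_tci=0x64
print(stale)       # no vlan_tci field printed
```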


Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

2019-12-27 Thread Bao-Long Tran
Hi Olivier,

> On 27 Dec 2019, at 4:11 PM, Olivier Matz  wrote:
> 
> On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
>> Hi Bao-Long,
>> 
>> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
>>> Hi,
>>> 
>>> I'm not sure if this is a bug, but I've seen an inconsistency in the
>>> behavior of DPDK with regards to hugepage allocation for rte_mempool.
>>> Basically, for the same mempool size, the number of hugepages allocated
>>> changes from run to run.
>>> 
>>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
>>> 
>>> 1. Reserve 16x1G hugepages on socket 0 
>>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
>>> make && ./build/basicfwd 
>>> 3. At the same time, watch the number of hugepages allocated 
>>> "watch -n.1 ls /dev/hugepages"
>>> 4. Repeat step 2
>>> 
>>> If you can reproduce, you should see that for some runs, DPDK allocates 5
>>> hugepages, other times it allocates 6. When it allocates 6, if you watch
>>> the output from step 3, you will see that DPDK first tries to allocate 5
>>> hugepages, then unmaps all 5, retries, and gets 6.
>> 
>> I cannot reproduce under the same conditions as yours (with 16 hugepages
>> on socket 0), but I think I can see a similar issue:
>> 
>> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
>> are used). If I reserve 5 hugepages, it takes more time,
>> taking/releasing hugepages several times, and it finally succeeds with 5
>> hugepages.

My apologies: I just checked again; I was using DPDK 19.05, not 19.11 or master.
Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.

I also saw your patch to reduce wasted memory (eba11e). It seems like it
resolves the problem with the IOVA-contig constraint that I described in my
first message. I'll look into it to confirm.

If I cannot repro my issue (different number of hugepages) with 19.11, from
our side we can upgrade to 19.11 and that's all we need for now. But let me
also try to repro the issue you described (multiple attempts to allocate
hugepages).

>> 
>>> For our use case, it's important that DPDK allocates the same number of
>>> hugepages on every run so we can get reproducible results.
>> 
>> One possibility is to use the --legacy-mem EAL option. It will try to
>> reserve all hugepages first.
> 
> Passing --socket-mem=5120,0 also does the job.
> 

>>> Studying the code, this seems to be the behavior of
>>> rte_mempool_populate_default(). If I understand correctly, if the first
>>> try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
>>> IOVA-contiguous condition, and eventually winds up with 6 hugepages.
>> 
>> No, I think you don't have the IOVA-contiguous constraint in your
>> case. This is what I see:
>> 
>> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
>> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
>> c- the total element size (with header) is 2304 + 64 = 2368
>> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>>   obj_per_page = 453438 (453438 * 2368 = 1073741184)
>>   mem_size = 4966058495
>> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>>   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>> mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>> align=64)
>>   For some reason, it fails: we can see that the number of mapped hugepages
>>   increases in /dev/hugepages, then returns to its original value.
>>   I don't think it should fail here.
>> f- then, it will try to allocate the biggest available contiguous zone. In
>>   my case, it is 1055291776 bytes (almost all of the single mapped
>>   hugepage). This is a second problem: if we call it again, it returns
>>   NULL, because it won't map another hugepage.
>> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>>   (the mempool header), and it triggers the mapping of a new hugepage,
>>   which will be used in the next loop, back at step d with a smaller mem_size.
>> 

>>> Questions: 
>>> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage 
>>> memory 
>>> is abundant? 
>> 
>> In my case, it looks like we have a bit less than 1G free at the end of
>> the heap when we call rte_memzone_reserve_aligned(size=5G).
>> The allocator ends up mapping 5 pages (and failing), while only 4 are
>> needed.
>> 
>> Anatoly, do you have any idea? Shouldn't we take into account the amount
>> of free space at the end of the heap when expanding?
>> 
>>> 2. Why does the 2nd retry need N+1 hugepages?
>> 
>> When the first alloc fails, the mempool code tries to allocate in
>> several chunks which are not virtually contiguous. This is needed in
>> case the memory is fragmented.
>> 
>>> Some insights for Q1: From my experiments, it seems like the IOVA of the
>>> first hugepage is not guaranteed to be at the start of the IOVA space
>>> (understandably).
>>> It could explain the retry when the IOVA of

Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

2019-12-27 Thread Olivier Matz
Hi Bao-Long,

On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
> Hi Olivier,
> 
> > On 27 Dec 2019, at 4:11 PM, Olivier Matz  wrote:
> > 
> > On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> >> Hi Bao-Long,
> >> 
> >> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> >>> Hi,
> >>> 
> >>> I'm not sure if this is a bug, but I've seen an inconsistency in the
> >>> behavior of DPDK with regards to hugepage allocation for rte_mempool.
> >>> Basically, for the same mempool size, the number of hugepages allocated
> >>> changes from run to run.
> >>> 
> >>> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> >>> 
> >>> 1. Reserve 16x1G hugepages on socket 0 
> >>> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> >>> make && ./build/basicfwd 
> >>> 3. At the same time, watch the number of hugepages allocated 
> >>> "watch -n.1 ls /dev/hugepages"
> >>> 4. Repeat step 2
> >>> 
> >>> If you can reproduce, you should see that for some runs, DPDK allocates 5
> >>> hugepages, other times it allocates 6. When it allocates 6, if you watch
> >>> the output from step 3, you will see that DPDK first tries to allocate 5
> >>> hugepages, then unmaps all 5, retries, and gets 6.
> >> 
> >> I cannot reproduce under the same conditions as yours (with 16 hugepages
> >> on socket 0), but I think I can see a similar issue:
> >> 
> >> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> >> are used). If I reserve 5 hugepages, it takes more time,
> >> taking/releasing hugepages several times, and it finally succeeds with 5
> >> hugepages.
> 
> My apologies: I just checked again; I was using DPDK 19.05, not 19.11 or
> master.
> Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.
> 
> I also saw your patch to reduce wasted memory (eba11e). It seems like it
> resolves the problem with the IOVA-contig constraint that I described in my
> first message. I'll look into it to confirm.
> 
> If I cannot repro my issue (different number of hugepages) with 19.11, from
> our side we can upgrade to 19.11 and that's all we need for now. But let me
> also try to repro the issue you described (multiple attempts to allocate
> hugepages).

OK, thanks.

Anyway, I think there is an issue in 19.11. And it is even worse with
2M hugepages. Let's say we reserve 500x 2M hugepages, and try to
allocate a mempool of 5G:

1/ mempool_populate tries to allocate in one virtually contiguous block,
   which maps all 500 hugepages, then fails, unmapping them
2/ it tries to allocate the largest zone, which returns ~2MB.
3/ this zone is added to the mempool, and for that, it allocates a
   mem_header struct, which triggers the mapping of a new page.
4/ Back to 1... until it fails after 3 minutes.

The memzone allocation of the "largest available area" does not have the
same semantics across memory models (pre-mapped hugepages or not). When
using dynamic hugepage mapping, it won't map any additional hugepage.
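The retry loop described above can be simulated with a toy model. This is not the actual EAL allocator, just its step 1-4 control flow with the numbers from the example (500x 2M pages, a request larger than the pool), to show why it takes minutes: each round trip maps and unmaps every remaining page but makes only one page of progress:

```python
# Toy model of the retry loop above: 500x 2M pages, request larger than
# the pool, so the loop runs until the pages are exhausted.
PG = 2 * 1024 * 1024
TOTAL_PAGES = 500
goal = TOTAL_PAGES * PG          # the 5G request can never be satisfied

populated = 0
rounds = 0
map_unmap_ops = 0
while populated < goal:
    rounds += 1
    free_pages = TOTAL_PAGES - populated // PG
    # 1/ one contiguous block: maps all remaining pages, fails, unmaps them
    map_unmap_ops += 2 * free_pages
    # 2/-3/ "largest available area" yields ~one page, added to the mempool
    populated += PG

print(rounds)           # 500 round trips
print(map_unmap_ops)    # 250500 page map/unmap operations
```

One page of progress per round trip, with hundreds of map/unmap syscalls each time, is what turns the allocation into a multi-minute failure.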

To solve the issue, we could either change it to allocate all available
hugepages, or change mempool populate to not use the "largest available
area" allocation, doing the search ourselves.


> 
> >> 
> >>> For our use case, it's important that DPDK allocates the same number of
> >>> hugepages on every run so we can get reproducible results.
> >> 
> >> One possibility is to use the --legacy-mem EAL option. It will try to
> >> reserve all hugepages first.
> > 
> > Passing --socket-mem=5120,0 also does the job.
> > 
> 
> >>> Studying the code, this seems to be the behavior of
> >>> rte_mempool_populate_default(). If I understand correctly, if the first
> >>> try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
> >>> IOVA-contiguous condition, and eventually winds up with 6 hugepages.
> >> 
> >> No, I think you don't have the IOVA-contiguous constraint in your
> >> case. This is what I see:
> >> 
> >> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> >> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> >> c- the total element size (with header) is 2304 + 64 = 2368
> >> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
> >>   obj_per_page = 453438 (453438 * 2368 = 1073741184)
> >>   mem_size = 4966058495
> >> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
> >>   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
> >> mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
> >> align=64)
> >>   For some reason, it fails: we can see that the number of mapped hugepages
> >>   increases in /dev/hugepages, then returns to its original value.
> >>   I don't think it should fail here.
> >> f- then, it will try to allocate the biggest available contiguous zone. In
> >>   my case, it is 1055291776 bytes (almost all of the single mapped
> >>   hugepage). This is a second problem: if we call it again, it returns 

[dpdk-dev] [PATCH v2 0/6] OCTEON TX2 End Point Driver

2019-12-27 Thread Mahipal Challa
This patchset adds support for the OCTEON TX2 end point mode of operation.
The driver implementation uses the DPDK rawdev subsystem.

v2:

* Updated memory barrier APIs as per Gavin Hu's suggestion.

Mahipal Challa (6):
  raw/octeontx2_ep: add build infra and device probe
  raw/octeontx2_ep: add device configuration
  raw/octeontx2_ep: add device uninitialization
  raw/octeontx2_ep: add enqueue operation
  raw/octeontx2_ep: add dequeue operation
  raw/octeontx2_ep: add driver self test

 MAINTAINERS|   5 +
 config/common_base |   5 +
 doc/guides/rawdevs/index.rst   |   1 +
 doc/guides/rawdevs/octeontx2_ep.rst|  89 +++
 drivers/common/octeontx2/hw/otx2_sdp.h | 184 +
 drivers/common/octeontx2/otx2_common.c |   9 +
 drivers/common/octeontx2/otx2_common.h |   4 +
 .../octeontx2/rte_common_octeontx2_version.map |   6 +
 drivers/raw/Makefile   |   1 +
 drivers/raw/meson.build|   1 +
 drivers/raw/octeontx2_ep/Makefile  |  44 ++
 drivers/raw/octeontx2_ep/meson.build   |   9 +
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c  | 844 +
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.h  |  52 ++
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c  | 361 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h  | 499 
 drivers/raw/octeontx2_ep/otx2_ep_test.c| 164 
 drivers/raw/octeontx2_ep/otx2_ep_vf.c  | 476 
 drivers/raw/octeontx2_ep/otx2_ep_vf.h  |  10 +
 .../rte_rawdev_octeontx2_ep_version.map|   4 +
 mk/rte.app.mk  |   2 +
 21 files changed, 2770 insertions(+)
 create mode 100644 doc/guides/rawdevs/octeontx2_ep.rst
 create mode 100644 drivers/common/octeontx2/hw/otx2_sdp.h
 create mode 100644 drivers/raw/octeontx2_ep/Makefile
 create mode 100644 drivers/raw/octeontx2_ep/meson.build
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.h
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_test.c
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_vf.c
 create mode 100644 drivers/raw/octeontx2_ep/otx2_ep_vf.h
 create mode 100644 drivers/raw/octeontx2_ep/rte_rawdev_octeontx2_ep_version.map

-- 
1.8.3.1



[dpdk-dev] [dpdk-dev PATCH v2 2/6] raw/octeontx2_ep: add device configuration

2019-12-27 Thread Mahipal Challa
Register "dev_configure" API to configure/initialize the SDP
VF PCIe devices.

Signed-off-by: Mahipal Challa 
---
 doc/guides/rawdevs/octeontx2_ep.rst|  29 ++
 drivers/common/octeontx2/hw/otx2_sdp.h | 184 +
 drivers/common/octeontx2/otx2_common.c |   9 +
 drivers/common/octeontx2/otx2_common.h |   4 +
 .../octeontx2/rte_common_octeontx2_version.map |   6 +
 drivers/raw/octeontx2_ep/Makefile  |   3 +
 drivers/raw/octeontx2_ep/meson.build   |   4 +-
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c  | 294 ++
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.h  |  11 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c  | 148 +++
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h  | 434 -
 drivers/raw/octeontx2_ep/otx2_ep_vf.c  | 408 +++
 drivers/raw/octeontx2_ep/otx2_ep_vf.h  |  10 +
 13 files changed, 1542 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rawdevs/octeontx2_ep.rst 
b/doc/guides/rawdevs/octeontx2_ep.rst
index 5f5ed01..2507fcf 100644
--- a/doc/guides/rawdevs/octeontx2_ep.rst
+++ b/doc/guides/rawdevs/octeontx2_ep.rst
@@ -39,3 +39,32 @@ entry `sriov_numvfs` for the corresponding PF driver.
 
 Once the required VFs are enabled, to be accessible from DPDK, VFs need to be
 bound to vfio-pci driver.
+
+Device Setup
+
+
+The OCTEON TX2 SDP End Point VF devices will need to be bound to a
+user-space IO driver for use. The ``dpdk-devbind.py`` script included
+with DPDK can be used to view the state of the devices and to bind
+them to a suitable DPDK-supported kernel driver. When querying the status
+of the devices, they will appear under the category of "Misc (rawdev)
+devices", i.e. the command ``dpdk-devbind.py --status-dev misc`` can be
+used to see the state of those devices alone.
+
+Device Configuration
+
+
+Configuring the SDP EP rawdev device is done using the ``rte_rawdev_configure()``
+API, which takes the mempool as a parameter. The PMD uses this pool to
+send/receive packets to/from the HW.
+
+The following code shows how the device is configured
+
+.. code-block:: c
+
+   struct sdp_rawdev_info config = {0};
+   struct rte_rawdev_info rdev_info = {.dev_private = &config};
+   config.enqdeq_mpool = (void *)rte_mempool_create(...);
+
+   rte_rawdev_configure(dev_id, (rte_rawdev_obj_t)&rdev_info);
+
diff --git a/drivers/common/octeontx2/hw/otx2_sdp.h 
b/drivers/common/octeontx2/hw/otx2_sdp.h
new file mode 100644
index 000..7e03317
--- /dev/null
+++ b/drivers/common/octeontx2/hw/otx2_sdp.h
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2019 Marvell International Ltd.
+ */
+
+#ifndef __OTX2_SDP_HW_H_
+#define __OTX2_SDP_HW_H_
+
+/* SDP VF IOQs */
+#define SDP_MIN_RINGS_PER_VF(1)
+#define SDP_MAX_RINGS_PER_VF(8)
+
+/* SDP VF IQ configuration */
+#define SDP_VF_MAX_IQ_DESCRIPTORS   (512)
+#define SDP_VF_MIN_IQ_DESCRIPTORS   (128)
+
+#define SDP_VF_DB_MIN   (1)
+#define SDP_VF_DB_TIMEOUT   (1)
+#define SDP_VF_INTR_THRESHOLD   (0x)
+
+#define SDP_VF_64BYTE_INSTR (64)
+#define SDP_VF_32BYTE_INSTR (32)
+
+/* SDP VF OQ configuration */
+#define SDP_VF_MAX_OQ_DESCRIPTORS   (512)
+#define SDP_VF_MIN_OQ_DESCRIPTORS   (128)
+#define SDP_VF_OQ_BUF_SIZE  (2048)
+#define SDP_VF_OQ_REFIL_THRESHOLD   (16)
+
+#define SDP_VF_OQ_INFOPTR_MODE  (1)
+#define SDP_VF_OQ_BUFPTR_MODE   (0)
+
+#define SDP_VF_OQ_INTR_PKT  (1)
+#define SDP_VF_OQ_INTR_TIME (10)
+#define SDP_VF_CFG_IO_QUEUESSDP_MAX_RINGS_PER_VF
+
+/* Wait time in milliseconds for FLR */
+#define SDP_VF_PCI_FLR_WAIT (100)
+#define SDP_VF_BUSY_LOOP_COUNT  (1)
+
+#define SDP_VF_MAX_IO_QUEUESSDP_MAX_RINGS_PER_VF
+#define SDP_VF_MIN_IO_QUEUESSDP_MIN_RINGS_PER_VF
+
+/* SDP VF IOQs per rawdev */
+#define SDP_VF_MAX_IOQS_PER_RAWDEV  SDP_VF_MAX_IO_QUEUES
+#define SDP_VF_DEFAULT_IOQS_PER_RAWDEV  SDP_VF_MIN_IO_QUEUES
+
+/* SDP VF Register definitions */
+#define SDP_VF_RING_OFFSET(0x1ull << 17)
+
+/* SDP VF IQ Registers */
+#define SDP_VF_R_IN_CONTROL_START (0x1)
+#define SDP_VF_R_IN_ENABLE_START  (0x10010)
+#define SDP_VF_R_IN_INSTR_BADDR_START (0x10020)
+#define SDP_VF_R_IN_INSTR_RSIZE_START (0x10030)
+#define SDP_VF_R_IN_INSTR_DBELL_START (0x10040)
+#define SDP_VF_R_IN_CNTS_START(0x10050)
+#define SDP_VF_R_IN_INT_LEVELS_START  (0x10060)
+#define SDP_VF_R_IN_PKT_CNT_START (0x10080)
+#define SDP_VF_R_IN_BYTE_CNT_START(0x10090)
+
+#define SDP_VF_R_IN_CONTROL(ring)  \
+   (SDP_VF_R_IN_CONTROL_START + ((ring) * SDP_VF_RING_OFFSET))
+
+#define SDP_VF_R_IN_ENABLE(ring)   \
+   (SDP_VF_R_IN_ENABLE_START + ((ring) * SDP_VF_RING_OFFSET))
+
+#define SDP_VF_R_IN_INSTR_BADDR(ring)   \
+   (SDP_VF_R_IN_INSTR_BADDR_START 

[dpdk-dev] [dpdk-dev PATCH v2 1/6] raw/octeontx2_ep: add build infra and device probe

2019-12-27 Thread Mahipal Challa
Add the OCTEON TX2 SDP EP device probe along with the
build infrastructure for Make and meson builds.

Signed-off-by: Mahipal Challa 
---
 MAINTAINERS|   5 +
 config/common_base |   5 +
 doc/guides/rawdevs/index.rst   |   1 +
 doc/guides/rawdevs/octeontx2_ep.rst|  41 +++
 drivers/raw/Makefile   |   1 +
 drivers/raw/meson.build|   1 +
 drivers/raw/octeontx2_ep/Makefile  |  40 +++
 drivers/raw/octeontx2_ep/meson.build   |   6 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c  | 132 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h  |  21 
 .../rte_rawdev_octeontx2_ep_version.map|   4 +
 mk/rte.app.mk  |   2 +
 12 files changed, 259 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4395d8d..24f1240 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1173,6 +1173,11 @@ M: Vamsi Attunuru 
 F: drivers/raw/octeontx2_dma/
 F: doc/guides/rawdevs/octeontx2_dma.rst
 
+Marvell OCTEON TX2 EP
+M: Mahipal Challa 
+F: drivers/raw/octeontx2_ep/
+F: doc/guides/rawdevs/octeontx2_ep.rst
+
 NTB
 M: Xiaoyun Li 
 M: Jingjing Wu 
diff --git a/config/common_base b/config/common_base
index 7dec7ed..8e7dad2 100644
--- a/config/common_base
+++ b/config/common_base
@@ -796,6 +796,11 @@ CONFIG_RTE_LIBRTE_PMD_IOAT_RAWDEV=y
 CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_DMA_RAWDEV=y
 
 #
+# Compile PMD for octeontx2 EP raw device
+#
+CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV=y
+
+#
 # Compile PMD for NTB raw device
 #
 CONFIG_RTE_LIBRTE_PMD_NTB_RAWDEV=y
diff --git a/doc/guides/rawdevs/index.rst b/doc/guides/rawdevs/index.rst
index 22bc013..f64ec44 100644
--- a/doc/guides/rawdevs/index.rst
+++ b/doc/guides/rawdevs/index.rst
@@ -17,3 +17,4 @@ application through rawdev API.
 ioat
 ntb
 octeontx2_dma
+octeontx2_ep
diff --git a/doc/guides/rawdevs/octeontx2_ep.rst 
b/doc/guides/rawdevs/octeontx2_ep.rst
new file mode 100644
index 000..5f5ed01
--- /dev/null
+++ b/doc/guides/rawdevs/octeontx2_ep.rst
@@ -0,0 +1,41 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2019 Marvell International Ltd.
+
+Marvell OCTEON TX2 End Point Rawdev Driver
+==
+
+OCTEON TX2 has an internal SDP unit which provides End Point mode of operation
+by exposing its IOQs to the Host; IOQs are used for packet I/O between the
+Host and OCTEON TX2. Each OCTEON TX2 SDP PF supports a max of 128 VFs and
+each VF is associated with a set of IOQ pairs.
+
+Features
+
+
+This OCTEON TX2 End Point mode PMD supports
+
+#. Packet Input - Host to OCTEON TX2 with direct data instruction mode.
+
+#. Packet Output - OCTEON TX2 to Host with info pointer mode.
+
+Config File Options
+~~~
+
+The following options can be modified in the ``config`` file.
+
+- ``CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV`` (default ``y``)
+
+  Toggle compilation of the ``lrte_pmd_octeontx2_ep`` driver.
+
+Initialization
+--------------
+
+The number of SDP VFs enabled can be controlled by setting the sysfs
+entry ``sriov_numvfs`` of the corresponding PF driver.
+
+.. code-block:: console
+
+ echo  > /sys/bus/pci/drivers/octeontx2-ep/\:04\:00.0/sriov_numvfs
+
+Once the required VFs are enabled, they need to be bound to the vfio-pci
+driver to be accessible from DPDK.
diff --git a/drivers/raw/Makefile b/drivers/raw/Makefile
index 0b6d13d..80b043e 100644
--- a/drivers/raw/Makefile
+++ b/drivers/raw/Makefile
@@ -13,5 +13,6 @@ DIRS-$(CONFIG_RTE_LIBRTE_PMD_IFPGA_RAWDEV) += ifpga
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_IOAT_RAWDEV) += ioat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_NTB_RAWDEV) += ntb
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_DMA_RAWDEV) += octeontx2_dma
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV) += octeontx2_ep
 
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/drivers/raw/meson.build b/drivers/raw/meson.build
index d7037cd..bb57977 100644
--- a/drivers/raw/meson.build
+++ b/drivers/raw/meson.build
@@ -4,6 +4,7 @@
 drivers = ['dpaa2_cmdif', 'dpaa2_qdma',
'ifpga', 'ioat', 'ntb',
'octeontx2_dma',
+   'octeontx2_ep',
'skeleton']
 std_deps = ['rawdev']
 config_flag_fmt = 'RTE_LIBRTE_PMD_@0@_RAWDEV'
diff --git a/drivers/raw/octeontx2_ep/Makefile 
b/drivers/raw/octeontx2_ep/Makefile
new file mode 100644
index 000..8cec6bd
--- /dev/null
+++ b/drivers/raw/octeontx2_ep/Makefile
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2019 Marvell International Ltd.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Library name
+LIB = librte_rawdev_octeontx2_ep.a
+
+# Build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+CFLAGS += -I$(RTE_SDK)/drivers/common/octeontx2/
+CFLAGS += -I$(RTE_SDK)/drivers/raw/octeontx2_ep/
+
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_rawdev
+LDLIBS += -lrte_bus_pci
+LDLIBS += -lrte_mempool
+LDLIBS += -lrte_com

[dpdk-dev] [dpdk-dev PATCH v2 3/6] raw/octeontx2_ep: add device uninitialization

2019-12-27 Thread Mahipal Challa
Add rawdev close/uninitialize operation for SDP
VF devices.

Signed-off-by: Mahipal Challa 
---
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c | 111 ++
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c |  78 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h |   8 +++
 drivers/raw/octeontx2_ep/otx2_ep_vf.c |  44 
 4 files changed, 241 insertions(+)

diff --git a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c 
b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
index 8857004..584b818 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
+++ b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
@@ -21,6 +21,59 @@
 #include "otx2_common.h"
 #include "otx2_ep_enqdeq.h"
 
+static void
+sdp_dmazone_free(const struct rte_memzone *mz)
+{
+   const struct rte_memzone *mz_tmp;
+   int ret = 0;
+
+   if (mz == NULL) {
+   otx2_err("Invalid memzone: NULL");
+   return;
+   }
+
+   mz_tmp = rte_memzone_lookup(mz->name);
+   if (mz_tmp == NULL) {
+   otx2_err("Memzone %s Not Found", mz->name);
+   return;
+   }
+
+   ret = rte_memzone_free(mz);
+   if (ret)
+   otx2_err("Memzone free failed : ret = %d", ret);
+
+}
+
+/* Free IQ resources */
+int
+sdp_delete_iqs(struct sdp_device *sdpvf, uint32_t iq_no)
+{
+   struct sdp_instr_queue *iq;
+
+   iq = sdpvf->instr_queue[iq_no];
+   if (iq == NULL) {
+   otx2_err("Invalid IQ[%d]\n", iq_no);
+   return -ENOMEM;
+   }
+
+   rte_free(iq->req_list);
+   iq->req_list = NULL;
+
+   if (iq->iq_mz) {
+   sdp_dmazone_free(iq->iq_mz);
+   iq->iq_mz = NULL;
+   }
+
+   rte_free(sdpvf->instr_queue[iq_no]);
+   sdpvf->instr_queue[iq_no] = NULL;
+
+   sdpvf->num_iqs--;
+
+   otx2_info("IQ[%d] is deleted", iq_no);
+
+   return 0;
+}
+
 /* IQ initialization */
 static int
 sdp_init_instr_queue(struct sdp_device *sdpvf, int iq_no)
@@ -126,6 +179,7 @@
return 0;
 
 delete_IQ:
+   sdp_delete_iqs(sdpvf, iq_no);
return -ENOMEM;
 }
 
@@ -139,6 +193,61 @@
rte_atomic64_set(&droq->pkts_pending, 0);
 }
 
+static void
+sdp_droq_destroy_ring_buffers(struct sdp_device *sdpvf,
+   struct sdp_droq *droq)
+{
+   uint32_t idx;
+
+   for (idx = 0; idx < droq->nb_desc; idx++) {
+   if (droq->recv_buf_list[idx].buffer) {
+   rte_mempool_put(sdpvf->enqdeq_mpool,
+   droq->recv_buf_list[idx].buffer);
+
+   droq->recv_buf_list[idx].buffer = NULL;
+   }
+   }
+
+   sdp_droq_reset_indices(droq);
+}
+
+/* Free OQs resources */
+int
+sdp_delete_oqs(struct sdp_device *sdpvf, uint32_t oq_no)
+{
+   struct sdp_droq *droq;
+
+   droq = sdpvf->droq[oq_no];
+   if (droq == NULL) {
+   otx2_err("Invalid droq[%d]", oq_no);
+   return -ENOMEM;
+   }
+
+   sdp_droq_destroy_ring_buffers(sdpvf, droq);
+   rte_free(droq->recv_buf_list);
+   droq->recv_buf_list = NULL;
+
+   if (droq->info_mz) {
+   sdp_dmazone_free(droq->info_mz);
+   droq->info_mz = NULL;
+   }
+
+   if (droq->desc_ring_mz) {
+   sdp_dmazone_free(droq->desc_ring_mz);
+   droq->desc_ring_mz = NULL;
+   }
+
+   memset(droq, 0, SDP_DROQ_SIZE);
+
+   rte_free(sdpvf->droq[oq_no]);
+   sdpvf->droq[oq_no] = NULL;
+
+   sdpvf->num_oqs--;
+
+   otx2_info("OQ[%d] is deleted", oq_no);
+   return 0;
+}
+
 static int
 sdp_droq_setup_ring_buffers(struct sdp_device *sdpvf,
struct sdp_droq *droq)
@@ -290,5 +399,7 @@
return 0;
 
 delete_OQ:
+   sdp_delete_oqs(sdpvf, oq_no);
return -ENOMEM;
 }
+
diff --git a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c 
b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
index 0c56609..3db5a74 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
+++ b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
@@ -63,6 +63,45 @@
 }
 
 static int
+sdp_vfdev_exit(struct rte_rawdev *rawdev)
+{
+   struct sdp_device *sdpvf;
+   uint32_t rawdev_queues, q;
+
+   otx2_info("%s:", __func__);
+
+   sdpvf = (struct sdp_device *)rawdev->dev_private;
+
+   sdpvf->fn_list.disable_io_queues(sdpvf);
+
+   rawdev_queues = sdpvf->num_oqs;
+   for (q = 0; q < rawdev_queues; q++) {
+   if (sdp_delete_oqs(sdpvf, q)) {
+   otx2_err("Failed to delete OQ:%d", q);
+   return -ENOMEM;
+   }
+   }
+   otx2_info("Num OQs:%d freed", sdpvf->num_oqs);
+
+   /* Free the oqbuf_pool */
+   rte_mempool_free(sdpvf->enqdeq_mpool);
+   sdpvf->enqdeq_mpool = NULL;
+
+   otx2_info("Enqdeq_mpool free done");
+
+   rawdev_queues = sdpvf->num_iqs;
+   for (q = 0; q < rawdev_queues; q++) {
+   if (sdp_delete_iqs(sdpvf, q)) {
+

[dpdk-dev] [dpdk-dev PATCH v2 4/6] raw/octeontx2_ep: add enqueue operation

2019-12-27 Thread Mahipal Challa
Add rawdev enqueue operation for SDP VF devices.

Signed-off-by: Mahipal Challa 
---
 doc/guides/rawdevs/octeontx2_ep.rst   |   6 +
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c | 242 ++
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.h |  39 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c |   1 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h |  20 +++
 drivers/raw/octeontx2_ep/otx2_ep_vf.c |  24 +++
 6 files changed, 332 insertions(+)

diff --git a/doc/guides/rawdevs/octeontx2_ep.rst 
b/doc/guides/rawdevs/octeontx2_ep.rst
index 2507fcf..39a7c29 100644
--- a/doc/guides/rawdevs/octeontx2_ep.rst
+++ b/doc/guides/rawdevs/octeontx2_ep.rst
@@ -68,3 +68,9 @@ The following code shows how the device is configured
 
rte_rawdev_configure(dev_id, (rte_rawdev_obj_t)&rdev_info);
 
+Performing Data Transfer
+
+
+To perform data transfer using SDP VF EP rawdev devices use standard
+``rte_rawdev_enqueue_buffers()`` and ``rte_rawdev_dequeue_buffers()`` APIs.
+
diff --git a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c 
b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
index 584b818..6910f08 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
+++ b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
@@ -403,3 +403,245 @@
return -ENOMEM;
 }
 
+static inline void
+sdp_iqreq_delete(struct sdp_device *sdpvf,
+   struct sdp_instr_queue *iq, uint32_t idx)
+{
+   uint32_t reqtype;
+   void *buf;
+
+   buf = iq->req_list[idx].buf;
+   reqtype = iq->req_list[idx].reqtype;
+
+   switch (reqtype) {
+   case SDP_REQTYPE_NORESP:
+   rte_mempool_put(sdpvf->enqdeq_mpool, buf);
+   otx2_sdp_dbg("IQ buffer freed at idx[%d]", idx);
+   break;
+
+   case SDP_REQTYPE_NORESP_GATHER:
+   case SDP_REQTYPE_NONE:
+   default:
+   otx2_info("This iqreq mode is not supported:%d", reqtype);
+
+   }
+
+   /* Reset the request list at this index */
+   iq->req_list[idx].buf = NULL;
+   iq->req_list[idx].reqtype = 0;
+}
+
+static inline void
+sdp_iqreq_add(struct sdp_instr_queue *iq, void *buf,
+   uint32_t reqtype)
+{
+   iq->req_list[iq->host_write_index].buf = buf;
+   iq->req_list[iq->host_write_index].reqtype = reqtype;
+
+   otx2_sdp_dbg("IQ buffer added at idx[%d]", iq->host_write_index);
+
+}
+
+static void
+sdp_flush_iq(struct sdp_device *sdpvf,
+   struct sdp_instr_queue *iq,
+   uint32_t pending_thresh __rte_unused)
+{
+   uint32_t instr_processed = 0;
+
+   rte_spinlock_lock(&iq->lock);
+
+   iq->otx_read_index = sdpvf->fn_list.update_iq_read_idx(iq);
+   while (iq->flush_index != iq->otx_read_index) {
+   /* Free the IQ data buffer to the pool */
+   sdp_iqreq_delete(sdpvf, iq, iq->flush_index);
+   iq->flush_index =
+   sdp_incr_index(iq->flush_index, 1, iq->nb_desc);
+
+   instr_processed++;
+   }
+
+   iq->stats.instr_processed = instr_processed;
+   rte_atomic64_sub(&iq->instr_pending, instr_processed);
+
+   rte_spinlock_unlock(&iq->lock);
+}
+
+static inline void
+sdp_ring_doorbell(struct sdp_device *sdpvf __rte_unused,
+   struct sdp_instr_queue *iq)
+{
+   otx2_write64(iq->fill_cnt, iq->doorbell_reg);
+
+   /* Make sure doorbell write goes through */
+   rte_cio_wmb();
+   iq->fill_cnt = 0;
+
+}
+
+static inline int
+post_iqcmd(struct sdp_instr_queue *iq, uint8_t *iqcmd)
+{
+   uint8_t *iqptr, cmdsize;
+
+   /* This ensures that the read index does not wrap around to
+* the same position if queue gets full before OCTEON TX2 could
+* fetch any instr.
+*/
+   if (rte_atomic64_read(&iq->instr_pending) >=
+ (int32_t)(iq->nb_desc - 1)) {
+   otx2_err("IQ is full, pending:%ld",
+(long)rte_atomic64_read(&iq->instr_pending));
+
+   return SDP_IQ_SEND_FAILED;
+   }
+
+   /* Copy cmd into iq */
+   cmdsize = ((iq->iqcmd_64B) ? 64 : 32);
+   iqptr   = iq->base_addr + (cmdsize * iq->host_write_index);
+
+   rte_memcpy(iqptr, iqcmd, cmdsize);
+
+   otx2_sdp_dbg("IQ cmd posted @ index:%d", iq->host_write_index);
+
+   /* Increment the host write index */
+   iq->host_write_index =
+   sdp_incr_index(iq->host_write_index, 1, iq->nb_desc);
+
+   iq->fill_cnt++;
+
+   /* Flush the command into memory. We need to be sure the data
+* is in memory before indicating that the instruction is
+* pending.
+*/
+   rte_io_wmb();
+   rte_atomic64_inc(&iq->instr_pending);
+
+   /* SDP_IQ_SEND_SUCCESS */
+   return 0;
+}
+
+
+static int
+sdp_send_data(struct sdp_device *sdpvf,
+ struct sdp_instr_queue *iq, void *cmd)
+{
+   uint32_t ret;
+
+   /* Lock this IQ command queue before posting instruction */
+   rte_

[dpdk-dev] [dpdk-dev PATCH v2 5/6] raw/octeontx2_ep: add dequeue operation

2019-12-27 Thread Mahipal Challa
Add rawdev dequeue operation for SDP VF devices.

Signed-off-by: Mahipal Challa 
---
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c | 197 ++
 drivers/raw/octeontx2_ep/otx2_ep_enqdeq.h |   2 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c |   1 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h |  18 ++-
 4 files changed, 217 insertions(+), 1 deletion(-)

diff --git a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c 
b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
index 6910f08..dd270b5 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
+++ b/drivers/raw/octeontx2_ep/otx2_ep_enqdeq.c
@@ -260,6 +260,7 @@
rte_mempool_get(sdpvf->enqdeq_mpool, &buf);
if (buf == NULL) {
otx2_err("OQ buffer alloc failed");
+   droq->stats.rx_alloc_failure++;
/* sdp_droq_destroy_ring_buffers(droq);*/
return -ENOMEM;
}
@@ -645,3 +646,199 @@
return SDP_IQ_SEND_FAILED;
 }
 
+static uint32_t
+sdp_droq_refill(struct sdp_device *sdpvf, struct sdp_droq *droq)
+{
+   struct sdp_droq_desc *desc_ring;
+   uint32_t desc_refilled = 0;
+   void *buf = NULL;
+
+   desc_ring = droq->desc_ring;
+
+   while (droq->refill_count && (desc_refilled < droq->nb_desc)) {
+   /* If a valid buffer exists (happens if there is no dispatch),
+* reuse the buffer, else allocate.
+*/
+   if (droq->recv_buf_list[droq->refill_idx].buffer != NULL)
+   break;
+
+   rte_mempool_get(sdpvf->enqdeq_mpool, &buf);
+   /* If a buffer could not be allocated, no point in
+* continuing
+*/
+   if (buf == NULL) {
+   droq->stats.rx_alloc_failure++;
+   break;
+   }
+
+   droq->recv_buf_list[droq->refill_idx].buffer = buf;
+   desc_ring[droq->refill_idx].buffer_ptr = rte_mem_virt2iova(buf);
+
+   /* Reset any previous values in the length field. */
+   droq->info_list[droq->refill_idx].length = 0;
+
+   droq->refill_idx = sdp_incr_index(droq->refill_idx, 1,
+   droq->nb_desc);
+
+   desc_refilled++;
+   droq->refill_count--;
+
+   }
+
+   return desc_refilled;
+}
+
+static int
+sdp_droq_read_packet(struct sdp_device *sdpvf __rte_unused,
+struct sdp_droq *droq,
+struct sdp_droq_pkt *droq_pkt)
+{
+   struct sdp_droq_info *info;
+   uint32_t total_len = 0;
+   uint32_t pkt_len = 0;
+
+   info = &droq->info_list[droq->read_idx];
+   sdp_swap_8B_data((uint64_t *)&info->length, 1);
+   if (!info->length) {
+   otx2_err("OQ info_list->length[%ld]", (long)info->length);
+   goto oq_read_fail;
+   }
+
+   /* Deduce the actual data size */
+   info->length -= SDP_RH_SIZE;
+   total_len += (uint32_t)info->length;
+
+   otx2_sdp_dbg("OQ: pkt_len[%ld], buffer_size %d",
+   (long)info->length, droq->buffer_size);
+   if (info->length > droq->buffer_size) {
+   otx2_err("This mode is not supported: pkt_len > buffer_size");
+   goto oq_read_fail;
+   }
+
+   if (info->length <= droq->buffer_size) {
+   pkt_len = (uint32_t)info->length;
+   droq_pkt->data = droq->recv_buf_list[droq->read_idx].buffer;
+   droq_pkt->len  = pkt_len;
+
+   droq->recv_buf_list[droq->read_idx].buffer = NULL;
+   droq->read_idx = sdp_incr_index(droq->read_idx, 1,/* count */
+   droq->nb_desc /* max rd idx */);
+   droq->refill_count++;
+
+   }
+
+   info->length = 0;
+
+   return SDP_OQ_RECV_SUCCESS;
+
+oq_read_fail:
+   return SDP_OQ_RECV_FAILED;
+}
+
+static inline uint32_t
+sdp_check_droq_pkts(struct sdp_droq *droq, uint32_t burst_size)
+{
+   uint32_t min_pkts = 0;
+   uint32_t new_pkts;
+   uint32_t pkt_count;
+
+   /* Latest available OQ packets */
+   pkt_count = rte_read32(droq->pkts_sent_reg);
+
+   /* Newly arrived packets */
+   new_pkts = pkt_count - droq->last_pkt_count;
+   otx2_sdp_dbg("Recvd [%d] new OQ pkts", new_pkts);
+
+   min_pkts = (new_pkts > burst_size) ? burst_size : new_pkts;
+   if (min_pkts) {
+   rte_atomic64_add(&droq->pkts_pending, min_pkts);
+   /* Back up the aggregated packet count so far */
+   droq->last_pkt_count += min_pkts;
+   }
+
+   return min_pkts;
+}
+
+/* Check for response arrival from OCTEON TX2
+ * returns number of requests completed
+ */
+int
+sdp_rawdev_dequeue(struct rte_rawdev *rawdev,
+  struct rte_rawdev_buf **buffers, unsigned int count,
+  rte_rawdev_obj_t context __rte_unuse

[dpdk-dev] [dpdk-dev PATCH v2 6/6] raw/octeontx2_ep: add driver self test

2019-12-27 Thread Mahipal Challa
Add the rawdev selftest feature in the SDP VF driver, which
verifies EP mode functionality.

Signed-off-by: Mahipal Challa 
---
 doc/guides/rawdevs/octeontx2_ep.rst   |  13 +++
 drivers/raw/octeontx2_ep/Makefile |   1 +
 drivers/raw/octeontx2_ep/meson.build  |   1 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.c |   1 +
 drivers/raw/octeontx2_ep/otx2_ep_rawdev.h |   2 +
 drivers/raw/octeontx2_ep/otx2_ep_test.c   | 164 ++
 6 files changed, 182 insertions(+)

diff --git a/doc/guides/rawdevs/octeontx2_ep.rst 
b/doc/guides/rawdevs/octeontx2_ep.rst
index 39a7c29..bbcf530 100644
--- a/doc/guides/rawdevs/octeontx2_ep.rst
+++ b/doc/guides/rawdevs/octeontx2_ep.rst
@@ -74,3 +74,16 @@ Performing Data Transfer
 To perform data transfer using SDP VF EP rawdev devices use standard
 ``rte_rawdev_enqueue_buffers()`` and ``rte_rawdev_dequeue_buffers()`` APIs.
 
+Self test
+---------
+
+On EAL initialization, SDP VF devices will be probed and populated into the
+list of raw devices. The rawdev ID of a device can be obtained as follows:
+
+* Invoke ``rte_rawdev_get_dev_id("SDPEP:x")`` from the test application,
+  where x is the VF device's bus ID specified in "bus:device.func" (BDF)
+  format. Use this index for further rawdev function calls.
+
+* The driver's selftest rawdev API can be used to verify SDP EP mode
+  functionality; it sends raw data packets to the EP device and validates
+  the packets received back.
diff --git a/drivers/raw/octeontx2_ep/Makefile 
b/drivers/raw/octeontx2_ep/Makefile
index 02853fb..44fdf89 100644
--- a/drivers/raw/octeontx2_ep/Makefile
+++ b/drivers/raw/octeontx2_ep/Makefile
@@ -37,6 +37,7 @@ LIBABIVER := 1
 #
 SRCS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV) += otx2_ep_rawdev.c
 SRCS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV) += otx2_ep_enqdeq.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV) += otx2_ep_test.c
 SRCS-$(CONFIG_RTE_LIBRTE_PMD_OCTEONTX2_EP_RAWDEV) += otx2_ep_vf.c
 
 
diff --git a/drivers/raw/octeontx2_ep/meson.build 
b/drivers/raw/octeontx2_ep/meson.build
index 99e6c6d..0e6338f 100644
--- a/drivers/raw/octeontx2_ep/meson.build
+++ b/drivers/raw/octeontx2_ep/meson.build
@@ -5,4 +5,5 @@
 deps += ['bus_pci', 'common_octeontx2', 'rawdev']
 sources = files('otx2_ep_rawdev.c',
'otx2_ep_enqdeq.c',
+   'otx2_ep_test.c',
'otx2_ep_vf.c')
diff --git a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c 
b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
index 7158b97..0778603 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
+++ b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.c
@@ -253,6 +253,7 @@
.dev_close  = sdp_rawdev_close,
.enqueue_bufs   = sdp_rawdev_enqueue,
.dequeue_bufs   = sdp_rawdev_dequeue,
+   .dev_selftest   = sdp_rawdev_selftest,
 };
 
 static int
diff --git a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.h 
b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.h
index a77cbab..dab2fb7 100644
--- a/drivers/raw/octeontx2_ep/otx2_ep_rawdev.h
+++ b/drivers/raw/octeontx2_ep/otx2_ep_rawdev.h
@@ -494,4 +494,6 @@ int sdp_rawdev_enqueue(struct rte_rawdev *dev, struct 
rte_rawdev_buf **buffers,
 int sdp_rawdev_dequeue(struct rte_rawdev *dev, struct rte_rawdev_buf **buffers,
   unsigned int count, rte_rawdev_obj_t context);
 
+int sdp_rawdev_selftest(uint16_t dev_id);
+
 #endif /* _OTX2_EP_RAWDEV_H_ */
diff --git a/drivers/raw/octeontx2_ep/otx2_ep_test.c 
b/drivers/raw/octeontx2_ep/otx2_ep_test.c
new file mode 100644
index 000..96fedb5
--- /dev/null
+++ b/drivers/raw/octeontx2_ep/otx2_ep_test.c
@@ -0,0 +1,164 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2019 Marvell International Ltd.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "otx2_common.h"
+#include "otx2_ep_rawdev.h"
+
+#define SDP_IOQ_NUM_BUFS   (4 * 1024)
+#define SDP_IOQ_BUF_SIZE   (2 * 1024)
+
+#define SDP_TEST_PKT_FSZ   (0)
+#define SDP_TEST_PKT_SIZE  (1024)
+
+static int
+sdp_validate_data(struct sdp_droq_pkt *oq_pkt, uint8_t *iq_pkt,
+ uint32_t pkt_len)
+{
+   if (!oq_pkt)
+   return -EINVAL;
+
+   if (pkt_len != oq_pkt->len) {
+   otx2_err("Invalid packet length");
+   return -EINVAL;
+   }
+
+   if (memcmp(oq_pkt->data, iq_pkt, pkt_len) != 0) {
+   otx2_err("Data validation failed");
+   return -EINVAL;
+   }
+   otx2_sdp_dbg("Data validation successful");
+
+   return 0;
+}
+
+static void
+sdp_ioq_buffer_fill(uint8_t *addr, uint32_t len)
+{
+   uint32_t idx;
+
+   memset(addr, 0, len);
+
+   for (idx = 0; idx < len; idx++)
+   addr[idx] = idx;
+}
+
+static struct rte_mempool*
+sdp_ioq_mempool_create(void)
+{
+   struct rte_mempool *mpool;
+
+   mpool = rte_mempool_create("ioqbuf_pool",
+  SDP_IOQ_NUM_BUFS /*num elt*/,
+  SDP_IOQ_BUF

Re: [dpdk-dev] [PATCH v2 04/11] examples/l3fwd: add ethdev setup based on eventdev

2019-12-27 Thread Nipun Gupta



> -Original Message-
> From: dev  On Behalf Of
> pbhagavat...@marvell.com
> Sent: Wednesday, December 4, 2019 8:14 PM
> To: jer...@marvell.com; Marko Kovacevic ; Ori
> Kam ; Bruce Richardson
> ; Radu Nicolau ;
> Akhil Goyal ; Tomasz Kantecki
> ; Sunil Kumar Kori ;
> Pavan Nikhilesh 
> Cc: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 04/11] examples/l3fwd: add ethdev setup
> based on eventdev
> 
> From: Sunil Kumar Kori 
> 
> Add ethernet port Rx/Tx queue setup for event device which are later
> used for setting up event eth Rx/Tx adapters.
> 
> Signed-off-by: Sunil Kumar Kori 
> ---
>  examples/l3fwd/l3fwd.h   |  10 +++
>  examples/l3fwd/l3fwd_event.c | 129
> ++-
>  examples/l3fwd/l3fwd_event.h |   2 +-
>  examples/l3fwd/main.c|  15 ++--
>  4 files changed, 144 insertions(+), 12 deletions(-)
> 



> +
> + local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
> +
>   dev_info.flow_type_rss_offloads;
> + if (local_port_conf.rx_adv_conf.rss_conf.rss_hf !=
> + port_conf->rx_adv_conf.rss_conf.rss_hf) {
> + printf("Port %u modified RSS hash function "
> +"based on hardware support,"
> +"requested:%#"PRIx64"
> configured:%#"PRIx64"\n",
> +port_id,
> +port_conf->rx_adv_conf.rss_conf.rss_hf,
> +local_port_conf.rx_adv_conf.rss_conf.rss_hf);
> + }

We are using 1 queue, but using RSS hash function?

> +
> + ret = rte_eth_dev_configure(port_id, 1, 1, &local_port_conf);
> + if (ret < 0)
> + rte_exit(EXIT_FAILURE,
> +  "Cannot configure device: err=%d,
> port=%d\n",
> +  ret, port_id);
> +

We should be using the number of RX queues as per the config option provided
in the arguments.
L3fwd is supposed to support multiple queues. Right?

Regards,
Nipun




Re: [dpdk-dev] questions about new offload ethdev api

2019-12-27 Thread Olivier Matz
Hi,

Few comments below.

On Mon, Dec 16, 2019 at 11:39:05AM +0300, Andrew Rybchenko wrote:
> On 12/10/19 9:07 PM, Ferruh Yigit wrote:
> > On 1/23/2018 2:34 PM, Shahaf Shuler wrote:
> >> Tuesday, January 23, 2018 3:53 PM, Olivier Matz:
> > 
> > <...>
> > 
> >>>
> >>> 2/ meaning of rxmode.jumbo_frame, rxmode.enable_scatter, 
> >>> rxmode.max_rx_pkt_len
> >>>
> >>> While it's not related to the new API, it is probably a good opportunity
> >>> to clarify the meaning of these flags. I'm not able to find a good
> >>> documentation about them.
> >>>
> >>> Here is my understanding, the configuration only depends on: - the maximum
> >>> rx frame length - the amount of data available in a mbuf (minus headroom)
> >>>
> >>> Flags to set in rxmode (example): 
> >>> +---+++-+ |
> >>> |mbuf_data_len=1K|mbuf_data_len=2K|mbuf_data_len=16K| 
> >>> +---+++-+ 
> >>> |max_rx_len=1500|enable_scatter  || | 
> >>> +---+++-+ 
> >>> |max_rx_len=9000|enable_scatter, |enable_scatter, |jumbo_frame  | |
> >>> |jumbo_frame |jumbo_frame | | 
> >>> +---+++-+

Due to successive quotes, the table was not readable in my mail client,
here it is again (narrower):

++---+---+---+
||mbuf_data_len= |mbuf_data_len= |mbuf_data_len= |
||1K |2K |16K|
++---+---+---+
|max_rx_len= |enable_scatter |   |   |
|1500|   |   |   |
++---+---+---+
|max_rx_len= |enable_scatter,|enable_scatter,|jumbo_frame|
|9000|jumbo_frame|jumbo_frame|   |
++---+---+---+

> >>> If this table is correct, the flag jumbo_frame would be equivalent to 
> >>> check
> >>> if max_rx_pkt_len is above a threshold.
> >>>
> >>> And enable_scatter could be deduced from the mbuf size of the given rxq 
> >>> (which is a bit harder but maybe doable).
> >>
> >> I glad you raised this subject. We had a lot of discussion on it internally
> >> in Mellanox.
> >>
> >> I fully agree. All application needs is to specify the maximum packet size 
> >> it
> >> wants to receive.
> >>
> >> I think also the lack of documentation is causing PMDs to use those flags
> >> wrongly. For example - some PMDs set the jumbo_frame flag internally 
> >> without
> >> it being set by the application.
> >>
> >> I would like to add one more item : MTU. What is the relation (if any)
> >> between setting MTU and the max_rx_len ? I know MTU stands for Max Transmit
> >> Unit, however at least in Linux it is the same for the Send and the 
> >> receive.
> >>
> >>
> > 
> > (Resurrecting the thread after two years, I will reply again with latest
> > understanding.)
> > 
> > Thanks Olivier for above summary and table, and unfortunately usage still 
> > not
> > consistent between PMDs. According my understanding:
> > 
> > 'max_rx_pkt_len' is user configuration value, to limit the size packet that 
> > is
> > shared with host, but this doesn't limit the size of packet that NIC 
> > receives.

When you say the size of packet shared with the host, do you mean for
instance that the NIC will receive a 1500B packet and will only write
128 bytes of data in the mbuf?

If yes, this was not my understanding. I suppose it could be used for
monitoring. What should be the value for rx offload infos like checksum
or packet type if the packet (or the header) is truncated?

> Also comment in lib/librte_ethdev/rte_ethdev.h says that the
> rxmode field is used if (and I think only if) JUMBO_FRAME is
> enabled. So, if user wants to set it on device configure stage,
> device *must* support JUMBO_FRAME offload which mean that
> driver code handles rxmode.max_rx_pkt_len and either accept it
> and configures HW appropriately or return an error if specified
> value is wrong. Basically it is written in jumbo frame feature
> definition in features.rst. User has max_rx_pktlen in dev_info
> to find out maximum supported value for max_rx_pkt_len.
> 
> > Like if the mbuf size of the mempool used by a queue is 1024 bytes, we don't
> > want packets bigger than buffer size, but if NIC supports it is possible 
> > receive
> > 6000 bytes packet and split data into multiple buffers, and we can use multi
> > segment packets to represent it.
> > So what we need is NIC ability to limit the size of data to share to host 
> > and
> > scattered Rx support (device + driver).
> 
> It requires RX_SCATTER offload enabled and it must be
> controlled by the user only (not PMD) since it basically
> mean if the application is ready to handle

Re: [dpdk-dev] questions about new offload ethdev api

2019-12-27 Thread Ananyev, Konstantin



> -Original Message-
> From: Olivier Matz 
> Sent: Friday, December 27, 2019 1:54 PM
> To: Andrew Rybchenko 
> Cc: Yigit, Ferruh ; Shahaf Shuler 
> ; dev@dpdk.org; Ananyev, Konstantin
> ; Thomas Monjalon ; 
> Richardson, Bruce ; Matan
> Azrad ; Jerin Jacob Kollanukkaran 
> Subject: Re: [dpdk-dev] questions about new offload ethdev api
> 
> Hi,
> 
> Few comments below.
> 
> On Mon, Dec 16, 2019 at 11:39:05AM +0300, Andrew Rybchenko wrote:
> > On 12/10/19 9:07 PM, Ferruh Yigit wrote:
> > > On 1/23/2018 2:34 PM, Shahaf Shuler wrote:
> > >> Tuesday, January 23, 2018 3:53 PM, Olivier Matz:
> > >
> > > <...>
> > >
> > >>>
> > >>> 2/ meaning of rxmode.jumbo_frame, rxmode.enable_scatter,
> > >>> rxmode.max_rx_pkt_len
> > >>>
> > >>> While it's not related to the new API, it is probably a good opportunity
> > >>> to clarify the meaning of these flags. I'm not able to find a good
> > >>> documentation about them.
> > >>>
> > >>> Here is my understanding, the configuration only depends on: - the 
> > >>> maximum
> > >>> rx frame length - the amount of data available in a mbuf (minus 
> > >>> headroom)
> > >>>
> > >>> Flags to set in rxmode (example):
> > >>> +---+++-+ |
> > >>> |mbuf_data_len=1K|mbuf_data_len=2K|mbuf_data_len=16K|
> > >>> +---+++-+
> > >>> |max_rx_len=1500|enable_scatter  || |
> > >>> +---+++-+
> > >>> |max_rx_len=9000|enable_scatter, |enable_scatter, |jumbo_frame  | |
> > >>> |jumbo_frame |jumbo_frame | |
> > >>> +---+++-+
> 
> Due to successive quotes, the table was not readable in my mail client,
> here it is again (narrower):
> 
> ++---+---+---+
> ||mbuf_data_len= |mbuf_data_len= |mbuf_data_len= |
> ||1K |2K |16K|
> ++---+---+---+
> |max_rx_len= |enable_scatter |   |   |
> |1500|   |   |   |
> ++---+---+---+
> |max_rx_len= |enable_scatter,|enable_scatter,|jumbo_frame|
> |9000|jumbo_frame|jumbo_frame|   |
> ++---+---+---+
> 
> > >>> If this table is correct, the flag jumbo_frame would be equivalent to 
> > >>> check
> > >>> if max_rx_pkt_len is above a threshold.
> > >>>
> > >>> And enable_scatter could be deduced from the mbuf size of the given rxq
> > >>> (which is a bit harder but maybe doable).
> > >>
> > >> I glad you raised this subject. We had a lot of discussion on it 
> > >> internally
> > >> in Mellanox.
> > >>
> > >> I fully agree. All application needs is to specify the maximum packet 
> > >> size it
> > >> wants to receive.
> > >>
> > >> I think also the lack of documentation is causing PMDs to use those flags
> > >> wrongly. For example - some PMDs set the jumbo_frame flag internally 
> > >> without
> > >> it being set by the application.
> > >>
> > >> I would like to add one more item : MTU. What is the relation (if any)
> > >> between setting MTU and the max_rx_len ? I know MTU stands for Max 
> > >> Transmit
> > >> Unit, however at least in Linux it is the same for the Send and the 
> > >> receive.
> > >>
> > >>
> > >
> > > (Resurrecting the thread after two years, I will reply again with latest
> > > understanding.)
> > >
> > > Thanks Olivier for above summary and table, and unfortunately usage still 
> > > not
> > > consistent between PMDs. According my understanding:
> > >
> > > 'max_rx_pkt_len' is user configuration value, to limit the size packet 
> > > that is
> > > shared with host, but this doesn't limit the size of packet that NIC 
> > > receives.
> 
> When you say the size of packet shared with the host, do you mean for
> instance that the NIC will receive a 1500B packet and will only write
> 128 bytes of data in the mbuf?
> 
> If yes, this was not my understanding. I suppose it could be used for
> monitoring. What should be the value for rx offload infos like checksum
> or packet type if the packet (or the header) is truncated?
> 
> > Also comment in lib/librte_ethdev/rte_ethdev.h says that the
> > rxmode field is used if (and I think only if) JUMBO_FRAME is
> > enabled. So, if user wants to set it on device configure stage,
> > device *must* support JUMBO_FRAME offload which mean that
> > driver code handles rxmode.max_rx_pkt_len and either accept it
> > and configures HW appropriately or return an error if specified
> > value is wrong. Basically it is written in jumbo frame feature
> > definition in features.rst. User has max_rx_pktlen in dev_info
> > to find out maximum supported value for max_rx_pkt_len.
> >
> > > L

Re: [dpdk-dev] [RFC PATCH] mbuf: fix to update documentation of PKT_RX_QINQ_STRIPPED

2019-12-27 Thread Olivier Matz
Hi,

On Tue, Dec 24, 2019 at 12:53:21PM +0300, Andrew Rybchenko wrote:
> On 12/24/19 6:16 AM, Somnath Kotur wrote:
> > Given that we haven't heard any objection from anyone in a while on
> > this ...can we get this in please?
> 
> I'm sorry, but have you seen below?
> It means that PKT_RX_QINQ_STRIPPED, PKT_RX_QINQ, PKT_RX_VLAN
> and PKT_RX_VLAN_STRIPPED must be clarified.
> 
> It sounds like change of semantics in order to resolve the
> problem, but anyway it is still a small change of semantics.

Let's take this packet:
packet = ether | outer-vlan | inner-vlan | ...

The possible mbufs received from a PMD, depending on configuration, are:

1/ no flag (no offload)

2/ PKT_RX_VLAN
   packet data is unmodified
   m->vlan_tci=outer-vlan

3/ PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED
   outer-vlan is removed from packet data
   m->vlan_tci=outer-vlan

4/ PKT_RX_VLAN | PKT_RX_QINQ
   packet data is unmodified
   m->vlan_tci_outer=outer-vlan
   m->vlan_tci=inner-vlan

5/ PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED | PKT_RX_QINQ

   from PKT_RX_VLAN_STRIPPED:
 A vlan has been stripped by the hardware and its tci is saved in
 mbuf->vlan_tci.
   from PKT_RX_QINQ:
 The RX packet is a double VLAN, and the outer tci has been
  saved in mbuf->vlan_tci_outer.

   To me, it means:

   inner-vlan is removed from packet data
   m->vlan_tci_outer=outer-vlan
   m->vlan_tci=inner-vlan

6/ PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED | PKT_RX_QINQ | PKT_RX_QINQ_STRIPPED
   both outer-vlan and inner-vlan removed from packet data
   m->vlan_tci_outer=outer-vlan
   m->vlan_tci=inner-vlan

Other flag combinations are not supported.
The proposed patch documents that this new flag combination is now allowed:

7/ PKT_RX_VLAN | PKT_RX_QINQ | PKT_RX_QINQ_STRIPPED
   outer-vlan is removed from packet data
   m->vlan_tci_outer=outer-vlan
   m->vlan_tci=inner-vlan

Except if I missed something, I don't see any semantic change in
the previously supported cases.

I think we should by the way clarify what happens in 5/, probably by
saying somewhere that as soon as PKT_RX_QINQ is set, PKT_RX_VLAN*
refer to inner vlan.
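To make the combinations above easier to follow, here is a self-contained C sketch that maps an ol_flags value back to the case numbers 1-7 used in this discussion. The flag values mirror rte_mbuf_core.h (DPDK 19.11), but the helper `vlan_case()` is purely illustrative and not part of the mbuf API:

```c
#include <stdint.h>

/* Flag values as in rte_mbuf_core.h (DPDK 19.11), redefined here so the
 * sketch is self-contained. */
#define PKT_RX_VLAN           (1ULL << 0)
#define PKT_RX_VLAN_STRIPPED  (1ULL << 6)
#define PKT_RX_QINQ_STRIPPED  (1ULL << 15)
#define PKT_RX_QINQ           (1ULL << 20)

/* Hypothetical helper: map ol_flags to the case numbers 1-7 above.
 * Returns 0 for combinations that are not supported. */
static int
vlan_case(uint64_t ol_flags)
{
	uint64_t f = ol_flags & (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED |
				 PKT_RX_QINQ | PKT_RX_QINQ_STRIPPED);

	if (f == 0)
		return 1; /* no offload */
	if (f == PKT_RX_VLAN)
		return 2; /* outer tci saved, packet data unmodified */
	if (f == (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED))
		return 3; /* outer vlan stripped */
	if (f == (PKT_RX_VLAN | PKT_RX_QINQ))
		return 4; /* both tci saved, packet data unmodified */
	if (f == (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED | PKT_RX_QINQ))
		return 5; /* inner vlan stripped */
	if (f == (PKT_RX_VLAN | PKT_RX_VLAN_STRIPPED |
		  PKT_RX_QINQ | PKT_RX_QINQ_STRIPPED))
		return 6; /* both vlans stripped */
	if (f == (PKT_RX_VLAN | PKT_RX_QINQ | PKT_RX_QINQ_STRIPPED))
		return 7; /* proposed: only outer vlan stripped */
	return 0;
}
```

Note that with this reading, PKT_RX_QINQ_STRIPPED alone (without PKT_RX_VLAN and PKT_RX_QINQ) remains an unsupported combination.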


> BTW, it is better to make summary human readable and avoid
> PKT_RX_QINQ_STRIPPED (I guess check-git-log.sh yells on it).
> 
> Also RFC patch cannot be applied, non-RFC version is required.
> 
> CC main tree maintainers.
> 
> > Thanks
> > 
> > On Mon, Dec 16, 2019 at 2:43 PM Andrew Rybchenko
> >  wrote:
> >>
> >> On 12/16/19 11:47 AM, Somnath Kotur wrote:
> >>> On Mon, Dec 16, 2019 at 12:01 PM Andrew Rybchenko
> >>>  wrote:
> 
>  On 12/16/19 6:16 AM, Somnath Kotur wrote:
> > Certain hardware may be able to strip and/or save only the outermost
> > VLAN instead of both the VLANs in the mbuf in a QinQ scenario.
> > To handle such cases, we could re-interpret setting of just
> > PKT_RX_QINQ_STRIPPED to indicate that only the outermost VLAN has been
> > stripped by the hardware and saved in mbuf->vlan_tci_outer.
> > Only when both PKT_RX_QINQ_STRIPPED and PKT_RX_VLAN_STRIPPED are set,
> > the 2 vlans have been stripped by the hardware and their tci are saved
> > in mbuf->vlan_tci (inner) and mbuf->vlan_tci_outer (outer).
> >
> > Signed-off-by: Somnath Kotur 
> > Signed-off-by: JP Lee 
> > ---
> >  lib/librte_mbuf/rte_mbuf_core.h | 15 +++
> >  1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf_core.h b/lib/librte_mbuf/rte_mbuf_core.h
> > index 9a8557d..db1070b 100644
> > --- a/lib/librte_mbuf/rte_mbuf_core.h
> > +++ b/lib/librte_mbuf/rte_mbuf_core.h
> > @@ -124,12 +124,19 @@
> >  #define PKT_RX_FDIR_FLX  (1ULL << 14)
> >
> >  /**
> > - * The 2 vlans have been stripped by the hardware and their tci are
> > - * saved in mbuf->vlan_tci (inner) and mbuf->vlan_tci_outer (outer).
> > + * The outer vlan has been stripped by the hardware and its tci is
> > + * saved in mbuf->vlan_tci_outer (outer).
> >   * This can only happen if vlan stripping is enabled in the RX
> >   * configuration of the PMD.
> > - * When PKT_RX_QINQ_STRIPPED is set, the flags (PKT_RX_VLAN |
> > - * PKT_RX_VLAN_STRIPPED | PKT_RX_QINQ) must also be set.
> > + * When PKT_RX_QINQ_STRIPPED is set, the flags (PKT_RX_VLAN | PKT_RX_QINQ)
> > + * must also be set.
> > + * When both PKT_RX_QINQ_STRIPPED and PKT_RX_VLAN_STRIPPED are set, the
> > + * 2 vlans have been stripped by the hardware and their tci are saved in
> > + * mbuf->vlan_tci (inner) and mbuf->vlan_tci_outer (outer).
> > + * This can only happen if vlan stripping is enabled in the RX configuration
> > + * of the PMD.
> > + * When PKT_RX_QINQ_STRIPPED and PKT_RX_VLAN_STRIPPED are set,
> > + * (PKT_RX_VLAN | PKT_RX_QINQ) must also be set.
> >   */
> >  #define PKT_RX_QINQ_STRIPPED (1ULL << 15)
> >
> 
> 

Re: [dpdk-dev] [PATCH] mempool: fix mempool obj alignment for non x86

2019-12-27 Thread Olivier Matz
Hi,

On Sat, Dec 21, 2019 at 10:36:15AM +0530, Jerin Jacob wrote:
> On Sat, Dec 21, 2019 at 2:37 AM Honnappa Nagarahalli
>  wrote:
> >
> > 
> > > > > From: Jerin Jacob 
> > > > >
> > > > > The exiting optimize_object_size() function address the memory
> > > > > object alignment constraint on x86 for better performance.
> > > > >
> > > > > Different (Mirco) architecture may have different memory alignment
> > > > > constraint for better performance and it not same as the existing
> > > > > optimize_object_size() function. Some use, XOR(kind of CRC) scheme
> > > > > to enable DRAM channel distribution based on the address and some
> > > > > may have a different formula.

typo: Mirco -> Micro
Maybe the whole sentence can be reworded a bit (I think a word is missing).

> > > > If I understand correctly, address interleaving is the characteristic
> > > > of the memory controller and not the CPU.
> > > > For ex: different SoCs using the same Arm architecture might have
> > > > different memory controllers. So, the solution should not be
> > > > architecture specific, but SoC specific.
> > >
> > > Yes.  See below.
> > >
> > > > > -static unsigned optimize_object_size(unsigned obj_size)
> > > > > +static unsigned
> > > > > +arch_mem_object_align(unsigned obj_size)
> > > > >  {
> > > > >   unsigned nrank, nchan;
> > > > >   unsigned new_obj_size;
> > > > > @@ -99,6 +101,13 @@ static unsigned optimize_object_size(unsigned obj_size)
> > > > >   new_obj_size++;
> > > > >   return new_obj_size * RTE_MEMPOOL_ALIGN;
> > > > >  }
> > > > > +#else
> > > > This applies to all Arm (PPC as well) SoCs which might have different
> > > > schemes depending on the memory controller. IMO, this should not be
> > > > architecture specific.
> > >
> > > I agree in principle.
> > > I will summarize the
> > > https://www.mail-archive.com/dev@dpdk.org/msg149157.html feedback:
> > >
> > > 1) For x86 arch, it is architecture-specific
> > > 2) For power PC arch, It is architecture-specific
> > > 3) For the ARM case, it will be the memory controller specific.
> > > 4) For the ARM case, The memory controller is not using the existing
> > > x86 arch formula.
> > > 5) If it is memory/arch-specific, Can userspace code find the optimal
> > > alignment? In the case of octeontx2/arm64, the memory controller does  XOR
> > > on PA address which userspace code doesn't have much control.
> > >
> > > This patch address the known case of (1), (2),  (4) and (5). (2) can be 
> > > added to
> > > this framework when POWER9 folks want it.
> > >
> > > We can extend this patch to address (3) if there is a case. Without the
> > > actual requirement (if someone can share the formula of alignment which
> > > is memory controller specific and does not come under (4)), then we can
> > > create an extra layer for the memory controller and abstraction to probe it.
> > > Again there is no standard way of probing the memory controller in
> > > userspace and we need platform #define, which won't work for distribution
> > > build.
> > > So solution needs to be arch-specific and then fine-tune to memory 
> > > controller
> > > if possible.
> > >
> > > I can work on creating an extra layer of code if some can provide the 
> > > details
> > > of the memory controller and probing mechanism or this patch be extended
> > Inputs for BlueField, DPAAx, ThunderX2 would be helpful.
> 
> Yes. Probably memory controller used in n1sdp SoC also.
> 
> >
> > > to support such case if it arises in future.
> > >
> > > Thoughts?
> > How much memory will this save for your platform? Is it affecting 
> > performance?

Currently, I think Arm-based architectures use the default (nchan=4,
nrank=1). The worst case is for object whose size (including mempool
header) is 2 cache lines, where it is optimized to 3 cache lines (+50%).

Examples for cache line size = 64:
  orig     optimized
  64    -> 64     +0%
  128   -> 192    +50%
  192   -> 192    +0%
  256   -> 320    +25%
  320   -> 320    +0%
  384   -> 448    +16%
  ...
  2304  -> 2368   +2.7%  (~mbuf size)
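These numbers can be reproduced with a standalone sketch of the existing gcd-based padding logic, with the defaults hard-coded (nchan=4, nrank=1) and RTE_MEMPOOL_ALIGN taken as 64. This mirrors the shape of the mempool code being discussed, but simplified for illustration:

```c
/* RTE_MEMPOOL_ALIGN on targets with 64-byte cache lines (assumption) */
#define MEMPOOL_ALIGN 64u

static unsigned int
gcd(unsigned int a, unsigned int b)
{
	while (b != 0) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Pad the object size (counted in cache lines) until it is coprime with
 * nchan * nrank, so that consecutive objects spread over all DRAM
 * channels/ranks rather than always hitting the same one. */
static unsigned int
optimize_object_size(unsigned int obj_size)
{
	unsigned int nchan = 4, nrank = 1; /* defaults used on Arm today */
	unsigned int new_obj_size =
		(obj_size + MEMPOOL_ALIGN - 1) / MEMPOOL_ALIGN;

	while (gcd(new_obj_size, nrank * nchan) != 1)
		new_obj_size++;
	return new_obj_size * MEMPOOL_ALIGN;
}
```

For a 128-byte object (2 cache lines), gcd(2, 4) != 1, so it is padded to 3 cache lines (192 bytes), the +50% worst case mentioned above.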

> No performance difference.
> 
> The existing code adding the tailer for each objs.
> Additional space/Trailer space will be function of number of objects
> in mempool  and its obj_size, its alignment and selected
> rte_memory_get_nchannel() and rte_memory_get_nrank()
> 
> I will wait for inputs from Bluefield, DPAAx, ThunderX2 and n1sdp(if
> any) for any rework on the patch.

If there is no performance impact on other supported Arm-based
architectures, I think it is a step in the right direction.

> > > > > +static unsigned
> > > > > +arch_mem_object_align(unsigned obj_size) {
> > > > > + return obj_size;
> > > > > +}
> > > > > +#endif

I'd prefer "unsigned int" for new code.
Also, the opening brace should be on a separate line.

The documentation of the MEMPOOL_F_NO_SPREAD flag in the .h could be
slightly modified, as you did for the comment above.

Re: [dpdk-dev] [PATCH v2] mbuf: display more fields in dump

2019-12-27 Thread Stephen Hemminger
On Fri, 27 Dec 2019 10:10:18 +0100
Olivier Matz  wrote:

> Hi,
> 
> On Thu, Dec 26, 2019 at 08:58:51AM -0800, Stephen Hemminger wrote:
> > On Thu, 26 Dec 2019 17:15:39 +0100
> > Olivier Matz  wrote:
> >   
> > > Hi,
> > > 
> > > On Thu, Nov 21, 2019 at 10:30:55AM -0800, Stephen Hemminger wrote:  
> > > > The rte_pktmbuf_dump should display offset, refcount, and vlan
> > > > info since these are often useful during debugging.
> > > > 
> > > > Signed-off-by: Stephen Hemminger 
> > > > ---
> > > > v2 - remove casts, change in_port to port
> > > >  the refcount and offset are property of per-segment
> > > > 
> > > >  lib/librte_mbuf/rte_mbuf.c | 17 ++---
> > > >  1 file changed, 10 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> > > > index 35df1c4c38a5..4894d46628e3 100644
> > > > --- a/lib/librte_mbuf/rte_mbuf.c
> > > > +++ b/lib/librte_mbuf/rte_mbuf.c
> > > > @@ -473,18 +473,21 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
> > > >  
> > > > 	__rte_mbuf_sanity_check(m, 1);
> > > >  
> > > > -	fprintf(f, "dump mbuf at %p, iova=%"PRIx64", buf_len=%u\n",
> > > > -	       m, (uint64_t)m->buf_iova, (unsigned)m->buf_len);
> > > > -	fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx64", nb_segs=%u, "
> > > > -	       "in_port=%u\n", m->pkt_len, m->ol_flags,
> > > > -	       (unsigned)m->nb_segs, (unsigned)m->port);
> > > > +	fprintf(f, "dump mbuf at %p, iova=%#"PRIx64", buf_len=%u\n",
> > > > +		m, m->buf_iova, m->buf_len);
> > > > +	fprintf(f,
> > > > +		"  pkt_len=%u, ol_flags=%#"PRIx64", nb_segs=%u, port=%u, vlan_tci=%#x\n",
> > > > +		m->pkt_len, m->ol_flags, m->nb_segs, m->port, m->vlan_tci);
> > > > +
> > > > 	nb_segs = m->nb_segs;
> > > >  
> > > > 	while (m && nb_segs != 0) {
> > > > 		__rte_mbuf_sanity_check(m, 0);
> > > >  
> > > > -		fprintf(f, "  segment at %p, data=%p, data_len=%u\n",
> > > > -			m, rte_pktmbuf_mtod(m, void *), (unsigned)m->data_len);
> > > > +		fprintf(f, "  segment at %p, data=%p, len=%u, off=%u, refcnt=%u\n",
> > > > +			m, rte_pktmbuf_mtod(m, void *),
> > > > +			m->data_len, m->data_off, rte_mbuf_refcnt_read(m));
> > > > +
> > > > 		len = dump_len;
> > > > 		if (len > m->data_len)
> > > > 			len = m->data_len;
> > > > -- 
> > > > 2.20.1
> > > > 
> > > > -- 
> > > > 2.20.1
> > > > 
> > > 
> > > Thanks for this enhancement.
> > > 
> > > One comment however: I don't see why vlan_tci would be important than
> > > packet type or rss tag. It is also a bit confusing to display a field
> > > which can have a random value (if the vlan flag is not set in ol_flags).
> > > 
> > > I'd prefer to dump vlan_tci only if the flag is present. Later we could
> > > add more fields with the same logic. What do you think?  
> > 
> > Because this is a debugging hook and easier to always display the value.
> > Thats all.  
> 
> This is precisely because it is a debug hook that I find dangerous to
> display a wrong value. As the mbuf are recycled, a mbuf that does not
> have a vlan_tci can display its previous value, which may look valid.
> 
> What about simply doing this?
> 
>   fprintf(f,
>   "  pkt_len=%u, ol_flags=%#"PRIx64", nb_segs=%u, port=%u",
>   m->pkt_len, m->ol_flags, m->nb_segs, m->port);
>   if (m->ol_flags & PKT_RX_VLAN)
>   fprintf(f, ", vlan_tci=%#x", m->vlan_tci);
>   fprintf(f, "\n");
> 
> If you want I can submit an updated version of your patch

The right flags are more complex: you need to deal with TX vlan as well.
And PKT_RX_VLAN means "this is a vlan packet"; it doesn't mean the vlan
tag has been stripped into the TCI. That is PKT_RX_VLAN_STRIPPED.
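A sketch of what such a flag-aware condition could look like. The flag values below mirror rte_mbuf_core.h (DPDK 19.11), and the helper `format_vlan()` is hypothetical; the exact set of flags that makes vlan_tci meaningful (RX stripped vs. TX insert) is precisely what is being debated here:

```c
#include <stdint.h>
#include <stdio.h>

/* Values as in rte_mbuf_core.h (DPDK 19.11), redefined so the sketch is
 * self-contained. */
#define PKT_RX_VLAN_STRIPPED (1ULL << 6)
#define PKT_TX_VLAN          (1ULL << 57)

/* Hypothetical helper: format vlan_tci only when it is known to be
 * meaningful, either stripped on RX or queued for insertion on TX.
 * Returns the number of characters written, 0 when the field is skipped. */
static int
format_vlan(char *buf, size_t len, uint64_t ol_flags, uint16_t vlan_tci)
{
	if (ol_flags & (PKT_RX_VLAN_STRIPPED | PKT_TX_VLAN))
		return snprintf(buf, len, ", vlan_tci=%#x", vlan_tci);
	/* the tci field of a recycled mbuf may hold a stale value:
	 * don't show it */
	buf[0] = '\0';
	return 0;
}
```

This keeps the dump honest at the cost of one extra branch: a stale tci from a recycled mbuf is never printed as if it were valid.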