Re: [dpdk-dev] [PATCH v2] vhost: fix packed ring zero-copy

2020-02-24 Thread Maxime Coquelin
Hi David & Thomas,

On 2/24/20 4:14 PM, Marvin Liu wrote:
> Available buffer ID should be stored in the zmbuf in the packed-ring
> dequeue path. There's no guarantee that local queue avail index is
> equal to buffer ID.
> 
> Fixes: d1eafb532268 ("vhost: add packed ring zcopy batch and single dequeue")
> Cc: sta...@dpdk.org
> 
> Signed-off-by: Marvin Liu 
> Reported-by: Yinan Wang 
> Reviewed-by: Maxime Coquelin 
> 

If it is not too late, I think we should pick this patch for
v20.02. It is fixing a regression introduced in DPDK v19.11.

The risk of this patch causing a regression is very low, as it is
touching only the zero-copy packed ring dequeue path, which without this
patch is anyway broken.

Thanks,
Maxime



Re: [dpdk-dev] [RFC]app/testpmd: time-consuming question of mlockall function execution

2020-02-24 Thread David Marchand
Hello,

On Mon, Feb 24, 2020 at 7:35 AM humin (Q)  wrote:
> We found that if the OS transparent hugepage setting is anything other
> than 'always', the mlockall() call in the main function of testpmd takes
> more than 25s to execute.
> The results are the same on both x86 and ARM. This is unreasonably slow
> and a real problem. The transparent hugepage setting of the OS can be
> viewed with the following command.
> [root@X6000-C23-1U dpdk]#cat /sys/kernel/mm/transparent_hugepage/enabled
> always [madvise] never
>
> Transparent hugepage on ARM is configured as 'madvise', 'never' or
> 'always' in turn; testpmd runs under strace as follows:
> *** Transparent hugepage is configured as 'madvise' ***
> [root@X6000-C23-1U dpdk]# strace -T -e trace=mlockall ./testpmd -l 1-4 -w :7d:01.0 --iova-mode=va -- -i
[snip]
> Interactive-mode selected
> mlockall(MCL_CURRENT|MCL_FUTURE)= 0 <25.736362>
> <-- Hang for 25 seconds
[snip]
>
> *** Transparent hugepage is configured as 'never' ***
> [root@X6000-C23-1U dpdk]# strace -T -e trace=mlockall ./testpmd -l 1-4 -w :7d:01.0 --iova-mode=va -- -i
[snip]
> Interactive-mode selected
> mlockall(MCL_CURRENT|MCL_FUTURE)= 0 <25.737757>
> <-- Hang for 25 seconds
[snip]
>
> *** Transparent hugepage is configured as 'always' ***
> [root@X6000-C23-1U dpdk]# strace -T -e trace=mlockall testpmd -l 1-4 -w :7d:01.0 --iova-mode=va -- -i
> strace: Can't stat 'testpmd': No such file or directory
> [root@X6000-C23-1U dpdk]# strace -T -e trace=mlockall ./testpmd -l 1-4 -w :7d:01.0 --iova-mode=va -- -i
[snip]
> Interactive-mode selected
> mlockall(MCL_CURRENT|MCL_FUTURE)= 0 <0.208571>
> <-- No Hang
[snip]

>
> We have also seen some discussions on this issue in following page:
> https://bugzilla.redhat.com/show_bug.cgi?id=1786923
>
> David Marchand has a related patch, as following page:
> https://github.com/david-marchand/dpdk/commit/f9e1b9fa101c9f4f16c0717401a55790aecc6484
> but this patch doesn't seem to have been submitted to the community.

Yes, this is not ready; I have worked on it locally since then (and the
last version is not pushed to my github).
The main problem is that I have so far not been able to reproduce the
issue reported by Eelco that justified the addition of mlockall().


> Transparent hugepage on ARM is configured as 'madvise' or 'never';
> testpmd runs under strace as follows:
> ***
> [root@X6000-C23-1U app]# strace -T -e trace=mlockall ./testpmd -l 1-4 -w :7d:01.0 --iova-mode=va -- -i
[snip]
> Interactive-mode selected
> mlockall(MCL_CURRENT|MCL_FUTURE|MCL_ONFAULT) = 0 <1.955947>
> <-- Hang for less than 2 seconds
> testpmd: create a new mbuf pool : n=171456, size=2176, 
> socket=0
> testpmd: preferred mempool ops selected: ring_mp_mc Done
> testpmd> quit
>
> Bye...
> +++ exited with 0 +++
>
> We'd like to know the current status of this issue in the DPDK
> community. Thanks.

There is also a report about coredumps being huge.
On this topic, there might be something to do with madvise + MADV_DONTDUMP.
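
For reference, a minimal sketch of the MCL_ONFAULT direction that shows
up in the strace output above (assuming Linux >= 4.4 and a libc that
exposes MCL_ONFAULT; this is only an illustration, not the actual patch):

    #include <sys/mman.h>
    #include <stdio.h>

    static int
    lock_pages(void)
    {
        /* MCL_ONFAULT locks pages only once they are touched, so the
         * large, mostly unused virtual address space reserved by the
         * memory allocator is not faulted in up front. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0) {
            perror("mlockall");
            return -1;
        }
        return 0;
    }

Similarly, madvise(addr, len, MADV_DONTDUMP) on the unused segments
could be the direction for the coredump size issue.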


I see various options from the hardest to the easiest:
- drop multiprocess support,
- rework the memory allocator so that this kind of side effect
(mapping a huge number of unused pages) is not hit,
- change testpmd so that it locks only the pages of interest (this is
the patch I have been looking at so far),
- change testpmd so that it does not call mlockall by default.

The last option is likely the quickest workaround.

I write "options", but the last two points feel like "band aid" fixes.
And it does not solve the problem for other applications that will
have to implement similar workarounds.

Anatoly warned that touching the memory allocator is going to be hell.
Quite sad to reach this state because it feels like people are just
starting to hit the changes that entered dpdk 18.05.


Of course, other ideas welcome!

--
David Marchand



Re: [dpdk-dev] [PATCH] doc: deprecate using MAX values as array size

2020-02-24 Thread Andrew Rybchenko
On 2/21/20 1:25 PM, Ferruh Yigit wrote:
> On 1/30/2020 2:20 PM, Ferruh Yigit wrote:
>> Adding the deprecation notice as reminder for next ABI breakage release
>> (20.11).
>> This one time breakage is required to be able to extend enum/define
>> without breaking ABI.
>>
>> Signed-off-by: Ferruh Yigit 
>> ---
>>  doc/guides/rel_notes/deprecation.rst | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/doc/guides/rel_notes/deprecation.rst 
>> b/doc/guides/rel_notes/deprecation.rst
>> index dfcca87ab..99d81564a 100644
>> --- a/doc/guides/rel_notes/deprecation.rst
>> +++ b/doc/guides/rel_notes/deprecation.rst
>> @@ -38,6 +38,20 @@ Deprecation Notices
>>remove it from the externally visible ABI and allow it to be updated in 
>> the
>>future.
>>  
>> +* lib: will fix extending some enum/define breaking the ABI. There are 
>> multiple
>> +  samples in DPDK that enum/define terminated with a ``.*MAX.*`` value 
>> which is
>> +  used by iterators, and arrays holding these values are sized with this
>> +  ``.*MAX.*`` value. So extending this enum/define increases the ``.*MAX.*``
>> +  value which increases the size of the array and depending on how/where the
>> +  array is used this may break the ABI.
>> +  ``RTE_ETH_FLOW_MAX`` is one sample of the mentioned case, adding a new 
>> flow
>> +  type will break the ABI because of ``flex_mask[RTE_ETH_FLOW_MAX]`` array
>> +  usage in following public struct hierarchy:
>> +  ``rte_eth_fdir_flex_conf -> rte_fdir_conf -> rte_eth_conf (in the 
>> middle)``.
>> +  Need to identify this kind of usages and fix in 20.11, otherwise this 
>> blocks
>> +  us extending existing enum/define.
>> +  One solution can be using a fixed size array instead of ``.*MAX.*`` value.
>> +
>>  * dpaa2: removal of ``rte_dpaa2_memsegs`` structure which has been replaced
>>by a pa-va search library. This structure was earlier being used for 
>> holding
>>memory segments used by dpaa2 driver for faster pa->va translation. This
>>
> Reminder of this deprecation notice.

Acked-by: Andrew Rybchenko 
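
To make the problem concrete, here is a minimal sketch of the pattern
described in the notice (hypothetical names, not an actual DPDK
structure):

    #include <stdint.h>

    /* public header, version N */
    enum rte_example_flow {
        EXAMPLE_FLOW_A,
        EXAMPLE_FLOW_B,
        EXAMPLE_FLOW_MAX    /* used as iterator bound and array size */
    };

    struct rte_example_conf {
        uint64_t flex_mask[EXAMPLE_FLOW_MAX]; /* size baked into the ABI */
    };

    /*
     * Version N+1: adding EXAMPLE_FLOW_C before EXAMPLE_FLOW_MAX grows
     * the array and therefore struct rte_example_conf, breaking
     * applications built against version N that pass this struct to the
     * library. The suggested fix is to size the array with a constant
     * decoupled from the enum, e.g. uint64_t flex_mask[64].
     */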




Re: [dpdk-dev] [PATCH] doc: fix naming of Mellanox devices

2020-02-24 Thread Slava Ovsiienko
> -Original Message-
> From: Thomas Monjalon 
> Sent: Friday, February 21, 2020 16:59
> To: dev@dpdk.org
> Cc: Gal Cohen ; Matan Azrad ;
> Shahaf Shuler ; Slava Ovsiienko
> ; John McNamara ;
> Marko Kovacevic 
> Subject: [PATCH] doc: fix naming of Mellanox devices
> 
> The devices of the family ConnectX may have two letters as suffix.
> Such suffix is preceded with a space and the second x is lowercase:
> - ConnectX-4 Lx
> - ConnectX-5 Ex
> - ConnectX-6 Dx
> 
> Uppercase of the device family name BlueField is also fixed.
> 
> Signed-off-by: Thomas Monjalon 
Acked-by: Viacheslav Ovsiienko 

> ---
>  config/common_base |  4 ++--
>  doc/guides/nics/mlx5.rst   | 24 
>  doc/guides/rel_notes/release_18_08.rst |  2 +-
> doc/guides/rel_notes/release_19_02.rst |  2 +-
> doc/guides/rel_notes/release_19_11.rst |  2 +-
>  doc/guides/vdpadevs/mlx5.rst   |  6 +++---
>  drivers/net/mlx5/mlx5.c|  2 +-
>  drivers/net/mlx5/mlx5_rxtx.c   |  2 +-
>  drivers/net/mlx5/mlx5_txq.c|  2 +-
>  9 files changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/config/common_base b/config/common_base index
> 6ea9c63cc3..7ca2f28b19 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -361,13 +361,13 @@ CONFIG_RTE_LIBRTE_MLX4_DEBUG=n
> 
>  #
>  # Compile burst-oriented Mellanox ConnectX-4, ConnectX-5, -# ConnectX-6 &
> Bluefield (MLX5) PMD
> +# ConnectX-6 & BlueField (MLX5) PMD
>  #
>  CONFIG_RTE_LIBRTE_MLX5_PMD=n
>  CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
> 
>  #
> -# Compile vdpa-oriented Mellanox ConnectX-6 & Bluefield (MLX5) PMD
> +# Compile vdpa-oriented Mellanox ConnectX-6 & BlueField (MLX5) PMD
>  #
>  CONFIG_RTE_LIBRTE_MLX5_VDPA_PMD=n
> 
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst index
> 5ab7c07165..748a87d8f3 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -7,7 +7,7 @@ MLX5 poll mode driver
> 
>  The MLX5 poll mode driver library (**librte_pmd_mlx5**) provides support
> for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx** , **Mellanox -
> ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6DX** and
> +ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx** and
>  **Mellanox BlueField** families of 10/25/40/50/100/200 Gb/s adapters  as
> well as their virtual functions (VF) in SR-IOV context.
> 
> @@ -324,9 +324,9 @@ Run-time configuration
> 
>Supported on:
> 
> -  - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5, ConnectX-6,
> ConnectX-6 DX
> +  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6,
> + ConnectX-6 Dx
>  and BlueField.
> -  - POWER9 and ARMv8 with ConnectX-4 LX, ConnectX-5, ConnectX-6,
> ConnectX-6 DX
> +  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6,
> + ConnectX-6 Dx
>  and BlueField.
> 
>  - ``rxq_cqe_pad_en`` parameter [int]
> @@ -357,9 +357,9 @@ Run-time configuration
> 
>Supported on:
> 
> -  - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5, ConnectX-6,
> ConnectX-6 DX
> +  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6,
> + ConnectX-6 Dx
>  and BlueField.
> -  - POWER8 and ARMv8 with ConnectX-4 LX, ConnectX-5, ConnectX-6,
> ConnectX-6 DX
> +  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6,
> + ConnectX-6 Dx
>  and BlueField.
> 
>  - ``mprq_en`` parameter [int]
> @@ -462,14 +462,14 @@ Run-time configuration
>If ``txq_inline_min`` key is not present, the value may be queried by the
>driver from the NIC via DevX if this feature is available. If there is no 
> DevX
>enabled/supported the value 18 (supposing L2 header including VLAN) is set
> -  for ConnectX-4 and ConnectX-4LX, and 0 is set by default for ConnectX-5
> +  for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for
> + ConnectX-5
>and newer NICs. If packet is shorter the ``txq_inline_min`` value, the 
> entire
>packet is inlined.
> 
>For ConnectX-4 NIC, driver does not allow specifying value below 18
>(minimal L2 header, including VLAN), error will be raised.
> 
> -  For ConnectX-4LX NIC, it is allowed to specify values below 18, but
> +  For ConnectX-4 Lx NIC, it is allowed to specify values below 18, but
>it is not recommended and may prevent NIC from sending packets over
>some configurations.
> 
> @@ -552,7 +552,7 @@ Run-time configuration
>  - ``txq_mpw_en`` parameter [int]
> 
>A nonzero value enables Enhanced Multi-Packet Write (eMPW) for
> ConnectX-5,
> -  ConnectX-6, ConnectX-6 DX and BlueField. eMPW allows the TX burst
> function to pack
> +  ConnectX-6, ConnectX-6 Dx and BlueField. eMPW allows the TX burst
> + function to pack
>up multiple packets in a single descriptor session in order to save PCI
> bandwidth
>and improve performance at the cost of a slightly higher CPU usage. When
>``txq_inline_mpw`` is set along with ``txq_mpw_en``, TX burst function
> copies @@ -598,7 +598,7 @@ Run-time configur

Re: [dpdk-dev] ABI version of experimental libraries

2020-02-24 Thread Ray Kinsella



On 21/02/2020 16:57, Thomas Monjalon wrote:
> 19/02/2020 14:50, Ray Kinsella:
>> On 19/02/2020 12:43, Thomas Monjalon wrote:
>>> 19/02/2020 12:43, Neil Horman:
 On Tue, Feb 18, 2020 at 10:50:09AM +0100, Thomas Monjalon wrote:
> 18/02/2020 10:42, Bruce Richardson:
>> On Tue, Feb 18, 2020 at 12:15:56AM +0100, Thomas Monjalon wrote:
>>> Hi,
>>>
>>> I would like to remind everybody our mistake when defining ABI versions.
>>> It has been "fixed" in this commit:
>>> http://git.dpdk.org/dpdk/commit/?id=f26c2b39
>>>
>>> Please let's think about the consequence for the experimental libraries.
>>>
>>> In DPDK 19.11, we use the ABI version 0.200 with soname 0.20.
>>> In DPDK 20.02, we use the ABI version 0.2001 with soname 0.201.
>>> Numbers are increasing, that's fine.
>>> When we switch to the new major ABI and use a normal numbering,
>>> in DPDK 20.11 we will use the ABI version 0.210 with soname 0.21.
>>> Numbers are dropping.
>>>
>>> In short, for experimental libs, ABI 20.1 > ABI 21.0
>>>
>>> Are we OK with this or do we prefer reverting to normal numbering for
>>> experimental libraries in DPDK 20.02?
>>>
>> Personally, I would not be too concerned about the versions of
>> experimental libs, so long as they don't conflict across versions and
>> have some similarity to the major ABI version for the release.
>
> You think sorting of the version numbers is not important?
> If we don't care about comparing experimental version numbers,
> then OK, let's drop this patch. But please, we need a small vote.
>
> Note: there would be no problem if we did not vote for having
> a special numbering for pure experimental libraries (I am still against).
>
 I don't understand.  Why would we change the ABI_VERSION at all in an
 LTS release?  This operation is meant to take an experimental API and
 mark it as stable by promoting its version number to the next major
 release's number.  As such, in the LTS release, we should keep the
 soname the same, as there should be no other ABI changes in the
 promoted API.
>>>
>>> The library version number is updated because we add new symbols.
>>>
>>>
>>
>> So while experimental library version numbers are not "important",
>> I do agree with Thomas that they should be sane, increase
>> monotonically and have a consistent format.
>>
>> Should we always just pad them to 4 places as the simple solution?
>> i.e.
>>
>> DPDK 19.11 ... 0.20 (needs to remain 0.20).
>> DPDK 20.02 ... 0.2001
>> DPDK 20.11 ... 0.2100
>> DPDK 21.02 ... 0.2101 
> 
> A patch from Ferruh got merged.
> It is adding a dot to keep versioning consistent.
> 
> Marking this patch as rejected.
> 

Ferruh's was a better solution.  


Re: [dpdk-dev] [PATCH v2] vhost: fix packed ring zero-copy

2020-02-24 Thread David Marchand
On Mon, Feb 24, 2020 at 9:28 AM Maxime Coquelin
 wrote:
>
> Hi David & Thomas,
>
> On 2/24/20 4:14 PM, Marvin Liu wrote:
> > Available buffer ID should be stored in the zmbuf in the packed-ring
> > dequeue path. There's no guarantee that local queue avail index is
> > equal to buffer ID.
> >
> > Fixes: d1eafb532268 ("vhost: add packed ring zcopy batch and single 
> > dequeue")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Marvin Liu 
> > Reported-by: Yinan Wang 
> > Reviewed-by: Maxime Coquelin 
> >
>
> If it is not too late, I think we should pick this patch for
> v20.02. It is fixing a regression introduced in DPDK v19.11.

I might have cold feet, but taking this fix right now feels risky.
If the problem has been there since 19.11, it can wait 20.05 and it
will go to 19.11 after proper validation.


-- 
David Marchand



Re: [dpdk-dev] [RFC PATCH 0/5] graph: introduce graph subsystem

2020-02-24 Thread Ray Kinsella



On 22/02/2020 10:24, Jerin Jacob wrote:
> On Sat, Feb 22, 2020 at 3:23 PM Thomas Monjalon  wrote:
>>
>> 22/02/2020 10:05, Jerin Jacob:
>>> On Fri, Feb 21, 2020 at 9:44 PM Thomas Monjalon  wrote:
 21/02/2020 16:56, Jerin Jacob:
> On Fri, Feb 21, 2020 at 4:40 PM Thomas Monjalon  
> wrote:
>> 21/02/2020 11:30, Jerin Jacob:
>>> On Mon, Feb 17, 2020 at 4:28 PM Jerin Jacob  
>>> wrote:
 On Mon, Feb 17, 2020 at 2:08 PM Thomas Monjalon  
 wrote:
> If we add rte_graph to DPDK, we will have 2 similar libraries.
>
> I already proposed several times to move rte_pipeline in a separate
> repository for two reasons:
> 1/ it is acting at a higher API layer level

 We need to define what is the higher layer API. Is it processing 
 beyond L2?
>>
>> My opinion is that any API which is implemented differently
>> for different hardware should be in DPDK.
>
> We need to define the treatment of SIMD optimization (not HW-specific
> but architecture-specific) as well, as the graph and node library will
> have SIMD optimizations too.

 I think SIMD optimization is generic to any performance-related project,
 not specific to DPDK.


> In general, by the above policy enforced, we need to split DPDK like 
> below,
> dpdk.git
> --
> librte_compressdev
> librte_bbdev
> librte_eventdev
> librte_pci
> librte_rawdev
> librte_eal
> librte_security
> librte_mempool
> librte_mbuf
> librte_cryptodev
> librte_ethdev
>
> other repo(s).
> 
> librte_cmdline
> librte_cfgfile
> librte_bitratestats
> librte_efd
> librte_latencystats
> librte_kvargs
> librte_jobstats
> librte_gso
> librte_gro
> librte_flow_classify
> librte_pipeline
> librte_net
> librte_metrics
> librte_meter
> librte_member
> librte_table
> librte_stack
> librte_sched
> librte_rib
> librte_reorder
> librte_rcu
> librte_power
> librte_distributor
> librte_bpf
> librte_ip_frag
> librte_hash
> librte_fib
> librte_timer
> librte_telemetry
> librte_port
> librte_pdump
> librte_kni
> librte_acl
> librte_vhost
> librte_ring
> librte_lpm
> librte_ipsec

 I think it is a fair conclusion of the scope I am arguing, yes.
>>>
>>> OK. See below.
>>>
>> What is expected to be maintained, tested, etc.
>
> We need to maintain and test other code in OTHER dpdk repo as well.

 Yes but the ones responsible are not the same.
>>>
>>> I see your point. Can I interpret it as you would like to NOT take
>>> responsibility for the SW libraries (items enumerated in the second
>>> list)?
>>
>> It's not only about me. This is a community decision.
> 
> OK. Let wait for community feedback.
> Probably we discuss more in public TB meeting in 26th Feb.
> 
>>
>>
>>> I think the main question would be how it will be delivered to
>>> distros and/or end-users,
>>> and what will be part of the dpdk release?
>>>
>>> I can think of two options. Maybe distro folks have better view on this.
>>>
>>> option 1:
>>> - Split dpdk to dpdk-core.git, dpdk-algo.git etc based on the
>>> functionalities and maintainer's availability.
>>> - Follow existing release cadence and deliver single release tarball
>>> with content from the above repos.
>>>
>>> option 2:
>>> - Introduce more subtrees(dpdk-next-algo.git etc) based on the
>>> functionalities and maintainer's availability.
>>> - Follow existing release cadence and have a pull request to main
>>> dpdk.git just like Linux kernel or existing scheme of things.
>>>
>>> I am for option 2.
>>>
>>> NOTE: For this new graph and node library, I would like to make a new
>>> subtree in the existing scheme of
>>> things so that it will NOT be a burden for you to manage.
>>
>> Option 2 is to make maintainers' life easier.
>> Keeping all libraries in the same repository allows having a unique
>> release and a central place for the apps and docs.
>>
>> Option 1 may make contributors' life easier, if we consider that
>> adding new libraries can make contributions harder in case of dependencies.
>> Option 1 also makes repositories smaller, so maybe easier to approach.
>> It makes it easier to fully validate the testing and quality of a repository.
>> Having separate packages makes it easier to select what is distributed
>> and supported.
> 
> If the final dpdk release tarball looks the same for option 1 and
> option 2, then I think option 1 is just overhead to manage the
> dependencies across repos.
> 
> I agree with Thomas, it is better to decide as a community what
> direction we need to take and align existing and new libraries with
> that scheme.
> 

+1 to Option 2.
As Jerin points out, it has allowed other larger communities to scale 
effectively.

> 
>>
>> After years thinking about the scop

Re: [dpdk-dev] [PATCH v2] vhost: fix packed ring zero-copy

2020-02-24 Thread Maxime Coquelin



On 2/24/20 11:32 AM, David Marchand wrote:
> On Mon, Feb 24, 2020 at 9:28 AM Maxime Coquelin
>  wrote:
>>
>> Hi David & Thomas,
>>
>> On 2/24/20 4:14 PM, Marvin Liu wrote:
>>> Available buffer ID should be stored in the zmbuf in the packed-ring
>>> dequeue path. There's no guarantee that local queue avail index is
>>> equal to buffer ID.
>>>
>>> Fixes: d1eafb532268 ("vhost: add packed ring zcopy batch and single 
>>> dequeue")
>>> Cc: sta...@dpdk.org
>>>
>>> Signed-off-by: Marvin Liu 
>>> Reported-by: Yinan Wang 
>>> Reviewed-by: Maxime Coquelin 
>>>
>>
>> If it is not too late, I think we should pick this patch for
>> v20.02. It is fixing a regression introduced in DPDK v19.11.
> 
> I might have cold feet, but taking this fix right now feels risky.
> If the problem has been there since 19.11, it can wait 20.05 and it
> will go to 19.11 after proper validation.

Ok, I get your point and it's your call.  Now, this fix is really
isolated to zero-copy packed ring, so the only risk IMO is to break
something that is already not working.

Thanks,
Maxime



[dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Konstantin Ananyev
Upfront note - this RFC is not a complete patch.
It introduces an ABI breakage, plus it doesn't update the ring_elem
code properly, etc.
I plan to deal with all these things in later versions.
Right now I am seeking initial feedback about the proposed ideas.
I would also ask people to repeat the performance tests (see below)
on their platforms to confirm the impact.

More and more customers use (or try to use) DPDK-based apps within
overcommitted systems (multiple active threads over the same physical
cores): VM and container deployments, etc.
One quite common problem they hit is Lock-Holder-Preemption (LHP) with
rte_ring.
LHP is a common problem for spin-based sync primitives (spin-locks,
etc.) on overcommitted systems.
The situation gets much worse when some sort of fair-locking technique
is used (ticket-lock, etc.), as then not only the lock-owner but also
the lock-waiters' scheduling order matters a lot.
This is a well-known problem for kernels running within VMs:
http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
The problem with rte_ring is that while head acquisition is a sort of
unfair locking, waiting on the tail is very similar to a ticket-lock
scheme - the tail has to be updated in a particular order.
That makes the current rte_ring implementation perform really poorly in
some overcommitted scenarios.
While it is probably not possible to completely resolve this problem in
userspace only (without some kernel communication/intervention),
removing fairness in the tail update can mitigate it significantly.
So this RFC proposes two new optional ring synchronization modes:
1) Head/Tail Sync (HTS) mode
In that mode the enqueue/dequeue operation is fully serialized:
only one thread at a time is allowed to perform a given op.
As another enhancement, it provides the ability to split the
enqueue/dequeue operation into two phases:
  - enqueue/dequeue start
  - enqueue/dequeue finish
That allows the user to inspect objects in the ring without removing
them from it (aka MT-safe peek).
2) Relaxed Tail Sync (RTS) mode
The main difference from the original MP/MC algorithm is that
the tail value is increased not by every thread that finished
enqueue/dequeue, but only by the last one.
That allows threads to avoid spinning on the ring tail value,
leaving the actual tail value change to the last thread in the update
queue.
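
As a rough usage sketch for illustration (based only on the flags and
bulk calls introduced by the patches below; error handling trimmed),
an application opts into a mode at ring creation time and keeps using
the familiar bulk calls:

    #include <rte_ring.h>
    #include <rte_lcore.h>

    static void
    rts_ring_example(void)
    {
        void *objs[32];    /* would hold real object pointers */
        unsigned int free_space, avail;

        /* relaxed tail sync on both producer and consumer sides */
        struct rte_ring *r = rte_ring_create("rts_ring", 1024,
                rte_socket_id(),
                RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ);
        if (r == NULL)
            return;

        rte_ring_rts_enqueue_bulk(r, objs, RTE_DIM(objs), &free_space);
        rte_ring_rts_dequeue_bulk(r, objs, RTE_DIM(objs), &avail);
    }

HTS is selected the same way with RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ
and the corresponding rte_ring_hts_*_bulk() calls.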

Test results on IA (see below) show significant improvements
in average enqueue/dequeue op times on overcommitted systems.
For 'classic' DPDK deployments (one thread per core) the original MP/MC
algorithm still shows the best numbers, though for the 64-bit target
the RTS numbers are not that far away.
Numbers were produced by ring_stress_*autotest
(first patch in this series).

X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

MP/MC  HTS RTS
1thread@1core(--lcores=6-7) 8.00   8.158.99
2thread@2core(--lcores=6-8) 19.14  19.61   20.35
4thread@4core(--lcores=6-10)29.43  29.79   31.82
8thread@8core(--lcores=6-14)110.59 192.81  119.50
16thread@16core(--lcores=6-22)  461.03 813.12  495.59
32thread/@32core(--lcores='6-22,55-70') 982.90 1972.38 1160.51

2thread@1core(--lcores='6,(10-11)@7'20140.50   23.58   25.14
4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'  153680.60  76.88   80.05
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  280314.32  294.72  318.79
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 643176.59  1144.02 1175.14
32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80 4627.48 4892.68

8thread@2core(--lcores='6,(10-17)@(7,8))'   321085.98  298.59  307.47
16thread@4core(--lcores='6,(20-35)@(7-10))' 1900705.61 575.35  678.29
32thread@4core(--lcores='6,(20-51)@(7-10))' 5510445.85 2164.36 2714.12

i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

MP/MC  HTS RTS
1thread@1core(--lcores=6-7) 7.85   12.13   11.31
2thread@2core(--lcores=6-8) 17.89  24.52   21.86
8thread@8core(--lcores=6-14)32.58  354.20  54.58
32thread/@32core(--lcores='6-22,55-70') 813.77 6072.41 2169.91

2thread@1core(--lcores='6,(10-11)@7'16095.00   36.06   34.74
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54 346.61  361.57
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86 1314.90 1416.65

8thread@2core(--lcores='6,(10-17)@(7,8))'   594358.61  332.70  357.74
32thread@4core(--lcores='6,(20-51)@(7-10))' 5319896.86 2836.44 3028.87

Konstantin Ananyev (6):
  test/ring: add contention stress test
  ring: rework ring layout to allow new sync schemes
  ring: introduce RTS ring mode
  test/ring: add contention stress test for RTS ring
  ring: introduce HTS ring mode
  test/ring: add contention stress test

[dpdk-dev] [RFC 1/6] test/ring: add contention stress test

2020-02-24 Thread Konstantin Ananyev
Introduce a new test case to measure ring performance under contention
(multiple producers/consumers).
It starts a dequeue/enqueue loop on all available slave lcores.

Signed-off-by: Konstantin Ananyev 
---
 app/test/Makefile   |   1 +
 app/test/meson.build|   1 +
 app/test/test_ring_stress.c |  27 ++
 app/test/test_ring_stress.h | 477 
 4 files changed, 506 insertions(+)
 create mode 100644 app/test/test_ring_stress.c
 create mode 100644 app/test/test_ring_stress.h

diff --git a/app/test/Makefile b/app/test/Makefile
index 1f080d162..4f586d95f 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -78,6 +78,7 @@ SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_stress.c
 SRCS-y += test_pmd_perf.c
 
 ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index 0a2ce710f..84dde28ad 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ test_sources = files('commands.c',
'test_rib6.c',
'test_ring.c',
'test_ring_perf.c',
+   'test_ring_stress.c',
'test_rwlock.c',
'test_sched.c',
'test_service_cores.c',
diff --git a/app/test/test_ring_stress.c b/app/test/test_ring_stress.c
new file mode 100644
index 0..5689e06c8
--- /dev/null
+++ b/app/test/test_ring_stress.c
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "test_ring_stress.h"
+
+static inline uint32_t
+_st_ring_dequeue_bulk(struct rte_ring *r, void **obj, uint32_t n,
+   uint32_t *avail)
+{
+   return rte_ring_mc_dequeue_bulk(r, obj, n, avail);
+}
+
+static inline uint32_t
+_st_ring_enqueue_bulk(struct rte_ring *r, void * const *obj, uint32_t n,
+   uint32_t *free)
+{
+   return rte_ring_mp_enqueue_bulk(r, obj, n, free);
+}
+
+static int
+_st_ring_init(struct rte_ring *r, const char *name, uint32_t num)
+{
+   return rte_ring_init(r, name, num, 0);
+}
+
+REGISTER_TEST_COMMAND(ring_stress_autotest, test_ring_stress);
diff --git a/app/test/test_ring_stress.h b/app/test/test_ring_stress.h
new file mode 100644
index 0..c6f0bc9f1
--- /dev/null
+++ b/app/test/test_ring_stress.h
@@ -0,0 +1,477 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test.h"
+
+/*
+ * Measures performance of ring enqueue/dequeue under high contention
+ */
+
+#define RING_NAME  "RING_STRESS"
+#define BULK_NUM   32
+#define RING_SIZE  (2 * BULK_NUM * RTE_MAX_LCORE)
+
+enum {
+   WRK_CMD_STOP,
+   WRK_CMD_RUN,
+};
+
+static volatile uint32_t wrk_cmd __rte_cache_aligned;
+
+/* test run-time in seconds */
+static const uint32_t run_time = 60;
+
+struct lcore_stat {
+   uint64_t nb_cycle;
+   struct {
+   uint64_t nb_call;
+   uint64_t nb_obj;
+   uint64_t nb_cycle;
+   uint64_t max_cycle;
+   uint64_t min_cycle;
+   } op;
+};
+
+struct lcore_arg {
+   struct rte_ring *rng;
+   struct lcore_stat stats;
+} __rte_cache_aligned;
+
+struct ring_elem {
+   uint32_t cnt[RTE_CACHE_LINE_SIZE / sizeof(uint32_t)];
+} __rte_cache_aligned;
+
+/*
+ * redefinable functions
+ */
+static uint32_t
+_st_ring_dequeue_bulk(struct rte_ring *r, void **obj, uint32_t n,
+   uint32_t *avail);
+
+static uint32_t
+_st_ring_enqueue_bulk(struct rte_ring *r, void * const *obj, uint32_t n,
+   uint32_t *free);
+
+static int
+_st_ring_init(struct rte_ring *r, const char *name, uint32_t num);
+
+
+static void
+lcore_stat_update(struct lcore_stat *ls, uint64_t call, uint64_t obj,
+   uint64_t tm, int32_t prcs)
+{
+   ls->op.nb_call += call;
+   ls->op.nb_obj += obj;
+   ls->op.nb_cycle += tm;
+   if (prcs) {
+   ls->op.max_cycle = RTE_MAX(ls->op.max_cycle, tm);
+   ls->op.min_cycle = RTE_MIN(ls->op.min_cycle, tm);
+   }
+}
+
+static void
+lcore_op_stat_aggr(struct lcore_stat *ms, const struct lcore_stat *ls)
+{
+
+   ms->op.nb_call += ls->op.nb_call;
+   ms->op.nb_obj += ls->op.nb_obj;
+   ms->op.nb_cycle += ls->op.nb_cycle;
+   ms->op.max_cycle = RTE_MAX(ms->op.max_cycle, ls->op.max_cycle);
+   ms->op.min_cycle = RTE_MIN(ms->op.min_cycle, ls->op.min_cycle);
+}
+
+static void
+lcore_stat_aggr(struct lcore_stat *ms, const struct lcore_stat *ls)
+{
+   ms->nb_cycle = RTE_MAX(ms->nb_cycle, ls->nb_cycle);
+   lcore_op_stat_aggr(ms, ls);
+}
+
+static void
+lcore_stat_dump(FILE *f, uint32_t lc, const struct lcore_stat *ls)
+{
+   long double st;
+
+   st = (long double)rte_get_timer_hz() / US_PER_S;
+
+   if (lc == UINT32_MAX)
+   fprintf(f, "%s(AGGREGATE)={\n", __func__);
+   else
+   fprintf(f, "%s(lc=%u)={\n", 

[dpdk-dev] [RFC 3/6] ring: introduce RTS ring mode

2020-02-24 Thread Konstantin Ananyev
Introduce relaxed tail sync (RTS) mode for MT ring synchronization.
It aims to reduce stall times when the ring is used on
overcommitted CPUs (multiple active threads on the same CPU).
The main difference from the original MP/MC algorithm is that
the tail value is increased not by every thread that finished
enqueue/dequeue, but only by the last one.
That allows threads to avoid spinning on the ring tail value,
leaving the actual tail value change to the last thread in the update
queue.

Signed-off-by: Konstantin Ananyev 
---
 lib/librte_ring/Makefile   |   3 +-
 lib/librte_ring/meson.build|   3 +-
 lib/librte_ring/rte_ring.c |  75 ++-
 lib/librte_ring/rte_ring.h | 300 +++--
 lib/librte_ring/rte_ring_rts_generic.h | 240 
 5 files changed, 598 insertions(+), 23 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_rts_generic.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 917c560ad..4f90344f4 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
rte_ring_elem.h \
rte_ring_generic.h \
-   rte_ring_c11_mem.h
+   rte_ring_c11_mem.h \
+   rte_ring_rts_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index f2f3ccc88..dc8d7dbea 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -5,7 +5,8 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
'rte_ring_elem.h',
'rte_ring_c11_mem.h',
-   'rte_ring_generic.h')
+   'rte_ring_generic.h',
+   'rte_ring_rts_generic.h')
 
 # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
 allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index fa5733907..1ce0af3e5 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,6 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 /* true if x is a power of 2 */
 #define POWEROF2(x) ((((x)-1) & (x)) == 0)
 
+/* by default set head/tail distance as 1/8 of ring capacity */
+#define HTD_MAX_DEF 8
+
 /* return the size of memory occupied by a ring */
 ssize_t
 rte_ring_get_memsize_elem(unsigned int esize, unsigned int count)
@@ -82,8 +85,56 @@ rte_ring_get_memsize(unsigned int count)
 void
 rte_ring_reset(struct rte_ring *r)
 {
-   r->prod.head = r->cons.head = 0;
-   r->prod.tail = r->cons.tail = 0;
+   memset((void *)(uintptr_t)&r->prod.tail, 0,
+   offsetof(struct rte_ring, pad1) -
+   offsetof(struct rte_ring, prod.tail));
+   memset((void *)(uintptr_t)&r->cons.tail, 0,
+   offsetof(struct rte_ring, pad2) -
+   offsetof(struct rte_ring, cons.tail));
+}
+
+/*
+ * helper function, calculates sync_type values for prod and cons
+ * based on input flags. Returns zero at success or negative
+ * errno value otherwise.
+ */
+static int
+get_sync_type(uint32_t flags, uint32_t *prod_st, uint32_t *cons_st)
+{
+   static const uint32_t prod_st_flags =
+   (RING_F_SP_ENQ | RING_F_MP_RTS_ENQ);
+   static const uint32_t cons_st_flags =
+   (RING_F_SC_DEQ | RING_F_MC_RTS_DEQ);
+
+   switch (flags & prod_st_flags) {
+   case 0:
+   *prod_st = RTE_RING_SYNC_MT;
+   break;
+   case RING_F_SP_ENQ:
+   *prod_st = RTE_RING_SYNC_ST;
+   break;
+   case RING_F_MP_RTS_ENQ:
+   *prod_st = RTE_RING_SYNC_MT_RTS;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   switch (flags & cons_st_flags) {
+   case 0:
+   *cons_st = RTE_RING_SYNC_MT;
+   break;
+   case RING_F_SC_DEQ:
+   *cons_st = RTE_RING_SYNC_ST;
+   break;
+   case RING_F_MC_RTS_DEQ:
+   *cons_st = RTE_RING_SYNC_MT_RTS;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
 }
 
 int
@@ -100,16 +151,20 @@ rte_ring_init(struct rte_ring *r, const char *name, 
unsigned count,
RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
  RTE_CACHE_LINE_MASK) != 0);
 
+   RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, sync_type) !=
+   offsetof(struct rte_ring_rts_headtail, sync_type));
+   RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, tail) !=
+   offsetof(struct rte_ring_rts_headtail, tail.val.pos));
+
/* init the ring structure */
memset(r, 0, sizeof(*r));
ret = strlcpy(r->name, name, sizeof(r->name));
if (ret < 0

[dpdk-dev] [RFC 5/6] ring: introduce HTS ring mode

2020-02-24 Thread Konstantin Ananyev
Introduce head/tail sync (HTS) mode for MT ring synchronization.
In that mode the enqueue/dequeue operation is fully serialized:
only one thread at a time is allowed to perform a given op.
It is supposed to reduce stall times when the ring is used on
overcommitted CPUs (multiple active threads on the same CPU).
As another enhancement, it provides the ability to split the
enqueue/dequeue operation into two phases:
  - enqueue/dequeue start
  - enqueue/dequeue finish
That allows the user to inspect objects in the ring without removing
them from it (aka MT-safe peek).
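
A rough usage sketch for illustration (only the creation flags and the
bulk calls exercised by the stress test later in this series are used;
the split start/finish peek API itself is not shown in this excerpt):

    #include <rte_ring.h>
    #include <rte_lcore.h>

    static void
    hts_ring_example(void)
    {
        void *objs[32];    /* would hold real object pointers */
        unsigned int free_space, avail;

        /* fully serialized enqueue/dequeue on both sides */
        struct rte_ring *r = rte_ring_create("hts_ring", 1024,
                rte_socket_id(),
                RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ);
        if (r == NULL)
            return;

        rte_ring_hts_enqueue_bulk(r, objs, RTE_DIM(objs), &free_space);
        rte_ring_hts_dequeue_bulk(r, objs, RTE_DIM(objs), &avail);
    }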

Signed-off-by: Konstantin Ananyev 
---
 lib/librte_ring/Makefile   |   1 +
 lib/librte_ring/meson.build|   1 +
 lib/librte_ring/rte_ring.c |  15 +-
 lib/librte_ring/rte_ring.h | 259 -
 lib/librte_ring/rte_ring_hts_generic.h | 228 ++
 5 files changed, 500 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_hts_generic.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 4f90344f4..0c7f8f918 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -19,6 +19,7 @@ SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
rte_ring_elem.h \
rte_ring_generic.h \
rte_ring_c11_mem.h \
+   rte_ring_hts_generic.h \
rte_ring_rts_generic.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index dc8d7dbea..5aa673199 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,6 +6,7 @@ headers = files('rte_ring.h',
'rte_ring_elem.h',
'rte_ring_c11_mem.h',
'rte_ring_generic.h',
+   'rte_ring_hts_generic.h',
'rte_ring_rts_generic.h')
 
 # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index 1ce0af3e5..d3b948667 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -102,9 +102,9 @@ static int
 get_sync_type(uint32_t flags, uint32_t *prod_st, uint32_t *cons_st)
 {
static const uint32_t prod_st_flags =
-   (RING_F_SP_ENQ | RING_F_MP_RTS_ENQ);
+   (RING_F_SP_ENQ | RING_F_MP_RTS_ENQ | RING_F_MP_HTS_ENQ);
static const uint32_t cons_st_flags =
-   (RING_F_SC_DEQ | RING_F_MC_RTS_DEQ);
+   (RING_F_SC_DEQ | RING_F_MC_RTS_DEQ | RING_F_MC_HTS_DEQ);
 
switch (flags & prod_st_flags) {
case 0:
@@ -116,6 +116,9 @@ get_sync_type(uint32_t flags, uint32_t *prod_st, uint32_t 
*cons_st)
case RING_F_MP_RTS_ENQ:
*prod_st = RTE_RING_SYNC_MT_RTS;
break;
+   case RING_F_MP_HTS_ENQ:
+   *prod_st = RTE_RING_SYNC_MT_HTS;
+   break;
default:
return -EINVAL;
}
@@ -130,6 +133,9 @@ get_sync_type(uint32_t flags, uint32_t *prod_st, uint32_t 
*cons_st)
case RING_F_MC_RTS_DEQ:
*cons_st = RTE_RING_SYNC_MT_RTS;
break;
+   case RING_F_MC_HTS_DEQ:
+   *cons_st = RTE_RING_SYNC_MT_HTS;
+   break;
default:
return -EINVAL;
}
@@ -151,6 +157,11 @@ rte_ring_init(struct rte_ring *r, const char *name, 
unsigned count,
RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
  RTE_CACHE_LINE_MASK) != 0);
 
+   RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, sync_type) !=
+   offsetof(struct rte_ring_hts_headtail, sync_type));
+   RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, tail) !=
+   offsetof(struct rte_ring_hts_headtail, ht.pos.tail));
+
RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, sync_type) !=
offsetof(struct rte_ring_rts_headtail, sync_type));
RTE_BUILD_BUG_ON(offsetof(struct rte_ring_headtail, tail) !=
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index a130aeb9d..52edcea11 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -66,11 +66,11 @@ enum {
RTE_RING_SYNC_MT, /**< multi-thread safe (default mode) */
RTE_RING_SYNC_ST, /**< single thread only */
RTE_RING_SYNC_MT_RTS, /**< multi-thread relaxed tail sync */
+   RTE_RING_SYNC_MT_HTS, /**< multi-thread head/tail sync */
 };
 
 /**
- * structure to hold a pair of head/tail values and other metadata.
- * used by RTE_RING_SYNC_MT, RTE_RING_SYNC_ST sync types.
+ * Structure to hold a pair of head/tail values and other metadata.
  * Depending on sync_type format of that structure might differ
  * depending on the sync mechanism selected, but offsets for
  * *sync_type* and *tail* values should always remain the same.
@@ -96

[dpdk-dev] [RFC 2/6] ring: rework ring layout to allow new sync schemes

2020-02-24 Thread Konstantin Ananyev
Change from *single* to *sync_type* to allow different
synchronisation schemes to be applied.
Change layout to make sure that *sync_type* and *tail*
will always reside on same offsets.

Signed-off-by: Konstantin Ananyev 
---
 app/test/test_pdump.c   |  6 +--
 lib/librte_pdump/rte_pdump.c|  2 +-
 lib/librte_port/rte_port_ring.c | 12 +++---
 lib/librte_ring/rte_ring.c  |  6 ++-
 lib/librte_ring/rte_ring.h  | 76 -
 lib/librte_ring/rte_ring_elem.h |  8 ++--
 6 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/app/test/test_pdump.c b/app/test/test_pdump.c
index ad183184c..6a1180bcb 100644
--- a/app/test/test_pdump.c
+++ b/app/test/test_pdump.c
@@ -57,8 +57,7 @@ run_pdump_client_tests(void)
if (ret < 0)
return -1;
mp->flags = 0x;
-   ring_client = rte_ring_create("SR0", RING_SIZE, rte_socket_id(),
- RING_F_SP_ENQ | RING_F_SC_DEQ);
+   ring_client = rte_ring_create("SR0", RING_SIZE, rte_socket_id(), 0);
if (ring_client == NULL) {
printf("rte_ring_create SR0 failed");
return -1;
@@ -71,9 +70,6 @@ run_pdump_client_tests(void)
}
rte_eth_dev_probing_finish(eth_dev);
 
-   ring_client->prod.single = 0;
-   ring_client->cons.single = 0;
-
printf("\n* flags = RTE_PDUMP_FLAG_TX *\n");
 
for (itr = 0; itr < NUM_ITR; itr++) {
diff --git a/lib/librte_pdump/rte_pdump.c b/lib/librte_pdump/rte_pdump.c
index 8a01ac510..65364f2c5 100644
--- a/lib/librte_pdump/rte_pdump.c
+++ b/lib/librte_pdump/rte_pdump.c
@@ -380,7 +380,7 @@ pdump_validate_ring_mp(struct rte_ring *ring, struct 
rte_mempool *mp)
rte_errno = EINVAL;
return -1;
}
-   if (ring->prod.single || ring->cons.single) {
+   if (rte_ring_prod_single(ring) || rte_ring_cons_single(ring)) {
PDUMP_LOG(ERR, "ring with either SP or SC settings"
" is not valid for pdump, should have MP and MC settings\n");
rte_errno = EINVAL;
diff --git a/lib/librte_port/rte_port_ring.c b/lib/librte_port/rte_port_ring.c
index 47fcdd06a..2f6c050fa 100644
--- a/lib/librte_port/rte_port_ring.c
+++ b/lib/librte_port/rte_port_ring.c
@@ -44,8 +44,8 @@ rte_port_ring_reader_create_internal(void *params, int 
socket_id,
/* Check input parameters */
if ((conf == NULL) ||
(conf->ring == NULL) ||
-   (conf->ring->cons.single && is_multi) ||
-   (!(conf->ring->cons.single) && !is_multi)) {
+   (rte_ring_cons_single(conf->ring) && is_multi) ||
+   (!rte_ring_cons_single(conf->ring) && !is_multi)) {
RTE_LOG(ERR, PORT, "%s: Invalid Parameters\n", __func__);
return NULL;
}
@@ -171,8 +171,8 @@ rte_port_ring_writer_create_internal(void *params, int 
socket_id,
/* Check input parameters */
if ((conf == NULL) ||
(conf->ring == NULL) ||
-   (conf->ring->prod.single && is_multi) ||
-   (!(conf->ring->prod.single) && !is_multi) ||
+   (rte_ring_prod_single(conf->ring) && is_multi) ||
+   (!rte_ring_prod_single(conf->ring) && !is_multi) ||
(conf->tx_burst_sz > RTE_PORT_IN_BURST_SIZE_MAX)) {
RTE_LOG(ERR, PORT, "%s: Invalid Parameters\n", __func__);
return NULL;
@@ -440,8 +440,8 @@ rte_port_ring_writer_nodrop_create_internal(void *params, 
int socket_id,
/* Check input parameters */
if ((conf == NULL) ||
(conf->ring == NULL) ||
-   (conf->ring->prod.single && is_multi) ||
-   (!(conf->ring->prod.single) && !is_multi) ||
+   (rte_ring_prod_single(conf->ring) && is_multi) ||
+   (!rte_ring_prod_single(conf->ring) && !is_multi) ||
(conf->tx_burst_sz > RTE_PORT_IN_BURST_SIZE_MAX)) {
RTE_LOG(ERR, PORT, "%s: Invalid Parameters\n", __func__);
return NULL;
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index 77e5de099..fa5733907 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -106,8 +106,10 @@ rte_ring_init(struct rte_ring *r, const char *name, 
unsigned count,
if (ret < 0 || ret >= (int)sizeof(r->name))
return -ENAMETOOLONG;
r->flags = flags;
-   r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-   r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+   r->prod.sync_type = (flags & RING_F_SP_ENQ) ?
+   RTE_RING_SYNC_ST : RTE_RING_SYNC_MT;
+   r->cons.sync_type = (flags & RING_F_SC_DEQ) ?
+   RTE_RING_SYNC_ST : RTE_RING_SYNC_MT;
 
if (flags & RING_F_EXACT_SZ) {
r->size = rte_align32pow2(count + 1);
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 18

[dpdk-dev] [RFC 6/6] test/ring: add contention stress test for HTS ring

2020-02-24 Thread Konstantin Ananyev
Introduce new test case to test HTS ring mode under contention.

Signed-off-by: Konstantin Ananyev 
---
 app/test/Makefile   |  1 +
 app/test/meson.build|  1 +
 app/test/test_ring_hts_stress.c | 28 
 3 files changed, 30 insertions(+)
 create mode 100644 app/test/test_ring_hts_stress.c

diff --git a/app/test/Makefile b/app/test/Makefile
index d22b9f702..ff151cd55 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -77,6 +77,7 @@ SRCS-y += test_external_mem.c
 SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
+SRCS-y += test_ring_hts_stress.c
 SRCS-y += test_ring_perf.c
 SRCS-y += test_ring_rts_stress.c
 SRCS-y += test_ring_stress.c
diff --git a/app/test/meson.build b/app/test/meson.build
index fa4fb4b51..7e58fa999 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -100,6 +100,7 @@ test_sources = files('commands.c',
'test_rib.c',
'test_rib6.c',
'test_ring.c',
+   'test_ring_hts_stress.c',
'test_ring_perf.c',
'test_ring_rts_stress.c',
'test_ring_stress.c',
diff --git a/app/test/test_ring_hts_stress.c b/app/test/test_ring_hts_stress.c
new file mode 100644
index 0..b7f2d21fc
--- /dev/null
+++ b/app/test/test_ring_hts_stress.c
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "test_ring_stress.h"
+
+static inline uint32_t
+_st_ring_dequeue_bulk(struct rte_ring *r, void **obj, uint32_t n,
+   uint32_t *avail)
+{
+   return rte_ring_hts_dequeue_bulk(r, obj, n, avail);
+}
+
+static inline uint32_t
+_st_ring_enqueue_bulk(struct rte_ring *r, void * const *obj, uint32_t n,
+   uint32_t *free)
+{
+   return rte_ring_hts_enqueue_bulk(r, obj, n, free);
+}
+
+static int
+_st_ring_init(struct rte_ring *r, const char *name, uint32_t num)
+{
+   return rte_ring_init(r, name, num,
+   RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ);
+}
+
+REGISTER_TEST_COMMAND(ring_stress_hts_autotest, test_ring_stress);
-- 
2.17.1



[dpdk-dev] [RFC 4/6] test/ring: add contention stress test for RTS ring

2020-02-24 Thread Konstantin Ananyev
Introduce new test case to test RTS ring mode under contention.

Signed-off-by: Konstantin Ananyev 
---
 app/test/Makefile   |  1 +
 app/test/meson.build|  1 +
 app/test/test_ring_rts_stress.c | 28 
 3 files changed, 30 insertions(+)
 create mode 100644 app/test/test_ring_rts_stress.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 4f586d95f..d22b9f702 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -78,6 +78,7 @@ SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_rts_stress.c
 SRCS-y += test_ring_stress.c
 SRCS-y += test_pmd_perf.c
 
diff --git a/app/test/meson.build b/app/test/meson.build
index 84dde28ad..fa4fb4b51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ test_sources = files('commands.c',
'test_rib6.c',
'test_ring.c',
'test_ring_perf.c',
+   'test_ring_rts_stress.c',
'test_ring_stress.c',
'test_rwlock.c',
'test_sched.c',
diff --git a/app/test/test_ring_rts_stress.c b/app/test/test_ring_rts_stress.c
new file mode 100644
index 0..525d222b2
--- /dev/null
+++ b/app/test/test_ring_rts_stress.c
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "test_ring_stress.h"
+
+static inline uint32_t
+_st_ring_dequeue_bulk(struct rte_ring *r, void **obj, uint32_t n,
+   uint32_t *avail)
+{
+   return rte_ring_rts_dequeue_bulk(r, obj, n, avail);
+}
+
+static inline uint32_t
+_st_ring_enqueue_bulk(struct rte_ring *r, void * const *obj, uint32_t n,
+   uint32_t *free)
+{
+   return rte_ring_rts_enqueue_bulk(r, obj, n, free);
+}
+
+static int
+_st_ring_init(struct rte_ring *r, const char *name, uint32_t num)
+{
+   return rte_ring_init(r, name, num,
+   RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ);
+}
+
+REGISTER_TEST_COMMAND(ring_stress_rts_autotest, test_ring_stress);
-- 
2.17.1



[dpdk-dev] [PATCH v2] devtools: export title syntax data for check-git-log

2020-02-24 Thread Ferruh Yigit
From: Sean Morrissey 

Moved title syntax to a separate file so that it improves code
readability and allows easy addition.

Also logic changed from checking for bad pattern to checking good
pattern which documents the expected syntax more clearly, and does not
have gaps in the checks.

Signed-off-by: Sean Morrissey 
Signed-off-by: Ferruh Yigit 
---
v2:
* renamed data file to words-case.txt and added to MAINTAINERS file
* Updated script
* as of now vdpa as file prefix is giving false positive
---
 MAINTAINERS   |  1 +
 devtools/check-git-log.sh | 63 ---
 devtools/words-case.txt   | 48 +
 3 files changed, 61 insertions(+), 51 deletions(-)
 create mode 100644 devtools/words-case.txt

diff --git a/MAINTAINERS b/MAINTAINERS
index 1886b18c3..923b49680 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -95,6 +95,7 @@ F: devtools/check-dup-includes.sh
 F: devtools/check-maintainers.sh
 F: devtools/check-forbidden-tokens.awk
 F: devtools/check-git-log.sh
+F: devtools/words-case.txt
 F: devtools/check-includes.sh
 F: devtools/check-symbol-maps.sh
 F: devtools/checkpatches.sh
diff --git a/devtools/check-git-log.sh b/devtools/check-git-log.sh
index f9d055039..2ee5eefd7 100755
--- a/devtools/check-git-log.sh
+++ b/devtools/check-git-log.sh
@@ -83,57 +83,18 @@ bad=$(echo "$headlines" | grep --color=always \
| sed 's,^,\t,')
 [ -z "$bad" ] || printf "Wrong headline uppercase:\n$bad\n"
 
-# check headline uppercase (Rx/Tx, VF, L2, MAC, Linux, ARM...)
-bad=$(echo "$headlines" | grep -E --color=always \
-   -e ':.*\<(rx|tx|RX|TX)\>' \
-   -e ':.*\<[pv]f\>' \
-   -e ':.*\<[hsf]w\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\<(Aarch64|AArch64|AARCH64|Aarch32|AArch32|AARCH32)\>' \
-   -e ':.*\<(Armv7|ARMv7|ArmV7|armV7|ARMV7)\>' \
-   -e ':.*\<(Armv8|ARMv8|ArmV8|armV8|ARMV8)\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\<[Vv]lan\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   | grep \
-   -v ':.*\' \
-   | sed 's,^,\t,')
-[ -z "$bad" ] || printf "Wrong headline lowercase:\n$bad\n"
-
-# special case check for VMDq to give good error message
-bad=$(echo "$headlines" | grep -E --color=always \
-   -e '\<(vmdq|VMDQ)\>' \
-   | sed 's,^,\t,')
-[ -z "$bad" ] || printf "Wrong headline capitalization, use 'VMDq':\n$bad\n"
+# check headline case (Rx/Tx, VF, L2, MAC, Linux ...)
+words="$selfdir/words-case.txt"
+for word in $(cat $words); do
+   bad=$(echo "$headlines" | grep -iw $word | grep -v $word)
+   if [ "$word" = "Tx" ]; then
+   bad=$(echo $bad | grep -v 'OCTEON\ TX')
+   fi
+   if [ -n "$bad" ]; then
+   bad_word=$(echo $bad | grep -io $word)
+   printf "Wrong headline case:\n\"$bad\": $bad_word --> $word\n"
+   fi
+done
 
 # check headline length (60 max)
 bad=$(echo "$headlines" |
diff --git a/devtools/words-case.txt b/devtools/words-case.txt
new file mode 100644
index 0..5d5490577
--- /dev/null
+++ b/devtools/words-case.txt
@@ -0,0 +1,48 @@
+OCTEON_TX
+aarch32
+aarch64
+API
+Arm
+armv7
+armv8
+CRC
+DCB
+DMA
+EEPROM
+FreeBSD
+FW
+HW
+IOVA
+L2
+L3
+L4
+LACP
+Linux
+LRO
+LSC
+MAC
+MSS
+MTU
+NIC
+NUMA
+NVM
+PCI
+PF
+PHY
+PMD
+RETA
+RSS
+Rx
+SCTP
+SW
+TOS
+TPID
+TSO
+TTL
+Tx
+UDP
+vDPA
+VF
+VLAN
+VMDq
+VSI
-- 
2.24.1



[dpdk-dev] [PATCH] doc: add DSCP rewrite action to 20.02 release notes

2020-02-24 Thread Suanming Mou
Add support for the new RTE_FLOW_ACTION_TYPE_SET_[IPV4/IPV6]_DSCP actions
to the release notes.

Signed-off-by: Suanming Mou 
---
 doc/guides/rel_notes/release_20_02.rst | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_02.rst 
b/doc/guides/rel_notes/release_20_02.rst
index dfebc46..4da0010 100644
--- a/doc/guides/rel_notes/release_20_02.rst
+++ b/doc/guides/rel_notes/release_20_02.rst
@@ -235,6 +235,11 @@ New Features
   Added document describes how to enable DPDK on OpenWrt in both virtual and
   physical machine.
 
+* **Added DSCP rewrite action.**
+
+  New actions ``RTE_FLOW_ACTION_TYPE_SET_[IPV4/IPV6]_DSCP`` have been added
+  to support rewrite the DSCP field in the IP header. 
+
 
 Removed Items
 -
-- 
1.8.3.1



[dpdk-dev] [PATCH v2] doc: add DSCP rewrite action to 20.02 release notes

2020-02-24 Thread Suanming Mou
Add support for the new RTE_FLOW_ACTION_TYPE_SET_[IPV4/IPV6]_DSCP actions
to the release notes.

Signed-off-by: Suanming Mou 
---

v2: fix coding style issue.

---
 doc/guides/rel_notes/release_20_02.rst | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_02.rst 
b/doc/guides/rel_notes/release_20_02.rst
index dfebc46..2e617cc 100644
--- a/doc/guides/rel_notes/release_20_02.rst
+++ b/doc/guides/rel_notes/release_20_02.rst
@@ -235,6 +235,11 @@ New Features
   Added document describes how to enable DPDK on OpenWrt in both virtual and
   physical machine.
 
+* **Added DSCP rewrite action.**
+
+  New actions ``RTE_FLOW_ACTION_TYPE_SET_[IPV4/IPV6]_DSCP`` have been added
+  to support rewrite the DSCP field in the IP header.
+
 
 Removed Items
 -
-- 
1.8.3.1



Re: [dpdk-dev] [PATCH v2] devtools: export title syntax data for check-git-log

2020-02-24 Thread Thomas Monjalon
24/02/2020 13:02, Ferruh Yigit:
> From: Sean Morrissey 
> 
> Moved title syntax to a separate file so that it improves code
> readability and allows easy addition.
> 
> Also logic changed from checking for bad pattern to checking good
> pattern which documents the expected syntax more clearly, and does not
> have gaps in the checks.
> 
> Signed-off-by: Sean Morrissey 
> Signed-off-by: Ferruh Yigit 
> ---
> v2:
> * renamed data file to words-case.txt and added to MAINTAINERS file
> * Updated script
> * as of now vdpa as file prefix is giving false positive

False positives can be avoided by filtering out what is before the first colon.





Re: [dpdk-dev] [PATCH v4 00/15] add eventmode to ipsec-secgw

2020-02-24 Thread Akhil Goyal
Hi Anoob/Lukasz,

> 
> This series introduces event-mode additions to ipsec-secgw.
> 
> With this series, ipsec-secgw would be able to run in eventmode. The
> worker thread (executing loop) would be receiving events and would be
> submitting it back to the eventdev after the processing. This way,
> multicore scaling and h/w assisted scheduling is achieved by making use
> of the eventdev capabilities.
> 
> Since the underlying event device would be having varying capabilities,
> the worker thread could be drafted differently to maximize performance.
> This series introduces usage of multiple worker threads, among which the
> one to be used will be determined by the operating conditions and the
> underlying device capabilities.
> 
> For example, if an event device - eth device pair has a Tx internal port,
> then the application can do tx_adapter_enqueue() instead of a regular
> event_enqueue(). So a thread that assumes the device pair has an
> internal port will not be the right solution for another pair. The
> infrastructure added with these patches aims to help the application
> have multiple worker threads, thereby extracting maximum performance
> from every device without affecting existing paths/use cases.
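
(As an illustrative sketch only - not code from this patch set; the
helper name worker_tx_event() and the tx_internal_port flag are
hypothetical - the choice between the two paths with the existing
eventdev API could look roughly like this:

    #include <rte_eventdev.h>
    #include <rte_event_eth_tx_adapter.h>

    static inline uint16_t
    worker_tx_event(uint8_t ev_dev_id, uint8_t ev_port_id,
            struct rte_event *ev, int tx_internal_port)
    {
        if (tx_internal_port)
            /* device pair has a Tx internal port: hand the packet
             * to the Tx adapter in a single enqueue */
            return rte_event_eth_tx_adapter_enqueue(ev_dev_id,
                    ev_port_id, ev, 1, 0);

        /* no internal port: forward the event towards the Tx adapter's
         * event queue (ev->queue_id assumed to be set accordingly) */
        ev->op = RTE_EVENT_OP_FORWARD;
        return rte_event_enqueue_burst(ev_dev_id, ev_port_id, ev, 1);
    }
)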
> 
> The eventmode configuration is predefined. All packets reaching one eth
> port will hit one event queue. All event queues will be mapped to all
> event ports. So all cores will be able to receive traffic from all ports.
> When schedule_type is set as RTE_SCHED_TYPE_ORDERED/ATOMIC, event
> device
> will ensure the ordering. Ordering would be lost when tried in PARALLEL.
> 
> Following command line options are introduced,
> 
> --transfer-mode: to choose between poll mode & event mode
> --event-schedule-type: to specify the scheduling type
>  (RTE_SCHED_TYPE_ORDERED/
>   RTE_SCHED_TYPE_ATOMIC/
>   RTE_SCHED_TYPE_PARALLEL)
> 
> Additionally the event mode introduces two modes of processing packets:
> 
> Driver-mode: This mode will have bare minimum changes in the application
>  to support ipsec. There woudn't be any lookup etc done in
>  the application. And for inline-protocol use case, the
>  thread would resemble l2fwd as the ipsec processing would be
>  done entirely in the h/w. This mode can be used to benchmark
>  the raw performance of the h/w. All the application side
>  steps (like lookup) can be redone based on the requirement
>  of the end user. Hence the need for a mode which would
>  report the raw performance.
> 
> App-mode: This mode will have all the features currently implemented with
>   ipsec-secgw (non librte_ipsec mode). All the lookups etc
>   would follow the existing methods and would report numbers
>   that can be compared against regular ipsec-secgw benchmark
>   numbers.
> 
> The driver mode is selected with existing --single-sa option
> (used also by poll mode). When --single-sa option is used
> in conjution with event mode then index passed to --single-sa
> is ignored.
> 
> Example commands to execute ipsec-secgw in various modes on OCTEON TX2
> platform,
> 
> #Inbound and outbound app mode
> ipsec-secgw -w 0002:02:00.0,ipsec_in_max_spi=128 -w
> 0002:03:00.0,ipsec_in_max_spi=128 -w 0002:0e:00.0 -w 0002:10:00.1 --log-
> level=8 -c 0x1 -- -P -p 0x3 -u 0x1 --config "(1,0,0),(0,0,0)" -f aes-gcm.cfg 
> --
> transfer-mode event --event-schedule-type parallel
> 

What is the need for adding the port/queue/core mapping in the event case?
In event mode, all queues are given to the eventdev and there is no need for
specifying such a specific mapping. In l3fwd this was not done either.

> #Inbound and outbound driver mode
> ipsec-secgw -w 0002:02:00.0,ipsec_in_max_spi=128 -w
> 0002:03:00.0,ipsec_in_max_spi=128 -w 0002:0e:00.0 -w 0002:10:00.1 --log-
> level=8 -c 0x1 -- -P -p 0x3 -u 0x1 --config "(1,0,0),(0,0,0)" -f aes-gcm.cfg 
> --
> transfer-mode event --event-schedule-type parallel --single-sa 0
> 
> This series adds non burst tx internal port workers only. It provides 
> infrastructure
> for non internal port workers, however does not define any. Also, only inline
> ipsec
> protocol mode is supported by the worker threads added.
> 
> Following are planned features,
> 1. Add burst mode workers.
> 2. Add non internal port workers.
> 3. Verify support for Rx core (the support is added but lack of h/w to 
> verify).
> 4. Add lookaside protocol support.
> 
> Following are features that Marvell won't be attempting.
> 1. Inline crypto support.
> 2. Lookaside crypto support.
> 
> For the features that Marvell won't be attempting, new workers can be
> introduced by the respective stake holders.
> 
> This series is tested on Marvell OCTEON TX2.
> This series is targeted for 20.05 release.
> 
> Changes in v4:
> * Update ipsec-secgw documentation to describe the new options as well as
>   event mode support.
> * In event m
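
As background to the Tx internal port distinction discussed in the cover
letter above, here is a minimal sketch of how a worker could pick the Tx
path from the adapter capability. It is illustrative only: the worker
context and error handling are omitted, and in a real application the
capability would be queried once at setup rather than per packet.

#include <rte_eventdev.h>
#include <rte_event_eth_tx_adapter.h>

static uint16_t
worker_tx(uint8_t ev_dev_id, uint8_t ev_port_id, uint16_t eth_port_id,
	  struct rte_event *ev)
{
	uint32_t caps = 0;

	/* Capability of the event device / ethernet device pair. */
	rte_event_eth_tx_adapter_caps_get(ev_dev_id, eth_port_id, &caps);

	if (caps & RTE_EVENT_ETH_TX_ADAPTER_CAP_INTERNAL_PORT)
		/* Internal Tx port available: enqueue to it directly. */
		return rte_event_eth_tx_adapter_enqueue(ev_dev_id, ev_port_id,
							ev, 1, 0);

	/* No internal port: send the event towards the Tx adapter queue. */
	return rte_event_enqueue_burst(ev_dev_id, ev_port_id, ev, 1);
}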

Re: [dpdk-dev] [PATCH v4 12/15] examples/ipsec-secgw: add app mode worker

2020-02-24 Thread Akhil Goyal
Hi Lukasz/Anoob,

> 
> Add application inbound/outbound worker thread and
> IPsec application processing code for event mode.
> 
> Example ipsec-secgw command in app mode:
> ipsec-secgw -w 0002:02:00.0,ipsec_in_max_spi=128
> -w 0002:03:00.0,ipsec_in_max_spi=128 -w 0002:0e:00.0 -w 0002:10:00.1
> --log-level=8 -c 0x1 -- -P -p 0x3 -u 0x1 --config "(1,0,0),(0,0,0)"
> -f aes-gcm.cfg --transfer-mode event --event-schedule-type parallel
> 
> Signed-off-by: Anoob Joseph 
> Signed-off-by: Ankur Dwivedi 
> Signed-off-by: Lukasz Bartosik 
> ---

...

> +static inline enum pkt_type
> +process_ipsec_get_pkt_type(struct rte_mbuf *pkt, uint8_t **nlp)
> +{
> + struct rte_ether_hdr *eth;
> +
> + eth = rte_pktmbuf_mtod(pkt, struct rte_ether_hdr *);
> + if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
> + *nlp = RTE_PTR_ADD(eth, RTE_ETHER_HDR_LEN +
> + offsetof(struct ip, ip_p));
> + if (**nlp == IPPROTO_ESP)
> + return PKT_TYPE_IPSEC_IPV4;
> + else
> + return PKT_TYPE_PLAIN_IPV4;
> + } else if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV6))
> {
> + *nlp = RTE_PTR_ADD(eth, RTE_ETHER_HDR_LEN +
> + offsetof(struct ip6_hdr, ip6_nxt));
> + if (**nlp == IPPROTO_ESP)
> + return PKT_TYPE_IPSEC_IPV6;
> + else
> + return PKT_TYPE_PLAIN_IPV6;
> + }
> +
> + /* Unknown/Unsupported type */
> + return PKT_TYPE_INVALID;
> +}
> +
> +static inline void
> +update_mac_addrs(struct rte_mbuf *pkt, uint16_t portid)
> +{
> + struct rte_ether_hdr *ethhdr;
> +
> + ethhdr = rte_pktmbuf_mtod(pkt, struct rte_ether_hdr *);
> + memcpy(&ethhdr->s_addr, &ethaddr_tbl[portid].src,
> RTE_ETHER_ADDR_LEN);
> + memcpy(&ethhdr->d_addr, &ethaddr_tbl[portid].dst,
> RTE_ETHER_ADDR_LEN);
> +}
> 
>  static inline void
>  ipsec_event_pre_forward(struct rte_mbuf *m, unsigned int port_id)
> @@ -61,6 +101,290 @@ prepare_out_sessions_tbl(struct sa_ctx *sa_out,
>   }
>  }
> 
> +static inline int
> +check_sp(struct sp_ctx *sp, const uint8_t *nlp, uint32_t *sa_idx)
> +{
> + uint32_t res;
> +
> + if (unlikely(sp == NULL))
> + return 0;
> +
> + rte_acl_classify((struct rte_acl_ctx *)sp, &nlp, &res, 1,
> + DEFAULT_MAX_CATEGORIES);
> +
> + if (unlikely(res == 0)) {
> + /* No match */
> + return 0;
> + }
> +
> + if (res == DISCARD)
> + return 0;
> + else if (res == BYPASS) {
> + *sa_idx = -1;
> + return 1;
> + }
> +
> + *sa_idx = res - 1;
> + return 1;
> +}
> +
> +static inline uint16_t
> +route4_pkt(struct rte_mbuf *pkt, struct rt_ctx *rt_ctx)
> +{
> + uint32_t dst_ip;
> + uint16_t offset;
> + uint32_t hop;
> + int ret;
> +
> + offset = RTE_ETHER_HDR_LEN + offsetof(struct ip, ip_dst);
> + dst_ip = *rte_pktmbuf_mtod_offset(pkt, uint32_t *, offset);
> + dst_ip = rte_be_to_cpu_32(dst_ip);
> +
> + ret = rte_lpm_lookup((struct rte_lpm *)rt_ctx, dst_ip, &hop);
> +
> + if (ret == 0) {
> + /* We have a hit */
> + return hop;
> + }
> +
> + /* else */
> + return RTE_MAX_ETHPORTS;
> +}
> +
> +/* TODO: To be tested */
> +static inline uint16_t
> +route6_pkt(struct rte_mbuf *pkt, struct rt_ctx *rt_ctx)
> +{
> + uint8_t dst_ip[16];
> + uint8_t *ip6_dst;
> + uint16_t offset;
> + uint32_t hop;
> + int ret;
> +
> + offset = RTE_ETHER_HDR_LEN + offsetof(struct ip6_hdr, ip6_dst);
> + ip6_dst = rte_pktmbuf_mtod_offset(pkt, uint8_t *, offset);
> + memcpy(&dst_ip[0], ip6_dst, 16);
> +
> + ret = rte_lpm6_lookup((struct rte_lpm6 *)rt_ctx, dst_ip, &hop);
> +
> + if (ret == 0) {
> + /* We have a hit */
> + return hop;
> + }
> +
> + /* else */
> + return RTE_MAX_ETHPORTS;
> +}
> +
> +static inline uint16_t
> +get_route(struct rte_mbuf *pkt, struct route_table *rt, enum pkt_type type)
> +{
> + if (type == PKT_TYPE_PLAIN_IPV4 || type == PKT_TYPE_IPSEC_IPV4)
> + return route4_pkt(pkt, rt->rt4_ctx);
> + else if (type == PKT_TYPE_PLAIN_IPV6 || type == PKT_TYPE_IPSEC_IPV6)
> + return route6_pkt(pkt, rt->rt6_ctx);
> +
> + return RTE_MAX_ETHPORTS;
> +}

Is it not possible to use the existing functions for finding routes, checking
packet types and checking security policies?
It will be very difficult to maintain two separate functions for the same work. I can
see that pkt->data_off does not need to be updated in the inline case, but can we
split the existing functions in two so that they can be called in the appropriate
cases?

As you have said in the cover note, lookaside protocol support is also to be added. I
also tried adding it, and it will get very
difficult to manage separate functions for separate code paths.

> +
> +static inline

Re: [dpdk-dev] [PATCH v2] devtools: export title syntax data for check-git-log

2020-02-24 Thread Ferruh Yigit
On 2/24/2020 1:39 PM, Thomas Monjalon wrote:
> 24/02/2020 13:02, Ferruh Yigit:
>> From: Sean Morrissey 
>>
>> Moved title syntax to a separate file so that it improves code
>> readability and allows easy addition.
>>
>> Also logic changed from checking for bad pattern to checking good
>> pattern which documents the expected syntax more clearly, and does not
>> have gaps in the checks.
>>
>> Signed-off-by: Sean Morrissey 
>> Signed-off-by: Ferruh Yigit 
>> ---
>> v2:
>> * renamed data file to words-case.txt and added to MAINTAINERS file
>> * Updated script
>> * as of now vdpa as file prefix is giving false positive
> 
> False positive can be avoided if filtering out what is before the first colon.
> 

It can, but I wasn't sure about adding more work/checks just to cover this case.

Also, it is possible to introduce an additional check for 'vdpa' only; right now it
is the only case that hits the file prefix.

Do you have a preference?


[dpdk-dev] [PATCH v3] devtools: export title syntax data for check-git-log

2020-02-24 Thread Ferruh Yigit
From: Sean Morrissey 

Moved title syntax to a separate file so that it improves code
readability and allows easy addition.

Also logic changed from checking for bad pattern to checking good
pattern which documents the expected syntax more clearly, and does not
have gaps in the checks.

Signed-off-by: Sean Morrissey 
Signed-off-by: Ferruh Yigit 
---
v2:
* renamed data file to words-case.txt and added to MAINTAINERS file
* Updated script
* as of now vdpa as file prefix is giving false positive

v3:
* add check excluding file prefix to address vdpa false positive
* remove OCTEON_TX line from words-case.txt which was a mistake
---
 MAINTAINERS   |  1 +
 devtools/check-git-log.sh | 69 +--
 devtools/words-case.txt   | 47 ++
 3 files changed, 64 insertions(+), 53 deletions(-)
 create mode 100644 devtools/words-case.txt

diff --git a/MAINTAINERS b/MAINTAINERS
index 1886b18c3..923b49680 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -95,6 +95,7 @@ F: devtools/check-dup-includes.sh
 F: devtools/check-maintainers.sh
 F: devtools/check-forbidden-tokens.awk
 F: devtools/check-git-log.sh
+F: devtools/words-case.txt
 F: devtools/check-includes.sh
 F: devtools/check-symbol-maps.sh
 F: devtools/checkpatches.sh
diff --git a/devtools/check-git-log.sh b/devtools/check-git-log.sh
index f9d055039..4e65be0e4 100755
--- a/devtools/check-git-log.sh
+++ b/devtools/check-git-log.sh
@@ -83,57 +83,22 @@ bad=$(echo "$headlines" | grep --color=always \
| sed 's,^,\t,')
 [ -z "$bad" ] || printf "Wrong headline uppercase:\n$bad\n"
 
-# check headline uppercase (Rx/Tx, VF, L2, MAC, Linux, ARM...)
-bad=$(echo "$headlines" | grep -E --color=always \
-   -e ':.*\<(rx|tx|RX|TX)\>' \
-   -e ':.*\<[pv]f\>' \
-   -e ':.*\<[hsf]w\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\<(Aarch64|AArch64|AARCH64|Aarch32|AArch32|AARCH32)\>' \
-   -e ':.*\<(Armv7|ARMv7|ArmV7|armV7|ARMV7)\>' \
-   -e ':.*\<(Armv8|ARMv8|ArmV8|armV8|ARMV8)\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\' \
-   -e ':.*\<[Vv]lan\>' \
-   -e ':.*\' \
-   -e ':.*\' \
-   | grep \
-   -v ':.*\' \
-   | sed 's,^,\t,')
-[ -z "$bad" ] || printf "Wrong headline lowercase:\n$bad\n"
-
-# special case check for VMDq to give good error message
-bad=$(echo "$headlines" | grep -E --color=always \
-   -e '\<(vmdq|VMDQ)\>' \
-   | sed 's,^,\t,')
-[ -z "$bad" ] || printf "Wrong headline capitalization, use 'VMDq':\n$bad\n"
+# check headline case (Rx/Tx, VF, L2, MAC, Linux ...)
+IFS='
+'
+words="$selfdir/words-case.txt"
+for word in $(cat $words); do
+   bad=$(echo "$headlines" | grep -iw $word | grep -v $word)
+   if [ "$word" = "Tx" ]; then
+   bad=$(echo $bad | grep -v 'OCTEON\ TX')
+   fi
+   for bad_line in $bad; do
+   bad_word=$(echo $bad_line | cut -d":" -f2 | grep -io $word)
+   if [ -n "$bad_word" ]; then
+   printf "Wrong headline case:\n\"$bad_line\": $bad_word 
--> $word\n"
+   fi
+   done
+done
 
 # check headline length (60 max)
 bad=$(echo "$headlines" |
@@ -187,8 +152,6 @@ done)
 [ -z "$bad" ] || printf "Missing 'Fixes' tag:\n$bad\n"
 
 # check Fixes: reference
-IFS='
-'
 fixtags=$(echo "$tags" | grep '^Fixes: ')
 bad=$(for fixtag in $fixtags ; do
hash=$(echo "$fixtag" | sed 's,^Fixes: \([0-9a-f]*\).*,\1,')
diff --git a/devtools/words-case.txt b/devtools/words-case.txt
new file mode 100644
index 0..6bfb81985
--- /dev/null
+++ b/devtools/words-case.txt
@@ -0,0 +1,47 @@
+aarch32
+aarch64
+API
+Arm
+armv7
+armv8
+CRC
+DCB
+DMA
+EEPROM
+FreeBSD
+FW
+HW
+IOVA
+L2
+L3
+L4
+LACP
+Linux
+LRO
+LSC
+MAC
+MSS
+MTU
+NIC
+NUMA
+NVM
+PCI
+PF
+PHY
+PMD
+RETA
+RSS
+Rx
+SCTP
+SW
+TOS
+TPID
+TSO
+TTL
+Tx
+UDP
+vDPA
+VF
+VLAN
+VMDq
+VSI
-- 
2.24.1



Re: [dpdk-dev] [PATCH v1] doc: update release notes for 20.02

2020-02-24 Thread David Marchand
On Sun, Feb 23, 2020 at 11:58 AM John McNamara  wrote:
> diff --git a/doc/guides/rel_notes/release_20_02.rst 
> b/doc/guides/rel_notes/release_20_02.rst
> index dfebc46..8a493e4 100644
> --- a/doc/guides/rel_notes/release_20_02.rst
> +++ b/doc/guides/rel_notes/release_20_02.rst
[snip]
> @@ -279,6 +284,8 @@ API Changes
> Also, make sure to start the actual text at the margin.
> =
>
> +* No changes in this release.
> +

No change ?

The rest looks good to me.
Thanks.


--
David Marchand



[dpdk-dev] [PATCH 1/2] vdpa/mlx5: fix guest notification timing

2020-02-24 Thread Matan Azrad
When the HW finishes consuming the guest Rx descriptors, it creates a
CQE in the CQ.

The mlx5 driver arms the CQ to get notifications when a specific CQE
index is created - the index to be armed is the next CQE index which
should be polled by the driver.

The mlx5 driver configured the kernel driver to send the notification to
the guest callfd at the same time it arrives at the mlx5 driver.

This means that the guest was notified only for the first CQE in a poll
cycle, so if the driver polled the CQEs of all the virtio queue's available
descriptors, the guest was not notified again for the rest because
there was no new polling cycle.

Hence, the Rx queues might get stuck when the guest does not work in
poll mode.

Move the guest notification to after the driver consumes all the
SW-owned CQEs.
This way, the guest is notified only after all the SW CQEs are
polled.

Also initialize the CQ to be HW-owned at the start.

Fixes: 8395927cdfaf ("vdpa/mlx5: prepare HW queues")

Signed-off-by: Matan Azrad 
---
 drivers/vdpa/mlx5/mlx5_vdpa.h   |  1 +
 drivers/vdpa/mlx5/mlx5_vdpa_event.c | 18 +++---
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/mlx5/mlx5_vdpa.h b/drivers/vdpa/mlx5/mlx5_vdpa.h
index faeb54a..3324c9d 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/mlx5_vdpa.h
@@ -39,6 +39,7 @@ struct mlx5_vdpa_cq {
uint16_t log_desc_n;
uint32_t cq_ci:24;
uint32_t arm_sn:2;
+   int callfd;
rte_spinlock_t sl;
struct mlx5_devx_obj *cq;
struct mlx5dv_devx_umem *umem_obj;
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_event.c 
b/drivers/vdpa/mlx5/mlx5_vdpa_event.c
index 17fd9dd..16276f5 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_event.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_event.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -156,17 +157,9 @@
rte_errno = errno;
goto error;
}
-   /* Subscribe CQ event to the guest FD only if it is not in poll mode. */
-   if (callfd != -1) {
-   ret = mlx5_glue->devx_subscribe_devx_event_fd(priv->eventc,
- callfd,
- cq->cq->obj, 0);
-   if (ret) {
-   DRV_LOG(ERR, "Failed to subscribe CQE event fd.");
-   rte_errno = errno;
-   goto error;
-   }
-   }
+   cq->callfd = callfd;
+   /* Init CQ to ones to be in HW owner in the start. */
+   memset((void *)(uintptr_t)cq->umem_buf, 0xFF, attr.db_umem_offset);
/* First arming. */
mlx5_vdpa_cq_arm(priv, cq);
return 0;
@@ -231,6 +224,9 @@
rte_spinlock_lock(&cq->sl);
mlx5_vdpa_cq_poll(priv, cq);
mlx5_vdpa_cq_arm(priv, cq);
+   if (cq->callfd != -1)
+   /* Notify guest for descriptors consuming. */
+   eventfd_write(cq->callfd, (eventfd_t)1);
rte_spinlock_unlock(&cq->sl);
DRV_LOG(DEBUG, "CQ %d event: new cq_ci = %u.", cq->cq->id,
cq->cq_ci);
-- 
1.8.3.1



[dpdk-dev] [PATCH 2/2] doc: update mlx5 vDPA dependencies

2020-02-24 Thread Matan Azrad
The first Mellanox OFED version to support mlx5 vDPA driver is 5.0.

Signed-off-by: Matan Azrad 
---
 doc/guides/vdpadevs/mlx5.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/guides/vdpadevs/mlx5.rst b/doc/guides/vdpadevs/mlx5.rst
index ce7c8a7..1660192 100644
--- a/doc/guides/vdpadevs/mlx5.rst
+++ b/doc/guides/vdpadevs/mlx5.rst
@@ -56,7 +56,7 @@ Supported NICs
 Prerequisites
 -
 
-- Mellanox OFED version: **4.7**
+- Mellanox OFED version: **5.0**
   see :doc:`../../nics/mlx5` guide for more Mellanox OFED details.
 
 Compilation options
-- 
1.8.3.1



Re: [dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Stephen Hemminger
On Mon, 24 Feb 2020 11:35:09 +
Konstantin Ananyev  wrote:

> Upfront note - that RFC is not a complete patch.
> It introduces an ABI breakage, plus it doesn't update ring_elem
> code properly, etc.
> I plan to deal with all these things in later versions.
> Right now I seek an initial feedback about proposed ideas.
> Would also ask people to repeat performance tests (see below)
> on their platforms to confirm the impact.
> 
> More and more customers use(/try to use) DPDK based apps within
> overcommitted systems (multiple acttive threads over same pysical cores):
> VM, container deployments, etc.
> One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> LHP is quite a common problem for spin-based sync primitives
> (spin-locks, etc.) on overcommitted systems.
> The situation gets much worse when some sort of
> fair-locking technique is used (ticket-lock, etc.).
> As now not only lock-owner but also lock-waiters scheduling
> order matters a lot.
> This is a well-known problem for kernel within VMs:
> http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> The problem with rte_ring is that while head accusion is sort of
> un-fair locking, waiting on tail is very similar to ticket lock schema -
> tail has to be updated in particular order.
> That makes current rte_ring implementation to perform
> really pure on some overcommited scenarios.

Rather than reform rte_ring to fit this scenario, it would make
more sense to me to introduce another primitive. The current lockless
ring performs very well for the isolated thread model that DPDK
was built around. This looks like a case of customers violating
the usage model of the DPDK and then being surprised at the fallout.



[dpdk-dev] [PATCH] doc: describe the pktmbuf pool with pinned external memory

2020-02-24 Thread Viacheslav Ovsiienko
Document the new mbuf pools with external pinned buffers.

Signed-off-by: Viacheslav Ovsiienko 
---
 doc/guides/prog_guide/mbuf_lib.rst | 34 ++
 doc/guides/rel_notes/release_20_02.rst |  5 +
 2 files changed, 39 insertions(+)

diff --git a/doc/guides/prog_guide/mbuf_lib.rst 
b/doc/guides/prog_guide/mbuf_lib.rst
index 0d3223b..d7be5dd 100644
--- a/doc/guides/prog_guide/mbuf_lib.rst
+++ b/doc/guides/prog_guide/mbuf_lib.rst
@@ -240,6 +240,40 @@ the memory pool for indirect buffers should be configured 
to indicate the reduce
 Examples of the initialization of a memory pool for indirect buffers (as well 
as use case examples for indirect buffers)
 can be found in several of the sample applications, for example, the IPv4 
Multicast sample application.
 
+.. _external_pinned_buffer:
+
+External Pinned Buffers
+---
+
+The regular pktmbuf pools contain only mbufs with no external buffers.
+This means the data buffer for the mbuf is placed right after the
+mbuf structure (and the private data, if any).
+
+In some cases, the application wants the buffers to be allocated
+from a different device in the platform, in order to receive the packet
+with zero copy directly into the device memory. Examples of such
+devices are GPUs or storage devices. For these cases the regular pktmbuf
+pool does not fit, since each mbuf needs to point to an external
+buffer.
+
+To support the above, a new type of pktmbuf pool is introduced - the pktmbuf
+pool with pinned external memory. A pool of this type is populated with
+mbufs pointing to the device buffers using the mbuf external buffer feature.
+The PMD populates its receive queues with those buffers, so that
+every received packet is scattered directly into the device memory.
+In the other direction, embedding the buffer pointer into the transmit
+queues of the NIC makes the DMA fetch the device memory
+using peer-to-peer communication.
+
+Such mbufs with external buffers should be handled with care when the mbuf is
+freed. Mainly, the external buffer should not be detached, so that it can
+be reused for the next packet reception. To support this pool feature the PMDs
+should be aware that the pool mbuf allocator might return buffers with the
+EXT_ATTACHED_MBUF flag set.
+
+To create a pktmbuf pool with pinned external memory, the dedicated
+routine rte_pktmbuf_pool_create_extbuf() is provided.
+
 Debug
 -
 
diff --git a/doc/guides/rel_notes/release_20_02.rst 
b/doc/guides/rel_notes/release_20_02.rst
index dfebc46..a0e6734 100644
--- a/doc/guides/rel_notes/release_20_02.rst
+++ b/doc/guides/rel_notes/release_20_02.rst
@@ -65,6 +65,10 @@ New Features
 
   New APIs have been added to support rings with custom element size.
 
+* **Added rte_mbuf pool with Pinned External Memory.**
+
+  Added support for a new type of pktmbuf pool.
+
 * **Updated rte_flow api to support L2TPv3 over IP flows.**
 
   Added support for new flow item to handle L2TPv3 over IP rte_flow patterns.
@@ -132,6 +136,7 @@ New Features
 
   * Added support for RSS using L3/L4 source/destination only.
   * Added support for matching on GTP tunnel header item.
+  * Added support for the mbufs with external pinned buffers.
   * Removed limitation of matching on tagged/untagged packets (when using DV 
flow engine).
   * Added BlueField-2 integrated ConnectX-6 Dx device support.
 
-- 
1.8.3.1
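
To make the new pool type concrete, a minimal creation sketch follows. The
prototype of rte_pktmbuf_pool_create_extbuf() and the fields of struct
rte_pktmbuf_extmem are assumed here for illustration (they may differ from
the merged API), and the device buffer pointer, its IOVA and the sizes are
placeholders supplied by the application.

#include <rte_mbuf.h>
#include <rte_lcore.h>

static struct rte_mempool *
create_pinned_pool(void *dev_buf, rte_iova_t dev_iova, size_t dev_len)
{
	/* Describe the externally pinned memory area backing the mbufs. */
	struct rte_pktmbuf_extmem ext_mem = {
		.buf_ptr = dev_buf,   /* pinned external buffer area */
		.buf_iova = dev_iova, /* IO address usable for DMA */
		.buf_len = dev_len,   /* total length of the area */
		.elt_size = 2048,     /* data buffer size per mbuf */
	};

	/* One external memory segment; mbuf headers stay in host memory. */
	return rte_pktmbuf_pool_create_extbuf("mb_pool_pinned",
					      8192, /* number of mbufs */
					      256,  /* per-core cache size */
					      0,    /* private area size */
					      ext_mem.elt_size,
					      rte_socket_id(),
					      &ext_mem, 1);
}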



[dpdk-dev] [PATCH] doc/mlx5: update mlx5 guide

2020-02-24 Thread Viacheslav Ovsiienko
- metadata limitation is described
- no inline hint flag is described

Signed-off-by: Viacheslav Ovsiienko 
---
 doc/guides/nics/mlx5.rst | 32 
 1 file changed, 32 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index dd2fbde..2542248 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -96,6 +96,8 @@ Features
   increment/decrement, count, drop, mark. For details please see 
:ref:`mlx5_offloads_support`.
 - Flow insertion rate of more then million flows per second, when using Direct 
Rules.
 - Support for multiple rte_flow groups.
+- per packet no_inline hint flag to disable packet data copying into Tx
+  descriptors.
 - Hardware LRO.
 
 Limitations
@@ -162,6 +164,9 @@ Limitations
  - msg_type
  - teid
 
+- No Tx metadata goes to the E-Switch steering domain for flow group 0.
+  Flows within group 0 with a set metadata action are rejected by hardware.
+
 .. note::
 
MAC addresses not already present in the bridge table of the associated
@@ -185,6 +190,33 @@ Limitations
   To receive IPv6 Multicast messages on VM, explicitly set the relevant
   MAC address using rte_eth_dev_mac_addr_add() API.
 
+- To support a mixed traffic pattern (some buffers from local host memory, some
+  buffers from other devices) with high bandwidth, a hint flag is introduced in
+  the mbuf.
+
+  An application hints the PMD whether or not it should try to inline the
+  given mbuf data buffer. The PMD should make a best effort to act upon this
+  request.
+
+  The hint flag RTE_NET_MLX5_DYNFLAG_NO_INLINE_NAME is supposed to be dynamic,
+  registered by the application with rte_mbuf_dynflag_register(). This flag is
+  purely vendor specific and declared in the PMD specific header rte_pmd_mlx5.h,
+  which is intended to be used by specific applications.
+
+  To query the supported specific flags at runtime, the private routine
+  rte_pmd_mlx5_get_dyn_flag_names is introduced. It returns the array of
+  specific flags currently supported (for the present hardware and configuration).
+  The "not inline hint" feature operating flow is the following one:
+- application start
+- probe the devices, ports are created
+- query the port capabilities
+- if a port supporting the feature is found
+- register the dynamic flag RTE_NET_MLX5_DYNFLAG_NO_INLINE_NAME
+- application starts the ports
+- on dev_start() the PMD checks whether the feature flag is registered and
+  enables the feature support in the datapath
+- the application may set this flag in the ol_flags field of the mbufs of the
+  packets being sent and the PMD will handle them appropriately.
+
 - The amount of descriptors in Tx queue may be limited by data inline settings.
   Inline data require the more descriptor building blocks and overall block
   amount may exceed the hardware supported limits. The application should
-- 
1.8.3.1
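
As an illustration of the "not inline hint" flow described in the patch
above, a short application-side sketch is given here. It relies on the
generic rte_mbuf dynamic flag API; the flag name macro comes from
rte_pmd_mlx5.h as stated in the guide, while the surrounding helpers and
error handling are simplified placeholders.

#include <stdint.h>
#include <string.h>
#include <rte_mbuf.h>
#include <rte_mbuf_dyn.h>
#include <rte_string_fns.h>
#include <rte_pmd_mlx5.h>

static uint64_t no_inline_flag; /* bit mask of the registered hint flag */

/* Register the vendor-specific hint flag once, after the port capability
 * query confirmed that the feature is supported.
 */
static int
register_no_inline_hint(void)
{
	struct rte_mbuf_dynflag desc;
	int bit;

	memset(&desc, 0, sizeof(desc));
	rte_strlcpy(desc.name, RTE_NET_MLX5_DYNFLAG_NO_INLINE_NAME,
		    sizeof(desc.name));
	bit = rte_mbuf_dynflag_register(&desc);
	if (bit < 0)
		return -1; /* flag not available on this setup */
	no_inline_flag = UINT64_C(1) << bit;
	return 0;
}

/* In the Tx path, hint the PMD not to copy this packet's data inline. */
static inline void
mark_no_inline(struct rte_mbuf *m)
{
	m->ol_flags |= no_inline_flag;
}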



Re: [dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Jerin Jacob
On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
 wrote:
>
> On Mon, 24 Feb 2020 11:35:09 +
> Konstantin Ananyev  wrote:
>
> > Upfront note - that RFC is not a complete patch.
> > It introduces an ABI breakage, plus it doesn't update ring_elem
> > code properly, etc.
> > I plan to deal with all these things in later versions.
> > Right now I seek an initial feedback about proposed ideas.
> > Would also ask people to repeat performance tests (see below)
> > on their platforms to confirm the impact.
> >
> > More and more customers use(/try to use) DPDK based apps within
> > overcommitted systems (multiple acttive threads over same pysical cores):
> > VM, container deployments, etc.
> > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > LHP is quite a common problem for spin-based sync primitives
> > (spin-locks, etc.) on overcommitted systems.
> > The situation gets much worse when some sort of
> > fair-locking technique is used (ticket-lock, etc.).
> > As now not only lock-owner but also lock-waiters scheduling
> > order matters a lot.
> > This is a well-known problem for kernel within VMs:
> > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > The problem with rte_ring is that while head accusion is sort of
> > un-fair locking, waiting on tail is very similar to ticket lock schema -
> > tail has to be updated in particular order.
> > That makes current rte_ring implementation to perform
> > really pure on some overcommited scenarios.
>
> Rather than reform rte_ring to fit this scenario, it would make
> more sense to me to introduce another primitive. The current lockless
> ring performs very well for the isolated thread model that DPDK
> was built around. This looks like a case of customers violating
> the usage model of the DPDK and then being surprised at the fallout.

I agree with Stephen here.

I think adding more runtime checks in the enqueue() and dequeue() will
have a bad effect on the low-end cores too.
But I agree with the problem statement that in the virtualization use
case, it may be possible to have N virtual cores running on a physical
core.

IMO, the best solution would be keeping the ring API the same and having a
different flavor at "compile-time". Something like
liburcu did for accommodating different flavors.

i.e. urcu-qsbr.h and urcu-bp.h will have identical definitions of the API. The
application can simply include ONE header file in a C file based on
the flavor.
If both are needed at runtime, have a function pointer or so in the
application and define the function in a different C file by including
the appropriate flavor in that C file.

#include <urcu-qsbr.h> /* QSBR RCU flavor */
#include <urcu-bp.h> /* Bulletproof RCU flavor */
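
A rough sketch of the function-pointer variant mentioned above is shown
below. All names with an app_ring prefix are hypothetical (rte_ring has no
such flavors today); the point is only that the application keeps one call
site per datapath while each flavor lives in its own C file.

#include <stdbool.h>
#include <rte_ring.h>

/* Hypothetical vtable, filled differently by ring_classic.c and ring_oc.c,
 * each of which would include its own flavor header.
 */
struct app_ring_ops {
	unsigned int (*enqueue_burst)(struct rte_ring *r, void * const *objs,
				      unsigned int n, unsigned int *free_space);
	unsigned int (*dequeue_burst)(struct rte_ring *r, void **objs,
				      unsigned int n, unsigned int *avail);
};

/* Classic flavor backed by the existing rte_ring burst calls. */
const struct app_ring_ops app_ring_classic_ops = {
	.enqueue_burst = rte_ring_enqueue_burst,
	.dequeue_burst = rte_ring_dequeue_burst,
};

/* Hypothetical LHP-tolerant flavor, defined in its own C file. */
extern const struct app_ring_ops app_ring_overcommit_ops;

static const struct app_ring_ops *ring_ops;

/* Selected once at init, e.g. from a command-line option. */
static void
select_ring_flavor(bool overcommitted)
{
	ring_ops = overcommitted ? &app_ring_overcommit_ops
				 : &app_ring_classic_ops;
}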















Re: [dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Stephen Hemminger
On Mon, 24 Feb 2020 23:29:57 +0530
Jerin Jacob  wrote:

> On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
>  wrote:
> >
> > On Mon, 24 Feb 2020 11:35:09 +
> > Konstantin Ananyev  wrote:
> >  
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > code properly, etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below)
> > > on their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK based apps within
> > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption with rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of
> > > fair-locking technique is used (ticket-lock, etc.).
> > > As now not only lock-owner but also lock-waiters scheduling
> > > order matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head accusion is sort of
> > > un-fair locking, waiting on tail is very similar to ticket lock schema -
> > > tail has to be updated in particular order.
> > > That makes current rte_ring implementation to perform
> > > really pure on some overcommited scenarios.  
> >
> > Rather than reform rte_ring to fit this scenario, it would make
> > more sense to me to introduce another primitive. The current lockless
> > ring performs very well for the isolated thread model that DPDK
> > was built around. This looks like a case of customers violating
> > the usage model of the DPDK and then being surprised at the fallout.  
> 
> I agree with Stephen here.
> 
> I think, adding more runtime check in the enqueue() and dequeue() will
> have a bad effect on the low-end cores too.
> But I agree with the problem statement that in the virtualization use
> case, It may be possible to have N virtual cores runs on a physical
> core.
> 
> IMO, The best solution would be keeping the ring API same and have a
> different flavor in "compile-time". Something like
> liburcu did for accommodating different flavors.
> 
> i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> application can simply include ONE header file in a C file based on
> the flavor.
> If need both at runtime. Need to have function pointer or so in the
> application and define the function in different c file by including
> the approaite flavor in C file.

This would also be a good time to consider the tradeoffs of the
heavy use of inlining that is done in rte_ring vs the impact that
has on API/ABI stability.




[dpdk-dev] [PATCH v3] doc: add IP DSCP rewrite to mlx5 and release notes

2020-02-24 Thread Thomas Monjalon
The new rte_flow feature for DSCP field rewrite offload was
missing in the release notes.

The mlx5 requirements for DSCP field rewrite offload were missing.

Fixes: 8482ffe4b68b ("ethdev: add IPv4/IPv6 DSCP rewrite action")
Fixes: 6f26e604a9c3 ("net/mlx5: support IPv4/IPv6 DSCP rewrite action")

Signed-off-by: Suanming Mou 
Signed-off-by: Thomas Monjalon 
---
 doc/guides/nics/mlx5.rst   | 5 +
 doc/guides/rel_notes/release_20_02.rst | 6 ++
 2 files changed, 11 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 2af359ccee..e67bba90cb 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1184,6 +1184,11 @@ Supported hardware offloads
| | set_mac_src /   | |   | |   |
| | set_mac_dst)| |   | |   |
+---+-+-+
+   | | Header rewrite  | | DPDK 20.02| | DPDK 20.02|
+   | | (set_dscp)  | | OFED 5.0  | | OFED 5.0  |
+   | | | | rdma-core 24  | | rdma-core 24  |
+   | | | | ConnectX-5| | ConnectX-5|
+   +---+-+-+
| Jump  | | DPDK 19.05| | DPDK 19.02|
|   | | OFED 4.7-1| | OFED 4.7-1|
|   | | rdma-core 24  | | N/A   |
diff --git a/doc/guides/rel_notes/release_20_02.rst 
b/doc/guides/rel_notes/release_20_02.rst
index dfebc467e0..0c33b342aa 100644
--- a/doc/guides/rel_notes/release_20_02.rst
+++ b/doc/guides/rel_notes/release_20_02.rst
@@ -69,6 +69,11 @@ New Features
 
   Added support for new flow item to handle L2TPv3 over IP rte_flow patterns.
 
+* **Added DSCP rewrite action.**
+
+  New actions ``RTE_FLOW_ACTION_TYPE_SET_IPV[4/6]_DSCP`` have been added
+  to support rewriting the DSCP field in the IP header.
+
 * **Added IONIC net PMD.**
 
   Added the new ``ionic`` net driver for Pensando Ethernet Network Adapters.
@@ -133,6 +138,7 @@ New Features
   * Added support for RSS using L3/L4 source/destination only.
   * Added support for matching on GTP tunnel header item.
   * Removed limitation of matching on tagged/untagged packets (when using DV 
flow engine).
+  * Added support for IPv4/IPv6 DSCP rewrite action.
   * Added BlueField-2 integrated ConnectX-6 Dx device support.
 
 * **Add new vDPA PMD based on Mellanox devices**
-- 
2.25.0
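
For reference, a minimal sketch of using the new action from an
application is shown below. The pattern, attributes and queue index are
placeholders, and the action configuration is assumed to be the
conventional struct rte_flow_action_set_dscp carrying the DSCP value.

#include <rte_flow.h>

/* Rewrite DSCP to 10 (AF11) for matched IPv4 ingress traffic and steer it
 * to Rx queue 0.
 */
static struct rte_flow *
add_dscp_rewrite_rule(uint16_t port_id, struct rte_flow_error *err)
{
	struct rte_flow_attr attr = { .ingress = 1 };
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action_set_dscp set_dscp = { .dscp = 10 };
	struct rte_flow_action_queue queue = { .index = 0 };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP,
		  .conf = &set_dscp },
		{ .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	return rte_flow_create(port_id, &attr, pattern, actions, err);
}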



[dpdk-dev] [PATCH] doc: remove redundant line in mlx5 guide

2020-02-24 Thread Thomas Monjalon
When adding GTP to the list of supported tunnels,
the old line was not removed.

Fixes: f31d7a0171da ("net/mlx5: support GTP")

Signed-off-by: Thomas Monjalon 
---
 doc/guides/nics/mlx5.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 62456b07e4..2af359ccee 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -91,7 +91,6 @@ Features
 - RX interrupts.
 - Statistics query including Basic, Extended and per queue.
 - Rx HW timestamp.
-- Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, 
Geneve.
 - Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP, IP-in-IP, 
Geneve, GTP.
 - Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum 
verification.
 - NIC HW offloads: encapsulation (vxlan, gre, mplsoudp, mplsogre), NAT, 
routing, TTL
-- 
2.25.0



[dpdk-dev] [PATCH] doc: remove not supported features from hinic matrix

2020-02-24 Thread Thomas Monjalon
Only the supported features are supposed to be listed
in the networking features matrix.

Signed-off-by: Thomas Monjalon 
---
 doc/guides/nics/features/hinic.ini | 4 
 1 file changed, 4 deletions(-)

diff --git a/doc/guides/nics/features/hinic.ini 
b/doc/guides/nics/features/hinic.ini
index 7d431e8c7e..5be05d36fb 100644
--- a/doc/guides/nics/features/hinic.ini
+++ b/doc/guides/nics/features/hinic.ini
@@ -38,9 +38,5 @@ FW version   = Y
 Multiprocess aware   = Y
 Linux UIO= Y
 Linux VFIO   = Y
-BSD nic_uio  = N
 x86-64   = Y
 ARMv8= Y
-ARMv7= N
-x86-32   = N
-Power8   = N
-- 
2.25.0



[dpdk-dev] [PATCH] doc: fix naming of Mellanox devices

2020-02-24 Thread Thomas Monjalon
The devices of the family ConnectX may have two letters as suffix.
Such suffix is preceded with a space and the second x is lowercase:
- ConnectX-4 Lx
- ConnectX-5 Ex
- ConnectX-6 Dx

Uppercase of the device family name BlueField is also fixed.

The lists of supported devices are fixed.

Signed-off-by: Thomas Monjalon 
---
 config/common_base |  4 +-
 doc/guides/nics/mlx4.rst   |  5 --
 doc/guides/nics/mlx5.rst   | 81 --
 doc/guides/rel_notes/release_18_08.rst |  2 +-
 doc/guides/rel_notes/release_19_02.rst |  2 +-
 doc/guides/rel_notes/release_19_11.rst |  2 +-
 doc/guides/vdpadevs/mlx5.rst   | 12 ++--
 drivers/net/mlx5/mlx5.c|  2 +-
 drivers/net/mlx5/mlx5_rxtx.c   |  2 +-
 drivers/net/mlx5/mlx5_txq.c|  2 +-
 10 files changed, 64 insertions(+), 50 deletions(-)

diff --git a/config/common_base b/config/common_base
index 6ea9c63cc3..7ca2f28b19 100644
--- a/config/common_base
+++ b/config/common_base
@@ -361,13 +361,13 @@ CONFIG_RTE_LIBRTE_MLX4_DEBUG=n
 
 #
 # Compile burst-oriented Mellanox ConnectX-4, ConnectX-5,
-# ConnectX-6 & Bluefield (MLX5) PMD
+# ConnectX-6 & BlueField (MLX5) PMD
 #
 CONFIG_RTE_LIBRTE_MLX5_PMD=n
 CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
 
 #
-# Compile vdpa-oriented Mellanox ConnectX-6 & Bluefield (MLX5) PMD
+# Compile vdpa-oriented Mellanox ConnectX-6 & BlueField (MLX5) PMD
 #
 CONFIG_RTE_LIBRTE_MLX5_VDPA_PMD=n
 
diff --git a/doc/guides/nics/mlx4.rst b/doc/guides/nics/mlx4.rst
index 4b1d1aceb2..1f1e2f6c77 100644
--- a/doc/guides/nics/mlx4.rst
+++ b/doc/guides/nics/mlx4.rst
@@ -298,11 +298,6 @@ Installing Mellanox OFED
 
 5. Continue with :ref:`section 2 of the Quick Start Guide `.
 
-Supported NICs
---
-
-* Mellanox(R) ConnectX(R)-3 Pro 40G MCX354A-FCC_Ax (2*40G)
-
 .. _qsg:
 
 Quick Start Guide
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index dd2fbde605..7da96d2e93 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -2,12 +2,14 @@
 Copyright 2015 6WIND S.A.
 Copyright 2015 Mellanox Technologies, Ltd
 
+.. include:: 
+
 MLX5 poll mode driver
 =
 
 The MLX5 poll mode driver library (**librte_pmd_mlx5**) provides support
 for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx** , **Mellanox
-ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6DX** and
+ConnectX-5**, **Mellanox ConnectX-6**, **Mellanox ConnectX-6 Dx** and
 **Mellanox BlueField** families of 10/25/40/50/100/200 Gb/s adapters
 as well as their virtual functions (VF) in SR-IOV context.
 
@@ -324,9 +326,9 @@ Run-time configuration
 
   Supported on:
 
-  - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5, ConnectX-6, ConnectX-6 
DX
+  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 
Dx
 and BlueField.
-  - POWER9 and ARMv8 with ConnectX-4 LX, ConnectX-5, ConnectX-6, ConnectX-6 DX
+  - POWER9 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx
 and BlueField.
 
 - ``rxq_cqe_pad_en`` parameter [int]
@@ -357,9 +359,9 @@ Run-time configuration
 
   Supported on:
 
-  - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5, ConnectX-6, ConnectX-6 
DX
+  - x86_64 with ConnectX-4, ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 
Dx
 and BlueField.
-  - POWER8 and ARMv8 with ConnectX-4 LX, ConnectX-5, ConnectX-6, ConnectX-6 DX
+  - POWER8 and ARMv8 with ConnectX-4 Lx, ConnectX-5, ConnectX-6, ConnectX-6 Dx
 and BlueField.
 
 - ``mprq_en`` parameter [int]
@@ -462,14 +464,14 @@ Run-time configuration
   If ``txq_inline_min`` key is not present, the value may be queried by the
   driver from the NIC via DevX if this feature is available. If there is no 
DevX
   enabled/supported the value 18 (supposing L2 header including VLAN) is set
-  for ConnectX-4 and ConnectX-4LX, and 0 is set by default for ConnectX-5
+  for ConnectX-4 and ConnectX-4 Lx, and 0 is set by default for ConnectX-5
   and newer NICs. If packet is shorter the ``txq_inline_min`` value, the entire
   packet is inlined.
 
   For ConnectX-4 NIC, driver does not allow specifying value below 18
   (minimal L2 header, including VLAN), error will be raised.
 
-  For ConnectX-4LX NIC, it is allowed to specify values below 18, but
+  For ConnectX-4 Lx NIC, it is allowed to specify values below 18, but
   it is not recommended and may prevent NIC from sending packets over
   some configurations.
 
@@ -552,7 +554,7 @@ Run-time configuration
 - ``txq_mpw_en`` parameter [int]
 
   A nonzero value enables Enhanced Multi-Packet Write (eMPW) for ConnectX-5,
-  ConnectX-6, ConnectX-6 DX and BlueField. eMPW allows the TX burst function 
to pack
+  ConnectX-6, ConnectX-6 Dx and BlueField. eMPW allows the TX burst function 
to pack
   up multiple packets in a single descriptor session in order to save PCI 
bandwidth
   and improve performance at the cost of a slightly higher CPU usage. When
   ``txq_inline_mpw`` is set along with `

Re: [dpdk-dev] [PATCH] doc: fix naming of Mellanox devices

2020-02-24 Thread Thomas Monjalon
> > The devices of the family ConnectX may have two letters as suffix.
> > Such suffix is preceded with a space and the second x is lowercase:
> > - ConnectX-4 Lx
> > - ConnectX-5 Ex
> > - ConnectX-6 Dx
> > 
> > Uppercase of the device family name BlueField is also fixed.
> > 
> > Signed-off-by: Thomas Monjalon 
> Acked-by: Viacheslav Ovsiienko 

I've sent another version of this patch and forgot there was already a v1.
https://patches.dpdk.org/patch/66019/




[dpdk-dev] [RFC 1/1] lib/ring: add scatter gather and serial dequeue APIs

2020-02-24 Thread Honnappa Nagarahalli
Add scatter gather APIs to avoid intermediate memcpy. Serial
dequeue APIs are added to support access to ring elements
before actual dequeue.

Signed-off-by: Honnappa Nagarahalli 
Reviewed-by: Gavin Hu 
Reviewed-by: Ola Liljedahl 
---
 lib/librte_ring/Makefile   |   1 +
 lib/librte_ring/meson.build|   1 +
 lib/librte_ring/rte_ring_c11_mem.h |  98 +++
 lib/librte_ring/rte_ring_elem_sg.h | 417 +
 lib/librte_ring/rte_ring_generic.h |  93 +++
 5 files changed, 610 insertions(+)
 create mode 100644 lib/librte_ring/rte_ring_elem_sg.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 917c560ad..824e4a9bb 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -17,6 +17,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
rte_ring_elem.h \
+   rte_ring_elem_sg.h \
rte_ring_generic.h \
rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index f2f3ccc88..30115ad7c 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -4,6 +4,7 @@
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
'rte_ring_elem.h',
+   'rte_ring_elem_sg.h',
'rte_ring_c11_mem.h',
'rte_ring_generic.h')
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h 
b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..dcae8bcc0 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,102 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
return n;
 }
 
+/**
+ * @internal This function updates the consumer head if there are no
+ * prior reserved elements on the ring.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements to dequeue, i.e. how far should the head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_serial(struct rte_ring *r, int is_sc,
+   unsigned int n, enum rte_ring_queue_behavior behavior,
+   uint32_t *old_head, uint32_t *new_head,
+   uint32_t *entries)
+{
+   unsigned int max = n;
+   uint32_t prod_tail;
+   uint32_t cons_tail;
+   int success;
+
+   /* move cons.head atomically */
+   *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
+   do {
+   /* Restore n as it may change every loop */
+   n = max;
+
+   /* Load the cons.tail and ensure that it is the
+* same as cons.head. load-acquire synchronizes
+* with the store-release in update_tail.
+*/
+   cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
+   if (*old_head != cons_tail) {
+   rte_pause();
+   *old_head = __atomic_load_n(&r->cons.head,
+   __ATOMIC_RELAXED);
+   success = 0;
+   continue;
+   }
+
+   /* this load-acquire synchronize with store-release of ht->tail
+* in update_tail.
+*/
+   prod_tail = __atomic_load_n(&r->prod.tail,
+   __ATOMIC_ACQUIRE);
+
+   /* The subtraction is done between two unsigned 32bits value
+* (the result is always modulo 32 bits even if we have
+* cons_head > prod_tail). So 'entries' is always between 0
+* and size(ring)-1.
+*/
+   *entries = (prod_tail - *old_head);
+
+   /* Set the actual entries for dequeue */
+   if (n > *entries)
+   n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+   if (unlikely(n == 0))
+   return 0;
+
+   *new_head = *old_head + n;
+   if (is_sc)
+   r->cons.head = *new_head, success = 1;
+   else
+   /* on failure, *old_head will be update

[dpdk-dev] [RFC 0/1] lib/ring: add scatter gather and serial dequeue APIs

2020-02-24 Thread Honnappa Nagarahalli
Cover-letter:
RCU defer queue (DQ) APIs place 3 requirements on rte_ring library.
1) Multiple entities are responsible for providing/consuming the
   data in a single element of the DQ. Existing rte_ring APIs
   require an intermediate memcpy in such cases.

   RCU DQ API, rte_rcu_qsbr_dq_enqueue, has to store the token
   (generated by the RCU DQ APIs) and the application data
   (provided by the application) in each element of the DQ.

2) Dequeue the data from DQ only if it meets certain criteria.
   i.e. data needs to be accessed before it can be dequeued.

   RCU DQ API, rte_rcu_qsbr_dq_reclaim, can dequeue the
   element from DQ, only if the token at the head of
   the DQ can be reclaimed. If the token cannot be reclaimed
   the reserved elements need to be discarded.

3) While dequeuing from DQ, only one thread should be allowed
   to reserve elements.

   In order to make rte_rcu_qsbr_dq_reclaim API lock free, the
   'reserve elements in DQ, reclaim the token and revert/commit
   elements in DQ' process needs to be atomic.

The first requirement is satisfied by providing scatter-gather APIs
in rte_ring library. The enqueue/dequeue operations are split
into 3 parts:
a) Move producer's/consumer's head index, i.e. reserve elements in
   the ring and return the pointer in the ring to where the data
   needs to be copied to/from.
b) The application copies the data to/from the pointer.
c) Move producer's/consumer's tail index. i.e. indicate that
   the reserved elements are successfully consumed.
RCU DQ APIs require single element, multiple producer enqueue
operations. 'rte_ring_mp_enqueue_elem_reserve' and
'rte_ring_mp_enqueue_elem_commit' APIs are provided to address
these requirements.

The second and third requirements are satisfied by providing
rte_ring APIs for:
a) Move consumer's head index only if there are no elements
   reserved on the ring.
b) Reset the consumer's head index to its original value.
   i.e. discard the reserved elements.

In this case, RCU DQ APIs require single element, single consumer
dequeue operations. 'rte_ring_dequeue_elem_reserve_serial',
'rte_ring_dequeue_elem_commit' and
'rte_ring_dequeue_elem_revert_serial' APIs are provided to address
these requirements.

Honnappa Nagarahalli (1):
  lib/ring: add scatter gather and serial dequeue APIs

 lib/librte_ring/Makefile   |   1 +
 lib/librte_ring/meson.build|   1 +
 lib/librte_ring/rte_ring_c11_mem.h |  98 +++
 lib/librte_ring/rte_ring_elem_sg.h | 417 +
 lib/librte_ring/rte_ring_generic.h |  93 +++
 5 files changed, 610 insertions(+)
 create mode 100644 lib/librte_ring/rte_ring_elem_sg.h

-- 
2.17.1
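
To make the intended reserve/copy/commit flow easier to follow, a rough
usage sketch of the proposed enqueue split is given below. The prototypes
of the reserve and commit calls are assumed for illustration only; they
are not taken from the patch, and the real API (including how wrap-around
of the reserved area is exposed) is defined by the RFC itself.

#include <errno.h>
#include <rte_ring.h>
#include <rte_ring_elem_sg.h> /* header added by the RFC patch */

/* Element layout used by the hypothetical RCU defer queue example:
 * the token is filled by the RCU layer, the user data by the application.
 */
struct dq_elem {
	uint64_t token;
	void *user_data;
};

static int
dq_enqueue(struct rte_ring *r, uint64_t token, void *user_data)
{
	struct dq_elem *slot;

	/* Assumed prototype: reserve one element of the given size and get
	 * a pointer to it inside the ring, avoiding an intermediate copy.
	 */
	if (rte_ring_mp_enqueue_elem_reserve(r, sizeof(*slot),
					     (void **)&slot) == 0)
		return -ENOSPC;

	slot->token = token;
	slot->user_data = user_data;

	/* Make the reserved element visible to consumers. */
	rte_ring_mp_enqueue_elem_commit(r);
	return 0;
}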



Re: [dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Honnappa Nagarahalli



> -Original Message-
> From: dev  On Behalf Of Stephen Hemminger
> Sent: Monday, February 24, 2020 1:35 PM
> To: Jerin Jacob 
> Cc: Konstantin Ananyev ; dpdk-dev
> ; Olivier Matz 
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> 
> On Mon, 24 Feb 2020 23:29:57 +0530
> Jerin Jacob  wrote:
> 
> > On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
> >  wrote:
> > >
> > > On Mon, 24 Feb 2020 11:35:09 +
> > > Konstantin Ananyev  wrote:
> > >
> > > > Upfront note - that RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update ring_elem
> > > > code properly, etc.
> > > > I plan to deal with all these things in later versions.
> > > > Right now I seek an initial feedback about proposed ideas.
> > > > Would also ask people to repeat performance tests (see below) on
> > > > their platforms to confirm the impact.
> > > >
> > > > More and more customers use(/try to use) DPDK based apps within
> > > > overcommitted systems (multiple acttive threads over same pysical
> cores):
> > > > VM, container deployments, etc.
> > > > One quite common problem they hit: Lock-Holder-Preemption with
> rte_ring.
> > > > LHP is quite a common problem for spin-based sync primitives
> > > > (spin-locks, etc.) on overcommitted systems.
> > > > The situation gets much worse when some sort of fair-locking
> > > > technique is used (ticket-lock, etc.).
> > > > As now not only lock-owner but also lock-waiters scheduling order
> > > > matters a lot.
> > > > This is a well-known problem for kernel within VMs:
> > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > The problem with rte_ring is that while head accusion is sort of
> > > > un-fair locking, waiting on tail is very similar to ticket lock
> > > > schema - tail has to be updated in particular order.
> > > > That makes current rte_ring implementation to perform really pure
> > > > on some overcommited scenarios.
> > >
> > > Rather than reform rte_ring to fit this scenario, it would make more
> > > sense to me to introduce another primitive. The current lockless
> > > ring performs very well for the isolated thread model that DPDK was
> > > built around. This looks like a case of customers violating the
> > > usage model of the DPDK and then being surprised at the fallout.
> >
> > I agree with Stephen here.
> >
> > I think, adding more runtime check in the enqueue() and dequeue() will
> > have a bad effect on the low-end cores too.
> > But I agree with the problem statement that in the virtualization use
> > case, It may be possible to have N virtual cores runs on a physical
> > core.
> >
> > IMO, The best solution would be keeping the ring API same and have a
> > different flavor in "compile-time". Something like liburcu did for
> > accommodating different flavors.
> >
> > i.e urcu-qsbr.h and urcu-bp.h will identical definition of API. The
> > application can simply include ONE header file in a C file based on
> > the flavor.
> > If need both at runtime. Need to have function pointer or so in the
> > application and define the function in different c file by including
> > the approaite flavor in C file.
> 
> This would also be a good time to consider the tradeoffs of the heavy use of
> inlining that is done in rte_ring vs the impact that has on API/ABI stability.
> 
I was working on a few requirements in the rte_ring library for the RCU defer APIs. The RFC
is at https://patchwork.dpdk.org/cover/66020/.

[dpdk-dev] [PATCH v4] devtools: add new SPDX license compliance checker

2020-02-24 Thread Stephen Hemminger
Simple script to look for drivers and scripts that
are missing the required SPDX header.

Update the contribution guidelines to indicate that the SPDX license
identifier is required for this project.

Signed-off-by: Stephen Hemminger 
---
v4 - add MAINTAINERS entry
 update coding style document
 change name of script

 MAINTAINERS  |  1 +
 devtools/check-spdx-tag.sh   | 77 
 doc/guides/contributing/coding_style.rst |  9 ++-
 3 files changed, 85 insertions(+), 2 deletions(-)
 create mode 100755 devtools/check-spdx-tag.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 3d5e8d1104b2..6b0e042c5fbb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -96,6 +96,7 @@ F: devtools/check-maintainers.sh
 F: devtools/check-forbidden-tokens.awk
 F: devtools/check-git-log.sh
 F: devtools/check-includes.sh
+F: devtools/check-spdx-tag.sh
 F: devtools/check-symbol-maps.sh
 F: devtools/checkpatches.sh
 F: devtools/get-maintainer.sh
diff --git a/devtools/check-spdx-tag.sh b/devtools/check-spdx-tag.sh
new file mode 100755
index ..b1b8cdba4e4e
--- /dev/null
+++ b/devtools/check-spdx-tag.sh
@@ -0,0 +1,77 @@
+#! /bin/sh
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright (c) 2019 Microsoft Corporation
+#
+# Produce a list of files with incorrect license tags
+
+print_usage () {
+echo "usage: $(basename $0) [-q] [-v]"
+exit 1
+}
+
+check_spdx() {
+if  $verbose;  then
+   echo "Files without SPDX License"
+   echo "--"
+fi
+git grep -L SPDX-License-Identifier -- \
+   ':^.git*' ':^.ci/*' ':^.travis.yml' \
+   ':^README' ':^MAINTAINERS' ':^VERSION' ':^ABI_VERSION' \
+   ':^*/Kbuild' ':^*/README' \
+   ':^license/' ':^doc/' ':^config/' ':^buildtools/' \
+   ':^*.cocci' ':^*.abignore' \
+   ':^*.def' ':^*.map' ':^*.ini' ':^*.data' ':^*.cfg' ':^*.txt' \
+   > $tmpfile
+
+errors=0
+while read -r line
+do $quiet || echo $line
+   errors=$((errors + 1))
+done < $tmpfile
+}
+
+check_boilerplate() {
+if $verbose ; then
+   echo
+   echo "Files with redundant license text"
+   echo "-"
+fi
+
+git grep -l Redistribution -- \
+   ':^license/' ':^/devtools/check-spdx-tag.sh' |
+   while read line
+   do $quiet || echo $line
+  warnings=$((warnings + 1))
+   done
+
+warnings=0
+while read -r line
+do $quiet || echo $line
+   warnings=$((errors + 1))
+done < $tmpfile
+}
+
+quiet=false
+verbose=false
+
+while getopts qvh ARG ; do
+   case $ARG in
+   q ) quiet=true ;;
+   v ) verbose=true ;;
+   h ) print_usage ; exit 0 ;;
+   ? ) print_usage ; exit 1 ;;
+   esac
+done
+shift $(($OPTIND - 1))
+
+tmpfile=$(mktemp)
+trap 'rm -f -- "$tmpfile"' INT TERM HUP EXIT
+
+check_spdx
+$quiet || echo
+
+check_boilerplate
+
+$quiet || echo
+echo "total: $errors errors, $warnings warnings"
+exit $errors
diff --git a/doc/guides/contributing/coding_style.rst 
b/doc/guides/contributing/coding_style.rst
index 841ef6d5c829..04626667dc18 100644
--- a/doc/guides/contributing/coding_style.rst
+++ b/doc/guides/contributing/coding_style.rst
@@ -54,8 +54,13 @@ To document a public API, a doxygen-like format must be 
used: refer to :ref:`dox
 License Header
 ~~
 
-Each file should begin with a special comment containing the appropriate 
copyright and license for the file.
-Generally this is the BSD License, except for code for Linux Kernel modules.
+Each file must begin with a special comment containing the
+`Software Package Data Exchange (SPDX) License Identifier 
`_.
+
+Generally this is the BSD License, except for code granted special exceptions.
+The SPDX license identifier is sufficient; a file should not contain
+an additional text version of the license (boilerplate).
+
 After any copyright header, a blank line should be left before any other 
contents, e.g. include statements in a C file.
 
 C Preprocessor Directives
-- 
2.20.1



Re: [dpdk-dev] Permissions Required to Run DPDK/MLX5 as Non-Root User?

2020-02-24 Thread David Christensen




On 2/20/20 11:24 AM, David Christensen wrote:
Running DPDK 20.02-rc3 and attempting to use the MLX5 PMD as a non-root 
user.  When starting testpmd I'm receiving the following error on both 
Power and x86_64 platforms (full output further down):


   net_mlx5: probe of PCI device :01:00.0 aborted after encountering 
an error: Operation not supported


Ended up with two answers:

1) Manually clean out /dev/hugepages.  Often encountered root owned 
files left over from previous test runs, causing memory allocation 
failures when DPDK couldn't use the name it wanted (rtemap_*).


2) Needed to manually add special permissions to dpdk-testpmd (thanks 
Shahaf):
# setcap cap_net_raw,cap_net_admin,cap_ipc_lock+eip 
~/src/dpdk/build/app/dpdk-testpmd


Dave


[dpdk-dev] [PATCH] af_packet: allow configuring number of rings

2020-02-24 Thread Stephen Hemminger
The maximum number of rings in af_packet is hard coded as 16.
The user should be able to configure this as part of DPDK config.

Signed-off-by: Stephen Hemminger 
---
 config/common_base| 1 +
 drivers/net/af_packet/rte_eth_af_packet.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/config/common_base b/config/common_base
index 6ea9c63cc36b..dd154a474b57 100644
--- a/config/common_base
+++ b/config/common_base
@@ -468,6 +468,7 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n
 # Compile software PMD backed by AF_PACKET sockets (Linux only)
 #
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n
+CONFIG_RTE_PMD_AF_PACKET_MAX_RINGS=16
 
 #
 # Compile software PMD backed by AF_XDP sockets (Linux only)
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
b/drivers/net/af_packet/rte_eth_af_packet.c
index f5806bf42c46..6e90d231ca4f 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -37,7 +38,9 @@
 #define DFLT_FRAME_SIZE(1 << 11)
 #define DFLT_FRAME_COUNT   (1 << 9)
 
+#ifndef RTE_PMD_AF_PACKET_MAX_RINGS
 #define RTE_PMD_AF_PACKET_MAX_RINGS 16
+#endif
 
 struct pkt_rx_queue {
int sockfd;
-- 
2.20.1



Re: [dpdk-dev] [RFC 0/6] New sync modes for ring

2020-02-24 Thread Honnappa Nagarahalli


> 
> On Mon, Feb 24, 2020 at 10:29 PM Stephen Hemminger
>  wrote:
> >
> > On Mon, 24 Feb 2020 11:35:09 +
> > Konstantin Ananyev  wrote:
> >
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > properly, etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below) on
> > > their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK based apps within
> > > overcommitted systems (multiple acttive threads over same pysical cores):
> > > VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption with
> rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.).
> > > As now not only the lock-owner's but also the lock-waiters' scheduling order
> > > matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
These slides seem to indicate that the problems are mitigated through the 
Hypervisor configuration. Do we still need to address the issues?

> > > The problem with rte_ring is that while head acquisition is sort of
> > > unfair locking, waiting on the tail is very similar to a ticket-lock
> > > scheme - the tail has to be updated in a particular order.
> > > That makes the current rte_ring implementation perform really poorly in
> > > some overcommitted scenarios.
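To make the ticket-lock comparison concrete, here is a simplified, illustrative
sketch of the kind of multi-producer tail update being described (not the actual
rte_ring code; the function and parameter names are placeholders):

#include <stdint.h>
#include <rte_pause.h>

/* Illustrative only. Each producer has reserved the slots
 * [old_head, new_head) and must wait until all earlier producers have
 * published their slots before it may move the shared tail. */
static inline void
example_update_tail(uint32_t *tail, uint32_t old_head, uint32_t new_head)
{
	/* If the thread owning the preceding update is preempted
	 * (lock-holder preemption), every later producer spins here. */
	while (__atomic_load_n(tail, __ATOMIC_RELAXED) != old_head)
		rte_pause();

	__atomic_store_n(tail, new_head, __ATOMIC_RELEASE);
}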
> >
> > Rather than reform rte_ring to fit this scenario, it would make more
> > sense to me to introduce another primitive. The current lockless ring
> > performs very well for the isolated thread model that DPDK was built
> > around. This looks like a case of customers violating the usage model
> > of the DPDK and then being surprised at the fallout.
> 
> I agree with Stephen here.
> 
> I think, adding more runtime check in the enqueue() and dequeue() will have a
> bad effect on the low-end cores too.
> But I agree with the problem statement that in the virtualization use case, It
> may be possible to have N virtual cores runs on a physical core.
It is hard to imagine that there are data plane applications deployed in such 
environments. Wouldn't this affect the performance terribly?

> 
> IMO, the best solution would be keeping the ring API the same and having a
> different flavor chosen at compile time, something like what liburcu did for
> accommodating different flavors.
> 
> i.e. urcu-qsbr.h and urcu-bp.h have identical definitions of the API. The
> application can simply include ONE header file in a C file based on the flavor.
> If both are needed at runtime, the application needs a function pointer (or
> similar) and defines the function in a different C file by including the
> appropriate flavor in that C file.
> 
> #include <urcu-qsbr.h>  /* QSBR RCU flavor */
> #include <urcu-bp.h>    /* Bulletproof RCU flavor */
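As a rough illustration of that suggestion (the alternative flavor header below
is a hypothetical name, not an existing DPDK header), each C file would pick one
flavor at compile time while the call sites stay unchanged:

/* Sketch of the compile-time flavor idea: a C file includes exactly one
 * flavor header and the ring calls below do not change.
 * "rte_ring_rts.h" is a hypothetical name used only for illustration. */
#ifdef EXAMPLE_USE_RTS_FLAVOR
#include "rte_ring_rts.h"	/* hypothetical relaxed-tail-sync flavor */
#else
#include <rte_ring.h>		/* existing MP/MC ring */
#endif

static int
example_requeue_one(struct rte_ring *r)
{
	void *obj;

	if (rte_ring_dequeue(r, &obj) != 0)
		return -1;
	return rte_ring_enqueue(r, obj);
}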


Re: [dpdk-dev] Windows Draft Build

2020-02-24 Thread Dmitry Kozlyuk
> Oh boy! If DriverEntry() is not being called, the loader is finding something 
> wrong and is unable to load the driver.
> 
> @Harini: Can Jeffrey (@MS) help here?

By trial and error I determined that changing device class from "Net" to a
custom one fixes the issue (see attached patch), root cause is still unknown.

From this point, running virtio PMD is blocked by lack of multi-BAR support
in NETUIO, which is a separate concern, outside of scope of this thread.

-- 
Dmitry Kozlyuk


Re: [dpdk-dev] [RFC PATCH 0/5] graph: introduce graph subsystem

2020-02-24 Thread Honnappa Nagarahalli


> 
> From: Jerin Jacob 
> 
> This RFC is targeted for v20.05 release.
> 
> This RFC patch includes an implementation of graph architecture for packet
> processing using DPDK primitives.
> 
> Using graph traversal for packet processing is a proven architecture that has
> been implemented in various open source libraries.
> 
> Graph architecture for packet processing enables abstracting the data
> processing functions as “nodes” and “linking” them together to create a complex
> “graph” to create reusable/modular data processing functions.
> 
> The RFC patch further includes performance enhancements and modularity to
> the DPDK as discussed in more detail below.
> 
> What this RFC patch contains:
> -
> 1) The API definition to "create" nodes and "link" together to create a 
> "graph"
> for packet processing. See, lib/librte_graph/rte_graph.h
> 
> 2) The Fast path API definition for the graph walker and enqueue function
> used by the workers. See, lib/librte_graph/rte_graph_worker.h
> 
> 3) Optimized SW implementation for (1) and (2). See, lib/librte_graph/
> 
> 4) Test case to verify the graph infrastructure functionality See,
> app/test/test_graph.c
> 
> 5) Performance test cases to evaluate the cost of graph walker and nodes
> enqueue fast-path function for various combinations.
> 
> See app/test/test_graph_perf.c
> 
> 6) Packet processing nodes(Null, Rx, Tx, Pkt drop, IPV4 rewrite, IPv4 lookup)
> using graph infrastructure. See lib/librte_node/*
> 
> 7) An example application to showcase l3fwd (functionality same as existing
> examples/l3fwd) using graph infrastructure and use packets processing nodes
> (item (6)). See examples/l3fwd-graph/.
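For readers new to the model, here is a rough sketch of what items (1) and (2)
above describe; every name is an illustrative placeholder, not the actual
rte_graph API proposed in this RFC (see lib/librte_graph/ for the real
definitions):

#include <stdbool.h>
#include <stdint.h>

struct example_graph;					/* opaque graph handle */
void example_graph_walk(struct example_graph *graph);	/* fast-path walker */

/* A node is a named processing function plus the names of its next nodes
 * (its outgoing edges), e.g. "ethdev_rx" -> "ip4_lookup" -> "ip4_rewrite". */
struct example_node {
	const char *name;
	uint16_t (*process)(void **objs, uint16_t nb_objs);
	const char *next_nodes[4];
};

static volatile bool example_quit;

/* Worker loop: repeatedly walk the graph so that each node consumes its
 * pending objects and enqueues them to the next node. */
static void
example_worker(struct example_graph *graph)
{
	while (!example_quit)
		example_graph_walk(graph);
}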
> 
> Performance
> ---
> 1) Graph walk and node enqueue overhead can be tested with performance
> test case application [1]. # If all packets go from a node to another node (we
> call it a "homerun"), then it is just a pointer swap for a burst of
> packets.
> # In the worst case, a handful of cycles to move an object from a node
> to another node.
> 
> 2) Performance comparison with existing l3fwd (The complete static code with
> out any nodes) vs modular l3fwd-graph with 5 nodes (ip4_lookup, ip4_rewrite,
> ethdev_tx, ethdev_rx, pkt_drop).
> Here is graphical representation of the l3fwd-graph as Graphviz dot file:
> http://bit.ly/39UPPGm
> 
> # l3fwd-graph performance is -2.5% wrt static l3fwd.
> 
> # We have simulated a similar test with the existing librte_pipeline application
> [4].
> The ip_pipeline application is -48.62% wrt static l3fwd.
> 
> The above results are on octeontx2. It may vary on other platforms.
> The platforms with higher L1 and L2 caches will have further better
> performance.
> 
> Tested architectures:
> 
> 1) AArch64
> 2) X86
> 
> 
> Graph library Features
> --
> 1) Nodes as plugins
> 2) Support for out of tree nodes
> 3) Multi-process support.
> 4) Low overhead graph walk and node enqueue
> 5) Low overhead statistics collection infrastructure
> 6) Support to export the graph as a Graphviz dot file.
> See rte_graph_export()
> Example of exported graph: http://bit.ly/2PqbqOy
> 7) Allow having another graph walk implementation in the future by
> segregating the fast path and slow path code.
> 
> 
> Advantages of Graph architecture:
> -
> 
> 1) Memory latency is the enemy of high-speed packet processing; moving
> similar packet processing code into a node will reduce I-cache and D-cache
> misses.
> 2) Exploits the probability that most packets will follow the same nodes in 
> the
> graph.
> 3) Allow SIMD instructions for packet processing of the node.
> 4) The modular scheme allows having reusable nodes for the consumers.
> 5) The modular scheme allows us to abstract the vendor HW specific
> optimizations as a node.
> 
> 
> What is different than existing libpipeline library
> ---
> At a very high level, libpipeline was created to allow a modular plugin interface.
> Based on our analysis the performance is better in the graph model.
> Check the details under the Performance section, Item (2).
> 
> This rte_graph implementation has taken care of fixing some of the
> architecture/implementations limitations with libpipeline.
> 
> 1) Use cases like IP fragmentation, TCP ACK processing (with new TCP data
> sent out in the same context) have a problem as rte_pipeline_run() passes just
> pkt_mask of 64 bits to different tables and packet pointers are stored in the
> single array in struct rte_pipeline_run.
> 
> In Graph architecture, The node has complete control of how many packets
> are output to next node seamlessly.
> 
> 2) Since pktmask is passed to different tables, it takes multiple for loops to
> extract pkts out of fragmented pkts_mask. This makes it difficult to prefetch
> ahead a set of packets. This issue does not exist in Graph architecture.
> 
> 3) Every table have two/three funct

Re: [dpdk-dev] [PATCH v3] regexdev: introduce regexdev subsystem

2020-02-24 Thread Jerin Jacob
> > 4) app/test/test_regexdev.c like app/test/test_eventdev.c
>
> We started to create a super basic app, after the API will be finalized and 
> we will have HW
> we can push it. (if you need it faster than feel free)

A simple unit test case needs to be present for the APIs. In the
course of developing the common code,
it can be developed to test the common code with a dummy/skeleton driver.

>
> > 5) Need a maintainer for maintaining the regex subsystem
> >
> We wish to maintain it if you agree.

Yes. Please.

> > >
> > > One more thing, regarding the ops structure, I think it is better to 
> > > split it to 2
> > different
> > > structures one enque and one for dequeue, since there are no real shared
> > data and we will
> > > be able to save memory, what do you think?
> >
> > Ops are allocated from mempool so it will be overhead to manage both.
> > moreover, some
> > of the fields added in req can be used for resp as info. cryptodev
> > follows the similar concept,
> > I think, we can have symmetry with cryptodev wherever is possible to avoid
> > end-user to learn new API models.
>
> True that there will be overhead with 2 mempools (small one)
> but let's assume 255 results. This means that the buffer should be 255 *
> sizeof(rte_regex_match) = 2K.
> Also, this will enable us to replace groupX with group[], which will allow even
> more groups.
> In addition, I don't think that crypto is a good example.
> The main difference is that in RegEx the output is a different format than the
> input.

# IMO, some of the fields may be useful for a response as well. I
think the application may be interested in the following
req fields in the response.
a) buf_addr
b) scan_size
c) user_id (This would be main one)

# Having two mempools adds overhead per lcore L1 cache usage and extra
complexity to the application.

# IMO, From a performance perspective, one mempool is good due to less
stress on the cache and it is costly to
add new mempool for HW mempool implementations.

# I think, group[] use case we can add it when it required by
introducing "matches_start_offset" field, which will
tell the req, where is the end of group[] and where "matches" start
with single mempool scheme also.

# I think one of the other use cases for "matches_start_offset" is that
it may be possible to put vendor-specific
opaque data there. It will be filled by the driver on response. The application
can reference the matches as

struct rte_regex_match *matches = RTE_PTR_ADD(ops, ops->matches_start_offset);
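To visualize the single-mempool layout being argued for here, a purely
illustrative sketch; the struct and field names are assumptions, not the
rte_regexdev API under discussion:

#include <stdint.h>

/* Placeholder match record, defined only to keep the sketch self-contained. */
struct example_regex_match { uint32_t group_id; uint16_t offset; uint16_t len; };

/* One op serves as both request and response: fields such as user_id are
 * echoed back, and the match array lives in the same buffer at a
 * driver-reported offset, so no second mempool is needed. */
struct example_regex_op {
	uint64_t user_id;		/* request field, echoed in the response */
	uint16_t nb_matches;		/* filled by the driver */
	uint16_t matches_start_offset;	/* byte offset from the start of the op */
	/* ... remaining request fields, vendor-specific opaque data ... */
};

static inline struct example_regex_match *
example_op_matches(struct example_regex_op *op)
{
	return (struct example_regex_match *)((uint8_t *)op +
					      op->matches_start_offset);
}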

>
> > I assume you will send the v4 with these comments. I think, with v4 we
> > can start implementing common library code.
>
> Just need to agree on the split (one more iteration )
> and I will start working on the common code.

Ack.


Re: [dpdk-dev] [PATCH v2 0/7] vfio/pci: SR-IOV support

2020-02-24 Thread Jason Wang



On 2020/2/25 10:33 AM, Tian, Kevin wrote:

From: Alex Williamson
Sent: Thursday, February 20, 2020 2:54 AM

Changes since v1 are primarily to patch 3/7 where the commit log is
rewritten, along with option parsing and failure logging based on
upstream discussions.  The primary user visible difference is that
option parsing is now much more strict.  If a vf_token option is
provided that cannot be used, we generate an error.  As a result of
this, opening a PF with a vf_token option will serve as a mechanism of
setting the vf_token.  This seems like a more user friendly API than
the alternative of sometimes requiring the option (VFs in use) and
sometimes rejecting it, and upholds our desire that the option is
always either used or rejected.

This also means that the VFIO_DEVICE_FEATURE ioctl is not the only
means of setting the VF token, which might call into question whether
we absolutely need this new ioctl.  Currently I'm keeping it because I
can imagine use cases, for example if a hypervisor were to support
SR-IOV, the PF device might be opened without consideration for a VF
token and we'd require the hypservisor to close and re-open the PF in
order to set a known VF token, which is impractical.

Series overview (same as provided with v1):

Thanks for doing this!


The synopsis of this series is that we have an ongoing desire to drive
PCIe SR-IOV PFs from userspace with VFIO.  There's an immediate need
for this with DPDK drivers and potentially interesting future use

Can you provide a link to the DPDK discussion?


cases in virtualization.  We've been reluctant to add this support
previously due to the dependency and trust relationship between the
VF device and PF driver.  Minimally the PF driver can induce a denial
of service to the VF, but depending on the specific implementation,
the PF driver might also be responsible for moving data between VFs
or have direct access to the state of the VF, including data or state
otherwise private to the VF or VF driver.

Just thinking out loud. While the motivation of the VF token sounds reasonable
to me, I'm curious why the same concern is not raised in other usages.
For example, there is no such design in the virtio framework, where the
virtio device could also be restarted, put in a separate process (vhost-user),
or even in a separate VM (virtio-vhost-user), etc.



AFAIK, the restart could only be triggered by either VM or qemu. But 
yes, the datapath could be offloaded.


But I'm not sure introducing another dedicated mechanism is better than 
using the existing generic POSIX mechanism to make sure the connection 
(AF_UNIX) is secure.




  Of course the para-
virtualized attribute of virtio implies some degree of trust, but as you
mentioned many SR-IOV implementations support VF->PF communication
which also implies some level of trust. It's perfectly fine if VFIO just tries
to do better than other sub-systems, but knowing how other people
tackle similar problems may make the whole picture clearer. 😊

+Jason.



I'm not quite sure that, e.g., allowing a userspace PF driver with a kernel VF 
driver would not break the assumptions of the kernel security model. At least 
we should forbid an unprivileged PF driver running in userspace.


Thanks



Re: [dpdk-dev] [RFC PATCH 0/5] graph: introduce graph subsystem

2020-02-24 Thread Jerin Jacob
On Tue, Feb 25, 2020 at 10:53 AM Honnappa Nagarahalli
 wrote:

> > 2) Based on our experience, NPU HW accelerators are so different from one
> > vendor to another. Going forward, we believe API abstraction may
> > not be enough to abstract the differences in HW. Vendor-specific nodes can
> > abstract the HW differences and reuse the generic nodes as needed.
> > This would help both the silicon vendors and DPDK end users.
> If you are proposing this as a new way to provide HW abstractions, then we 
> will be restricting the application programming model to follow the graph 
> subsystem. IMO, the HW abstractions should be available irrespective of the 
> programming model.
> Graph model of packet processing might not be applicable for all use cases.

No, I am not proposing this as the new way to provide HW abstraction
in DPDK. API-based HW abstraction will continue as it was done
earlier.


[dpdk-dev] speed devarg for virtio driver

2020-02-24 Thread Ivan Dyukov


[PATCH v4 1/4] net/virtio: refactor devargs parsing
[PATCH v4 2/4] net/virtio: add link speed devarg
[PATCH v4 3/4] net/virtio-user: fix devargs parsing
[PATCH v4 4/4] net/virtio-user: adding link speed devarg
---
v4 changes:
* link_speed renamed to speed devarg
* speed devarg is added to virtio-user driver

v3 changes:
* link_speed devarg is added to virtio documentation



[dpdk-dev] [PATCH v4 4/4] net/virtio-user: adding link speed devarg

2020-02-24 Thread Ivan Dyukov
The virtio driver already parses the speed devarg. virtio-user should add
it to the list of valid devargs and call the eth_virtio_dev_init function,
which initializes the speed value.

eth_virtio_dev_init is already called from the virtio_user_pmd_probe
function. The only change required to enable the speed devarg is
adding "speed" to the list of valid devargs.

Signed-off-by: Ivan Dyukov 
---
 doc/guides/nics/virtio.rst  | 8 
 drivers/net/virtio/virtio_user_ethdev.c | 5 -
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index 0341907ef..6286286db 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -410,6 +410,14 @@ Below devargs are supported by the virtio-user vdev:
 It is used to enable virtio device packed virtqueue feature.
 (Default: 0 (disabled))
 
+#.  ``speed``:
+
+It is used to specify the link speed of the virtio device. Link speed is a
+part of the link status structure. It can be requested by the application
+using the rte_eth_link_get_nowait function.
+(Default: 10000 (10G))
+
+
 Virtio paths Selection and Usage
 
 
diff --git a/drivers/net/virtio/virtio_user_ethdev.c 
b/drivers/net/virtio/virtio_user_ethdev.c
index 074527714..45c1541c5 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -406,6 +406,8 @@ static const char *valid_args[] = {
VIRTIO_USER_ARG_IN_ORDER,
 #define VIRTIO_USER_ARG_PACKED_VQ  "packed_vq"
VIRTIO_USER_ARG_PACKED_VQ,
+#define VIRTIO_USER_ARG_SPEED  "speed"
+   VIRTIO_USER_ARG_SPEED,
NULL
 };
 
@@ -738,4 +740,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user,
"server=<0|1> "
"mrg_rxbuf=<0|1> "
"in_order=<0|1> "
-   "packed_vq=<0|1>");
+   "packed_vq=<0|1> "
+   "speed=");
-- 
2.17.1
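As a usage illustration (not part of the patch), an application could read the
speed configured by this devarg through the rte_eth_link_get_nowait() API
mentioned in the documentation hunk above:

#include <rte_ethdev.h>

/* Minimal sketch: query the link status of a port and return its speed.
 * With the "speed" devarg, virtio reports the configured value here
 * (default ETH_SPEED_NUM_10G). */
static uint32_t
example_link_speed(uint16_t port_id)
{
	struct rte_eth_link link;

	rte_eth_link_get_nowait(port_id, &link);
	return link.link_speed;	/* e.g. ETH_SPEED_NUM_25G */
}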



[dpdk-dev] [PATCH v4 2/4] net/virtio: add link speed devarg

2020-02-24 Thread Ivan Dyukov
Some applications like pktgen use link_speed to calculate the
transmit rate. This limits outgoing traffic to a hardcoded 10G.

This patch adds a "speed" devarg which allows configuring the
link speed of the virtio device.

Signed-off-by: Ivan Dyukov 
---
 doc/guides/nics/virtio.rst |  7 +++
 drivers/net/virtio/virtio_ethdev.c | 97 +-
 drivers/net/virtio/virtio_pci.h|  1 +
 3 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/virtio.rst b/doc/guides/nics/virtio.rst
index d1f5fb898..0341907ef 100644
--- a/doc/guides/nics/virtio.rst
+++ b/doc/guides/nics/virtio.rst
@@ -356,6 +356,13 @@ Below devargs are supported by the PCI virtio driver:
 a virtio device needs to work in vDPA mode.
 (Default: 0 (disabled))
 
+#.  ``speed``:
+
+It is used to specify the link speed of the virtio device. Link speed is a
+part of the link status structure. It can be requested by the application
+using the rte_eth_link_get_nowait function.
+(Default: 10000 (10G))
+
 Below devargs are supported by the virtio-user vdev:
 
 #.  ``path``:
diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index 22323d9a5..9707c8cbc 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -45,6 +45,10 @@ static int virtio_dev_promiscuous_enable(struct rte_eth_dev 
*dev);
 static int virtio_dev_promiscuous_disable(struct rte_eth_dev *dev);
 static int virtio_dev_allmulticast_enable(struct rte_eth_dev *dev);
 static int virtio_dev_allmulticast_disable(struct rte_eth_dev *dev);
+static uint32_t virtio_dev_speed_capa_get(uint32_t speed);
+static int virtio_dev_devargs_parse(struct rte_devargs *devargs,
+   int *vdpa,
+   uint32_t *speed);
 static int virtio_dev_info_get(struct rte_eth_dev *dev,
struct rte_eth_dev_info *dev_info);
 static int virtio_dev_link_update(struct rte_eth_dev *dev,
@@ -1861,6 +1865,7 @@ int
 eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 {
struct virtio_hw *hw = eth_dev->data->dev_private;
+   uint32_t speed = ETH_SPEED_NUM_10G;
int ret;
 
if (sizeof(struct virtio_net_hdr_mrg_rxbuf) > RTE_PKTMBUF_HEADROOM) {
@@ -1886,7 +1891,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
 
return 0;
}
-
+   ret = virtio_dev_devargs_parse(eth_dev->device->devargs,
+NULL, &speed);
+   if (ret < 0)
+   return ret;
+   hw->speed = speed;
/*
 * Pass the information to the rte_eth_dev_close() that it should also
 * release the private port resources.
@@ -1954,6 +1963,7 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
return 0;
 }
 
+
 static int vdpa_check_handler(__rte_unused const char *key,
const char *value, void *ret_val)
 {
@@ -1965,33 +1975,89 @@ static int vdpa_check_handler(__rte_unused const char 
*key,
return 0;
 }
 
+
+static uint32_t
+virtio_dev_speed_capa_get(uint32_t speed)
+{
+   switch (speed) {
+   case ETH_SPEED_NUM_10G:
+   return ETH_LINK_SPEED_10G;
+   case ETH_SPEED_NUM_20G:
+   return ETH_LINK_SPEED_20G;
+   case ETH_SPEED_NUM_25G:
+   return ETH_LINK_SPEED_25G;
+   case ETH_SPEED_NUM_40G:
+   return ETH_LINK_SPEED_40G;
+   case ETH_SPEED_NUM_50G:
+   return ETH_LINK_SPEED_50G;
+   case ETH_SPEED_NUM_56G:
+   return ETH_LINK_SPEED_56G;
+   case ETH_SPEED_NUM_100G:
+   return ETH_LINK_SPEED_100G;
+   default:
+   return 0;
+   }
+}
+
+
+#define VIRTIO_ARG_SPEED  "speed"
+#define VIRTIO_ARG_VDPA   "vdpa"
+
+
+static int link_speed_handler(const char *key __rte_unused,
+   const char *value, void *ret_val)
+{
+   uint32_t val;
+   if (!value || !ret_val)
+   return -EINVAL;
+   val = strtoul(value, NULL, 0);
+   /* validate input */
+   if (virtio_dev_speed_capa_get(val) == 0)
+   return -EINVAL;
+   *(uint32_t *)ret_val = val;
+
+   return 0;
+}
+
+
 static int
-virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa)
+virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa,
+   uint32_t *speed)
 {
struct rte_kvargs *kvlist;
-   const char *key = "vdpa";
int ret = 0;
 
if (devargs == NULL)
return 0;
 
kvlist = rte_kvargs_parse(devargs->args, NULL);
-   if (kvlist == NULL)
+   if (kvlist == NULL) {
+   PMD_INIT_LOG(ERR, "error when parsing param");
return 0;
-
-   if (!rte_kvargs_count(kvlist, key))
-   goto exit;
-
-   if (vdpa) {
+   }
+   if (vdpa && rte_kvargs_count(kvlist, VIRTIO_ARG_VDPA) == 1) {
/* vdpa mode selected when there's a key-value pair:
 * vdpa=1
 */
-   ret = rte_kvargs_process(kvlist, key,
+

[dpdk-dev] [PATCH v4 3/4] net/virtio-user: fix devargs parsing

2020-02-24 Thread Ivan Dyukov
strtoull returns 0 if it fails to parse the input string, and this
failure is ignored in get_integer_arg.

This patch handles the error cases of the strtoull function.

Signed-off-by: Ivan Dyukov 
---
 drivers/net/virtio/virtio_user_ethdev.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio/virtio_user_ethdev.c 
b/drivers/net/virtio/virtio_user_ethdev.c
index 3fc172573..074527714 100644
--- a/drivers/net/virtio/virtio_user_ethdev.c
+++ b/drivers/net/virtio/virtio_user_ethdev.c
@@ -433,12 +433,17 @@ static int
 get_integer_arg(const char *key __rte_unused,
const char *value, void *extra_args)
 {
+   uint64_t integer = 0;
if (!value || !extra_args)
return -EINVAL;
-
-   *(uint64_t *)extra_args = strtoull(value, NULL, 0);
-
-   return 0;
+   errno = 0;
+   integer = strtoull(value, NULL, 0);
+   /* extra_args keeps default value, it should be replaced
+* only in case of successful parsing of the 'value' arg
+*/
+   if (errno == 0)
+   *(uint64_t *)extra_args = integer;
+   return -errno;
 }
 
 static struct rte_eth_dev *
-- 
2.17.1



[dpdk-dev] [PATCH v4 1/4] net/virtio: refactor devargs parsing

2020-02-24 Thread Ivan Dyukov
Refactor the vdpa-specific devargs parsing into a more generic function.

Signed-off-by: Ivan Dyukov 
---
 drivers/net/virtio/virtio_ethdev.c | 35 +-
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index 044eb10a7..22323d9a5 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1955,16 +1955,18 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 }
 
 static int vdpa_check_handler(__rte_unused const char *key,
-   const char *value, __rte_unused void *opaque)
+   const char *value, void *ret_val)
 {
-   if (strcmp(value, "1"))
-   return -1;
+   if (strcmp(value, "1") == 0)
+   *(int *)ret_val = 1;
+   else
+   *(int *)ret_val = 0;
 
return 0;
 }
 
 static int
-vdpa_mode_selected(struct rte_devargs *devargs)
+virtio_dev_devargs_parse(struct rte_devargs *devargs, int *vdpa)
 {
struct rte_kvargs *kvlist;
const char *key = "vdpa";
@@ -1980,12 +1982,16 @@ vdpa_mode_selected(struct rte_devargs *devargs)
if (!rte_kvargs_count(kvlist, key))
goto exit;
 
-   /* vdpa mode selected when there's a key-value pair: vdpa=1 */
-   if (rte_kvargs_process(kvlist, key,
-   vdpa_check_handler, NULL) < 0) {
-   goto exit;
+   if (vdpa) {
+   /* vdpa mode selected when there's a key-value pair:
+* vdpa=1
+*/
+   ret = rte_kvargs_process(kvlist, key,
+   vdpa_check_handler, vdpa);
+   if (ret  < 0)
+   goto exit;
}
-   ret = 1;
+
 
 exit:
rte_kvargs_free(kvlist);
@@ -1995,8 +2001,17 @@ vdpa_mode_selected(struct rte_devargs *devargs)
 static int eth_virtio_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
struct rte_pci_device *pci_dev)
 {
+   int vdpa = 0;
+   int ret = 0;
+
+   ret = virtio_dev_devargs_parse(pci_dev->device.devargs, &vdpa);
+   if (ret < 0) {
+   PMD_DRV_LOG(ERR,
+   "devargs parsing is failed");
+   return ret;
+   }
/* virtio pmd skips probe if device needs to work in vdpa mode */
-   if (vdpa_mode_selected(pci_dev->device.devargs))
+   if (vdpa == 1)
return 1;
 
return rte_eth_dev_pci_generic_probe(pci_dev, sizeof(struct virtio_hw),
-- 
2.17.1



[dpdk-dev] [PATCH v1] ci: add test suite run without hugepage

2020-02-24 Thread Ruifeng Wang
This test suite is derived from the fast-tests suite. Cases in this
suite are run with the '--no-huge' flag.

The suite aims to cover as many test cases as possible from the
fast-tests suite in environments without hugepage support,
such as containers.

Signed-off-by: Ruifeng Wang 
Reviewed-by: Gavin Hu 
---
 .ci/linux-build.sh   |  4 +++
 .travis.yml  | 12 +++
 app/test/meson.build | 75 
 3 files changed, 91 insertions(+)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index d500c4c00..39515d915 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -92,3 +92,7 @@ fi
 if [ "$RUN_TESTS" = "1" ]; then
 sudo meson test -C build --suite fast-tests -t 3
 fi
+
+if [ "$RUN_TESTS_NO_HUGE" = "1" ]; then
+sudo meson test -C build --suite nohuge-tests -t 3
+fi
diff --git a/.travis.yml b/.travis.yml
index b64a81bd0..0e07d52d0 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -43,6 +43,9 @@ jobs:
   - env: DEF_LIB="shared" RUN_TESTS=1
 arch: amd64
 compiler: gcc
+  - env: DEF_LIB="shared" RUN_TESTS_NO_HUGE=1
+arch: amd64
+compiler: gcc
   - env: DEF_LIB="shared" BUILD_DOCS=1
 arch: amd64
 compiler: gcc
@@ -66,6 +69,9 @@ jobs:
   - env: DEF_LIB="shared" RUN_TESTS=1
 arch: amd64
 compiler: clang
+  - env: DEF_LIB="shared" RUN_TESTS_NO_HUGE=1
+arch: amd64
+compiler: clang
   - env: DEF_LIB="shared" BUILD_DOCS=1
 arch: amd64
 compiler: clang
@@ -101,6 +107,9 @@ jobs:
   - env: DEF_LIB="static"
 arch: arm64
 compiler: gcc
+  - env: DEF_LIB="shared" RUN_TESTS_NO_HUGE=1
+arch: arm64
+compiler: gcc
   - env: DEF_LIB="shared" BUILD_DOCS=1
 arch: arm64
 compiler: gcc
@@ -124,3 +133,6 @@ jobs:
   - env: DEF_LIB="shared"
 arch: arm64
 compiler: clang
+  - env: DEF_LIB="shared" RUN_TESTS_NO_HUGE=1
+arch: arm64
+compiler: clang
diff --git a/app/test/meson.build b/app/test/meson.build
index 0a2ce710f..43a23d351 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -237,6 +237,64 @@ fast_test_names = [
 'thash_autotest',
 ]
 
+nohuge_test_names = [
+'alarm_autotest',
+'byteorder_autotest',
+'cmdline_autotest',
+'common_autotest',
+'cpuflags_autotest',
+'cycles_autotest',
+'debug_autotest',
+'eal_flags_n_opt_autotest',
+'eal_flags_no_huge_autotest',
+'eal_flags_vdev_opt_autotest',
+'eal_fs_autotest',
+'errno_autotest',
+'event_ring_autotest',
+'fib_autotest',
+'fib6_autotest',
+'interrupt_autotest',
+'logs_autotest',
+'lpm_autotest',
+'lpm6_autotest',
+'mbuf_autotest',
+'memcpy_autotest',
+'memory_autotest',
+'meter_autotest',
+'multiprocess_autotest',
+'per_lcore_autotest',
+'prefetch_autotest',
+'rcu_qsbr_autotest',
+'red_autotest',
+'rib_autotest',
+'rib6_autotest',
+'ring_autotest',
+'rwlock_rda_autotest',
+'rwlock_rds_wrm_autotest',
+'rwlock_rde_wro_autotest',
+'sched_autotest',
+'spinlock_autotest',
+'string_autotest',
+'tailq_autotest',
+'user_delay_us',
+'version_autotest',
+'crc_autotest',
+'delay_us_sleep_autotest',
+'eventdev_common_autotest',
+'fbarray_autotest',
+'ipsec_autotest',
+'kni_autotest',
+'kvargs_autotest',
+'member_autotest',
+'metrics_autotest',
+'power_cpufreq_autotest',
+'power_autotest',
+'power_kvm_vm_autotest',
+'reorder_autotest',
+'service_autotest',
+'thash_autotest',
+]
+
 perf_test_names = [
 'ring_perf_autotest',
 'mempool_perf_autotest',
@@ -341,6 +399,10 @@ if dpdk_conf.has('RTE_LIBRTE_RING_PMD')
fast_test_names += 'latencystats_autotest'
driver_test_names += 'link_bonding_mode4_autotest'
fast_test_names += 'pdump_autotest'
+   nohuge_test_names += 'ring_pmd_autotest'
+   nohuge_test_names += 'bitratestats_autotest'
+   nohuge_test_names += 'latencystats_autotest'
+   nohuge_test_names += 'pdump_autotest'
 endif
 
 if dpdk_conf.has('RTE_LIBRTE_POWER')
@@ -430,6 +492,19 @@ foreach arg : fast_test_names
endif
 endforeach
 
+foreach arg : nohuge_test_names
+   if host_machine.system() == 'linux'
+   test(arg, dpdk_test,
+ env : ['DPDK_TEST=' + arg],
+ args : test_args +
+['--no-huge'] + ['-m 1024'] +
+['--file-prefix=@0@'.format(arg)],
+   timeout : timeout_seconds_fast,
+   is_parallel : false,
+   suite : 'nohuge-tests')
+   endif
+endforeach
+
 foreach arg : perf_test_names
test(arg, dpdk_test,
env : ['DPDK_TEST=' + arg],
-- 
2.17.1



Re: [dpdk-dev] [PATCH v3] regexdev: introduce regexdev subsystem

2020-02-24 Thread Ori Kam


> -Original Message-
> From: Jerin Jacob 
> Sent: Tuesday, February 25, 2020 7:57 AM
> To: Ori Kam 
> Cc: Jerin Jacob ; xiang.w.w...@intel.com; dpdk-dev
> ; Pavan Nikhilesh ; Shahaf
> Shuler ; Hemant Agrawal
> ; Opher Reviv ; Alex
> Rosenbaum ; dov...@marvell.com; Prasun Kapoor
> ; Nipun Gupta ; Richardson,
> Bruce ; yang.a.h...@intel.com;
> harry.ch...@intel.com; gu.ji...@zte.com.cn; shanjia...@chinatelecom.cn;
> zhangy@chinatelecom.cn; lixin...@huachentel.com; wush...@inspur.com;
> yuying...@yxlink.com; fanchengg...@sunyainfo.com;
> davidf...@tencent.com; liuzho...@chinaunicom.cn;
> zhaoyon...@huawei.com; o...@yunify.com; j...@netgate.com;
> hongjun...@intel.com; j.bromh...@titan-ic.com; d...@ntop.org;
> f...@napatech.com; arthur...@lionic.com; Thomas Monjalon
> 
> Subject: Re: [dpdk-dev] [PATCH v3] regexdev: introduce regexdev subsystem
> 
> > > 4) app/test/test_regexdev.c like app/test/test_eventdev.c
> >
> > We started to create a super basic app, after the API will be finalized and 
> > we
> will have HW
> > we can push it. (if you need it faster than feel free)
> 
> A simple Unit test case needs to be present for the APIs. On the
> course of developing common code,
> it can be developed to test the common code with dummy/skeleton driver.
> 

Agree this is what we are currently have.

> >
> > > 5) Need a maintainer for maintaining the regex subsystem
> > >
> > We wish to maintain it if you agree.
> 
> Yes. Please.
> 

Great.

> > > >
> > > > One more thing, regarding the ops structure, I think it is better to 
> > > > split it
> to 2
> > > different
> > > > structures one enque and one for dequeue, since there are no real shared
> > > data and we will
> > > > be able to save memory, what do you think?
> > >
> > > Ops are allocated from mempool so it will be overhead to manage both.
> > > moreover, some
> > > of the fields added in req can be used for resp as info. cryptodev
> > > follows the similar concept,
> > > I think, we can have symmetry with cryptodev wherever is possible to avoid
> > > end-user to learn new API models.
> >
> > True that there will be overhead with 2 mempools (small one)
> > but let's assume 255 results. This means that the buffer should be 255 *
> > sizeof(rte_regex_match) = 2K.
> > Also, this will enable us to replace groupX with group[], which will allow
> > even more groups.
> > In addition, I don't think that crypto is a good example.
> > The main difference is that in RegEx the output is a different format than
> > the input.
> 
> # IMO, some of the fields may be useful for a response as well. I
> think the application may be interested in the following
> req fields in the response.
> a) buf_addr

I don't see how this can be used in the response, since when working with
out-of-order results
you don't know which result will be returned.
I also think it is error-prone to use the same op for the enqueue and dequeue.

> b) scan_size

Please see above.

> c) user_id (This would be main one)

Agree

> 
> # Having two mempools adds overhead per lcore L1 cache usage and extra
> complexity to the application.
> 
> # IMO, From a performance perspective, one mempool is good due to less
> stress on the cache and it is costly to
> add new mempool for HW mempool implementations.
> 
> # I think, group[] use case we can add it when it required by
> introducing "matches_start_offset" field, which will
> tell the req, where is the end of group[] and where "matches" start
> with single mempool scheme also.
> 
> # I think one of the other use cases for "matches_start_offset" is that
> it may be possible to put vendor-specific
> opaque data there. It will be filled by the driver on response. The application
> can reference the matches as
> 
> struct rte_regex_match *matches = RTE_PTR_ADD(ops, ops->matches_start_offset);
> 

O.K., for now we will keep it as is, and we will see what comes in the future.

> >
> > > I assume you will send the v4 with these comments. I think, with v4 we
> > > can start implementing common library code.
> >
> > Just need to agree on the split (one more iteration )
> > and I will start working on the common code.
> 
> Ack.

Great,
I'm starting to work on V4 with all the comments so the RFC will be acked, and
then I will start coding the rest of the common code.