RE: DPDK community: RTE_FLOW support for P4-programmable devices

2023-08-27 Thread Ori Kam
Hi,
Since I was on vacation, can someone please send the meeting summary?

Thanks,
Ori

> -Original Message-
> From: Dumitrescu, Cristian 
> Sent: Wednesday, August 16, 2023 7:20 PM
> To: Morten Brørup ; Zhang, Qi Z
> ; Ori Kam ; Jerin Jacob
> 
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> david.march...@redhat.com; Richardson, Bruce
> ; jer...@marvell.com; ferruh.yi...@amd.com;
> techbo...@dpdk.org; Mcnamara, John ; Zhang,
> Helin ; dev@dpdk.org
> Subject: DPDK community: RTE_FLOW support for P4-programmable devices
> 
> Hi folks,
> 
> We just set up a community call for next week to discuss in more detail the
> proposal for RTE_FLOW extensions to support P4-programmable devices
> https://mails.dpdk.org/archives/dev/2023-August/273703.html and look for
> ways to converge and make progress.
> 
> All the people from To: and CC: are already invited. To avoid cluttering 
> people's
> calendars, I did not add dev@dpdk.org, so if anybody else wants to attend,
> please send me a private email and I will be happy to forward the invite.
> 
> Thanks,
> Cristian


[PATCH v2 1/2] ethdev: add new symmetric hash function

2023-08-27 Thread Xueming Li
The new symmetric hash function swaps the src/dst L3 addresses and
L4 ports automatically by sorting.

Signed-off-by: Xueming Li 
---
 lib/ethdev/rte_flow.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/lib/ethdev/rte_flow.h b/lib/ethdev/rte_flow.h
index 2ebb76dbc0..4f4421ca50 100644
--- a/lib/ethdev/rte_flow.h
+++ b/lib/ethdev/rte_flow.h
@@ -3196,6 +3196,13 @@ enum rte_eth_hash_function {
 * src or dst address will xor with zero pair.
 */
RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ,
+   /**
+* Symmetric Toeplitz: L3 and L4 fields are sorted prior to
+* the hash function.
+*  If src_ip > dst_ip, swap src_ip and dst_ip.
+*  If src_port > dst_port, swap src_port and dst_port.
+*/
+   RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ_SORT,
RTE_ETH_HASH_FUNCTION_MAX,
 };
 
-- 
2.25.1
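
For illustration of what the new sort-based function does: a minimal,
hypothetical sketch (not driver code, and not part of this patch) of the
tuple normalization described in the comment above. The src/dst fields are
ordered before hashing, so both directions of a flow produce the same tuple
and therefore the same Toeplitz hash.

#include <stdint.h>

/*
 * Order the 4-tuple as documented for
 * RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ_SORT, then feed it to the
 * regular Toeplitz hash. Illustrative helper only.
 */
static inline void
sort_symmetric_tuple(uint32_t *src_ip, uint32_t *dst_ip,
                     uint16_t *src_port, uint16_t *dst_port)
{
        if (*src_ip > *dst_ip) {
                uint32_t tmp = *src_ip;

                *src_ip = *dst_ip;
                *dst_ip = tmp;
        }
        if (*src_port > *dst_port) {
                uint16_t tmp = *src_port;

                *src_port = *dst_port;
                *dst_port = tmp;
        }
}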



[PATCH v2 2/2] net/mlx5: support new RSS symmetric hash function

2023-08-27 Thread Xueming Li
This patch supports the new RSS symmetric hash function:
RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ_SORT

The new hash function produces a symmetric hash result by swapping
the source and destination IP addresses and L4 ports automatically.

Signed-off-by: Xueming Li 
---
 drivers/net/mlx5/mlx5_flow.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 3a97975d69..e162baa7da 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -1456,8 +1456,8 @@ struct rte_flow_template_table {
 #define MLX5_RSS_HASH_NONE 0ULL
 
 #define MLX5_RSS_IS_SYMM(func) \
-   ((func) == RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ)
-
+   (((func) == RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ) || \
+((func) == RTE_ETH_HASH_FUNCTION_SYMMETRIC_TOEPLITZ_SORT))
 
 /* extract next protocol type from Ethernet & VLAN headers */
 #define MLX5_ETHER_TYPE_FROM_HEADER(_s, _m, _itm, _prt) do { \
-- 
2.25.1



[RFC] cache guard

2023-08-27 Thread Morten Brørup
+CC Honnappa and Konstantin, Ring lib maintainers
+CC Mattias, PRNG lib maintainer

> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> Sent: Friday, 25 August 2023 11.24
> 
> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > +CC mempool maintainers
> >
> > > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > > Sent: Friday, 25 August 2023 10.23
> > >
> > > On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > > > Bruce,
> > > >
> > > > With this patch [1], it is noted that the ring producer and
> consumer data
> > > should not be on adjacent cache lines, for performance reasons.
> > > >
> > > > [1]:
> > >
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> fd4b66
> > > e75485cc8b63b9aedfbdfe8b0
> > > >
> > > > (It's obvious that they cannot share the same cache line, because
> they are
> > > accessed by two different threads.)
> > > >
> > > > Intuitively, I would think that having them on different cache
> lines would
> > > suffice. Why does having an empty cache line between them make a
> difference?
> > > >
> > > > And does it need to be an empty cache line? Or does it suffice
> having the
> > > second structure start at two cache lines after the start of the
> first
> > > structure (e.g. if the size of the first structure is two cache
> lines)?
> > > >
> > > > I'm asking because the same principle might apply to other code
> too.
> > > >
> > > Hi Morten,
> > >
> > > this was something we discovered when working on the distributor
> library.
> > > If we have cachelines per core where there is heavy access, having
> some
> > > cachelines as a gap between the content cachelines can help
> performance. We
> > > believe this helps due to avoiding issues with the HW prefetchers
> (e.g.
> > > adjacent cacheline prefetcher) bringing in the second cacheline
> > > speculatively when an operation is done on the first line.
> >
> > I guessed that it had something to do with speculative prefetching,
> but wasn't sure. Good to get confirmation, and that it has a measurable
> effect somewhere. Very interesting!
> >
> > NB: More comments in the ring lib about stuff like this would be nice.
> >
> > So, for the mempool lib, what do you think about applying the same
> technique to the rte_mempool_debug_stats structure (which is an array
> indexed per lcore)... Two adjacent lcores heavily accessing their local
> mempool caches seems likely to me. But how heavy does the access need to
> be for this technique to be relevant?
> >
> 
> No idea how heavy the accesses need to be for this to have a noticeable
> effect. For things like debug stats, I wonder how worthwhile making such
> a
> change would be, but then again, any change would have very low impact
> too
> in that case.

I just tried adding padding to some of the hot structures in our own 
application, and observed a significant performance improvement for those.

So I think this technique should have higher visibility in DPDK by adding a new 
cache macro to rte_common.h:

/**
 * Empty cache line, to guard against speculative prefetching.
 *
 * Use as spacing between data accessed by different lcores,
 * to prevent cache thrashing on CPUs with speculative prefetching.
 */
#define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] 
__rte_cache_aligned;

To be used like this:

struct rte_ring {
char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
/**< Name of the ring. */
int flags;   /**< Flags supplied at creation. */
const struct rte_memzone *memzone;
/**< Memzone, if any, containing the rte_ring */
uint32_t size;   /**< Size of ring. */
uint32_t mask;   /**< Mask (size-1) of ring. */
uint32_t capacity;   /**< Usable size of ring */

-   char pad0 __rte_cache_aligned; /**< empty cache line */
+   RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */

/** Ring producer status. */
union {
struct rte_ring_headtail prod;
struct rte_ring_hts_headtail hts_prod;
struct rte_ring_rts_headtail rts_prod;
}  __rte_cache_aligned;

-   char pad1 __rte_cache_aligned; /**< empty cache line */
+   RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */

/** Ring consumer status. */
union {
struct rte_ring_headtail cons;
struct rte_ring_hts_headtail hts_cons;
struct rte_ring_rts_headtail rts_cons;
}  __rte_cache_aligned;

-   char pad2 __rte_cache_aligned; /**< empty cache line */
+   RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
};


And for the mempool library:

#ifdef RTE_LIBRTE_MEMPOOL_STATS
/**
 * A structure that stores the mempool statistics (per-lcore).
 * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are not
 * captured since they can be calculated from other stats.
 * For exa
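
(A rough sketch, for illustration only, of how the proposed
RTE_CACHE_GUARD(name) macro could be applied to a per-lcore statistics
array; the struct and field names below are made up and are not the actual
rte_mempool definitions.)

struct per_lcore_stats {
        uint64_t put_objs;          /* updated only by the owning lcore */
        uint64_t get_success_objs;
        uint64_t get_fail_objs;
        RTE_CACHE_GUARD(next);      /* keep neighbouring entries one guard line apart */
} __rte_cache_aligned;

struct per_lcore_stats stats[RTE_MAX_LCORE];
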

Re: [RFC] cache guard

2023-08-27 Thread Mattias Rönnblom

On 2023-08-27 10:34, Morten Brørup wrote:

+CC Honnappa and Konstantin, Ring lib maintainers
+CC Mattias, PRNG lib maintainer


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 11.24

On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:

+CC mempool maintainers


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 10.23

On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:

Bruce,

With this patch [1], it is noted that the ring producer and

consumer data

should not be on adjacent cache lines, for performance reasons.


[1]:



https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
fd4b66

e75485cc8b63b9aedfbdfe8b0


(It's obvious that they cannot share the same cache line, because

they are

accessed by two different threads.)


Intuitively, I would think that having them on different cache

lines would

suffice. Why does having an empty cache line between them make a

difference?


And does it need to be an empty cache line? Or does it suffice

having the

second structure start at two cache lines after the start of the

first

structure (e.g. if the size of the first structure is two cache

lines)?


I'm asking because the same principle might apply to other code

too.



Hi Morten,

this was something we discovered when working on the distributor

library.

If we have cachelines per core where there is heavy access, having

some

cachelines as a gap between the content cachelines can help

performance. We

believe this helps due to avoiding issues with the HW prefetchers

(e.g.

adjacent cacheline prefetcher) bringing in the second cacheline
speculatively when an operation is done on the first line.


I guessed that it had something to do with speculative prefetching,

but wasn't sure. Good to get confirmation, and that it has a measurable
effect somewhere. Very interesting!


NB: More comments in the ring lib about stuff like this would be nice.

So, for the mempool lib, what do you think about applying the same

technique to the rte_mempool_debug_stats structure (which is an array
indexed per lcore)... Two adjacent lcores heavily accessing their local
mempool caches seems likely to me. But how heavy does the access need to
be for this technique to be relevant?




No idea how heavy the accesses need to be for this to have a noticeable
effect. For things like debug stats, I wonder how worthwhile making such
a
change would be, but then again, any change would have very low impact
too
in that case.


I just tried adding padding to some of the hot structures in our own 
application, and observed a significant performance improvement for those.

So I think this technique should have higher visibility in DPDK by adding a new 
cache macro to rte_common.h:

/**
  * Empty cache line, to guard against speculative prefetching.
  *


"to guard against false sharing-like effects on systems with a 
next-N-lines hardware prefetcher"



  * Use as spacing between data accessed by different lcores,
  * to prevent cache thrashing on CPUs with speculative prefetching.
  */
#define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] 
__rte_cache_aligned;



You could have a macro which specified how much guarding there needs to 
be, ideally defined on a per-CPU basis. (These things have nothing to do 
with the ISA, but everything to do with the implementation.)


I'm not sure N is always 1.

So the guard padding should be RTE_CACHE_LINE_SIZE * 
RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in

#if RTE_CACHE_GUARD_LINES > 0
#endif

...so you can disable this (cute!) hack (on custom DPDK builds) in case 
you have disabled hardware prefetching, which seems generally to be a 
good idea for packet processing type applications.


...which leads me to another suggestion: add a note on disabling 
hardware prefetching in the optimization guide.


Seems like a very good idea to have this in <rte_common.h>, and 
otherwise make this issue visible and known.



To be used like this:

struct rte_ring {
char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
/**< Name of the ring. */
int flags;   /**< Flags supplied at creation. */
const struct rte_memzone *memzone;
/**< Memzone, if any, containing the rte_ring */
uint32_t size;   /**< Size of ring. */
uint32_t mask;   /**< Mask (size-1) of ring. */
uint32_t capacity;   /**< Usable size of ring */

-   char pad0 __rte_cache_aligned; /**< empty cache line */
+   RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */

/** Ring producer status. */
union {
struct rte_ring_headtail prod;
struct rte_ring_hts_headtail hts_prod;
struct rte_ring_rts_headtail rts_prod;
}  __rte_cache_aligned;

-   char pad1 __rte_cache_aligned; /**< empty cache line */
+   RTE_CACHE_GUARD(both);  

RE: [RFC] cache guard

2023-08-27 Thread Morten Brørup
> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Sunday, 27 August 2023 15.55
> 
> On 2023-08-27 10:34, Morten Brørup wrote:
> > +CC Honnappa and Konstantin, Ring lib maintainers
> > +CC Mattias, PRNG lib maintainer
> >
> >> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >> Sent: Friday, 25 August 2023 11.24
> >>
> >> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>> +CC mempool maintainers
> >>>
>  From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>  Sent: Friday, 25 August 2023 10.23
> 
>  On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > Bruce,
> >
> > With this patch [1], it is noted that the ring producer and
> >> consumer data
>  should not be on adjacent cache lines, for performance reasons.
> >
> > [1]:
> 
> >>
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> >> fd4b66
>  e75485cc8b63b9aedfbdfe8b0
> >
> > (It's obvious that they cannot share the same cache line, because
> >> they are
>  accessed by two different threads.)
> >
> > Intuitively, I would think that having them on different cache
> >> lines would
>  suffice. Why does having an empty cache line between them make a
> >> difference?
> >
> > And does it need to be an empty cache line? Or does it suffice
> >> having the
>  second structure start at two cache lines after the start of the
> >> first
>  structure (e.g. if the size of the first structure is two cache
> >> lines)?
> >
> > I'm asking because the same principle might apply to other code
> >> too.
> >
>  Hi Morten,
> 
>  this was something we discovered when working on the distributor
> >> library.
>  If we have cachelines per core where there is heavy access, having
> >> some
>  cachelines as a gap between the content cachelines can help
> >> performance. We
>  believe this helps due to avoiding issues with the HW prefetchers
> >> (e.g.
>  adjacent cacheline prefetcher) bringing in the second cacheline
>  speculatively when an operation is done on the first line.
> >>>
> >>> I guessed that it had something to do with speculative prefetching,
> >> but wasn't sure. Good to get confirmation, and that it has a
> measurable
> >> effect somewhere. Very interesting!
> >>>
> >>> NB: More comments in the ring lib about stuff like this would be
> nice.
> >>>
> >>> So, for the mempool lib, what do you think about applying the same
> >> technique to the rte_mempool_debug_stats structure (which is an array
> >> indexed per lcore)... Two adjacent lcores heavily accessing their
> local
> >> mempool caches seems likely to me. But how heavy does the access need
> to
> >> be for this technique to be relevant?
> >>>
> >>
> >> No idea how heavy the accesses need to be for this to have a
> noticeable
> >> effect. For things like debug stats, I wonder how worthwhile making
> such
> >> a
> >> change would be, but then again, any change would have very low
> impact
> >> too
> >> in that case.
> >
> > I just tried adding padding to some of the hot structures in our own
> application, and observed a significant performance improvement for
> those.
> >
> > So I think this technique should have higher visibility in DPDK by
> adding a new cache macro to rte_common.h:
> >
> > /**
> >   * Empty cache line, to guard against speculative prefetching.
> >   *
> 
> "to guard against false sharing-like effects on systems with a
> next-N-lines hardware prefetcher"
> 
> >   * Use as spacing between data accessed by different lcores,
> >   * to prevent cache thrashing on CPUs with speculative prefetching.
> >   */
> > #define RTE_CACHE_GUARD(name) char
> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> >
> 
> You could have a macro which specified how much guarding there needs to
> be, ideally defined on a per-CPU basis. (These things have nothing to do
> with the ISA, but everything to do with the implementation.)
> 
> I'm not sure N is always 1.
> 
> So the guard padding should be RTE_CACHE_LINE_SIZE *
> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> #if RTE_CACHE_GUARD_LINES > 0
> #endif
> 
> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> you have disabled hardware prefetching, which seems generally to be a
> good idea for packet processing type applications.
> 
> ...which leads me to another suggestion: add a note on disabling
> hardware prefetching in the optimization guide.
> 
> Seems like a very good idea to have this in <rte_common.h>, and
> otherwise make this issue visible and known.

Good points, Mattias!

I also prefer the name-less macro you suggested below.

So, this gets added to rte_common.h:

/**
 * Empty cache lines, to guard against false sharing-like effects
 * on systems with a next-N-lines hardware prefetcher.
 *
 * Use as spacing between data accessed by different lcores,
 * to prevent cache thrashing on hardware with speculative 

Re: [RFC] cache guard

2023-08-27 Thread Mattias Rönnblom

On 2023-08-27 17:40, Morten Brørup wrote:

From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Sunday, 27 August 2023 15.55

On 2023-08-27 10:34, Morten Brørup wrote:

+CC Honnappa and Konstantin, Ring lib maintainers
+CC Mattias, PRNG lib maintainer


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 11.24

On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:

+CC mempool maintainers


From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Friday, 25 August 2023 10.23

On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:

Bruce,

With this patch [1], it is noted that the ring producer and

consumer data

should not be on adjacent cache lines, for performance reasons.


[1]:





https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f

fd4b66

e75485cc8b63b9aedfbdfe8b0


(It's obvious that they cannot share the same cache line, because

they are

accessed by two different threads.)


Intuitively, I would think that having them on different cache

lines would

suffice. Why does having an empty cache line between them make a

difference?


And does it need to be an empty cache line? Or does it suffice

having the

second structure start at two cache lines after the start of the

first

structure (e.g. if the size of the first structure is two cache

lines)?


I'm asking because the same principle might apply to other code

too.



Hi Morten,

this was something we discovered when working on the distributor

library.

If we have cachelines per core where there is heavy access, having

some

cachelines as a gap between the content cachelines can help

performance. We

believe this helps due to avoiding issues with the HW prefetchers

(e.g.

adjacent cacheline prefetcher) bringing in the second cacheline
speculatively when an operation is done on the first line.


I guessed that it had something to do with speculative prefetching,

but wasn't sure. Good to get confirmation, and that it has a

measurable

effect somewhere. Very interesting!


NB: More comments in the ring lib about stuff like this would be

nice.


So, for the mempool lib, what do you think about applying the same

technique to the rte_mempool_debug_stats structure (which is an array
indexed per lcore)... Two adjacent lcores heavily accessing their

local

mempool caches seems likely to me. But how heavy does the access need

to

be for this technique to be relevant?




No idea how heavy the accesses need to be for this to have a

noticeable

effect. For things like debug stats, I wonder how worthwhile making

such

a
change would be, but then again, any change would have very low

impact

too
in that case.


I just tried adding padding to some of the hot structures in our own

application, and observed a significant performance improvement for
those.


So I think this technique should have higher visibility in DPDK by

adding a new cache macro to rte_common.h:


/**
   * Empty cache line, to guard against speculative prefetching.
   *


"to guard against false sharing-like effects on systems with a
next-N-lines hardware prefetcher"


   * Use as spacing between data accessed by different lcores,
   * to prevent cache thrashing on CPUs with speculative prefetching.
   */
#define RTE_CACHE_GUARD(name) char

cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;




You could have a macro which specified how much guarding there needs to
be, ideally defined on a per-CPU basis. (These things have nothing to do
with the ISA, but everything to do with the implementation.)

I'm not sure N is always 1.

So the guard padding should be RTE_CACHE_LINE_SIZE *
RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
#if RTE_CACHE_GUARD_LINES > 0
#endif

...so you can disable this (cute!) hack (on custom DPDK builds) in case
you have disabled hardware prefetching, which seems generally to be a
good idea for packet processing type applications.

...which leads me to another suggestion: add a note on disabling
hardware prefetching in the optimization guide.

Seems like a very good idea to have this in <rte_common.h>, and
otherwise make this issue visible and known.


Good points, Mattias!

I also prefer the name-less macro you suggested below.

So, this gets added to rte_common.h:

/**
  * Empty cache lines, to guard against false sharing-like effects
  * on systems with a next-N-lines hardware prefetcher.
  *
  * Use as spacing between data accessed by different lcores,
  * to prevent cache thrashing on hardware with speculative prefetching.
  */
#if RTE_CACHE_GUARD_LINES > 0
#define _RTE_CACHE_GUARD_HELPER2(unique) \
 char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * 
RTE_CACHE_GUARD_LINES] \
 __rte_cache_aligned;
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
#else
#define RTE_CACHE_GUARD
#endif



Seems like a good solution. I thought as far as using __LINE__ to build 
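
(For reference, a usage sketch of the name-less RTE_CACHE_GUARD macro defined
above. The struct and field names are purely illustrative, and
RTE_CACHE_GUARD_LINES is assumed to be a build-time configuration option as
proposed; note that the helper above already ends in a semicolon, so the
guard is written without one here.)

struct lcore_shared {
        uint64_t produced __rte_cache_aligned; /* written by lcore A */
        RTE_CACHE_GUARD                        /* uniquely named padding via __COUNTER__ */
        uint64_t consumed __rte_cache_aligned; /* written by lcore B */
        RTE_CACHE_GUARD
};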

Re: [PATCH v3] bus/pci: fix legacy device IO port map in secondary process

2023-08-27 Thread Gupta, Nipun

Hi Wenwu

On 8/22/2023 7:48 AM, Wenwu Ma wrote:

When doing IO port mapping for a legacy device
in a secondary process, the region information
is missing, so we need to refill it.


Please use 72 columns in the commit log



Fixes: 4b741542ecde ("bus/pci: avoid depending on private kernel value")
Cc: sta...@dpdk.org

Signed-off-by: Wenwu Ma 
---
v3:
  - adjusting variable settings
v2:
  - add release of device in pci_vfio_ioport_unmap

---
  drivers/bus/pci/linux/pci_vfio.c | 43 ++--
  1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index e634de8322..5ef26c98d1 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -1314,6 +1314,27 @@ pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
return -1;
}
  
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {


One thing I am not fully convinced about is why VFIO setup is not 
required for the primary process here?


Thanks,
Nipun
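
(For context on what "refilling" the region information involves, here is a
minimal, hypothetical sketch of querying a BAR's region info directly from
the VFIO device fd using the standard VFIO ioctl; it illustrates the
mechanism only and is not the code of this patch.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Ask the kernel for the offset and size of one BAR region. */
static int
query_bar_region(int vfio_dev_fd, unsigned int bar,
                 uint64_t *offset, uint64_t *size)
{
        struct vfio_region_info reg = { .argsz = sizeof(reg) };

        reg.index = bar;
        if (ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
                return -1;

        *offset = reg.offset;
        *size = reg.size;
        return 0;
}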


RE: [RFC] cache guard

2023-08-27 Thread Morten Brørup
> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 28 August 2023 00.31
> 
> On 2023-08-27 17:40, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Sunday, 27 August 2023 15.55
> >>
> >> On 2023-08-27 10:34, Morten Brørup wrote:
> >>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>> +CC Mattias, PRNG lib maintainer
> >>>
>  From: Bruce Richardson [mailto:bruce.richard...@intel.com]
>  Sent: Friday, 25 August 2023 11.24
> 
>  On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > +CC mempool maintainers
> >
> >> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >> Sent: Friday, 25 August 2023 10.23
> >>
> >> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>> Bruce,
> >>>
> >>> With this patch [1], it is noted that the ring producer and
>  consumer data
> >> should not be on adjacent cache lines, for performance reasons.
> >>>
> >>> [1]:
> >>
> 
> >> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>  fd4b66
> >> e75485cc8b63b9aedfbdfe8b0
> >>>
> >>> (It's obvious that they cannot share the same cache line, because
>  they are
> >> accessed by two different threads.)
> >>>
> >>> Intuitively, I would think that having them on different cache
>  lines would
> >> suffice. Why does having an empty cache line between them make a
>  difference?
> >>>
> >>> And does it need to be an empty cache line? Or does it suffice
>  having the
> >> second structure start at two cache lines after the start of the
>  first
> >> structure (e.g. if the size of the first structure is two cache
>  lines)?
> >>>
> >>> I'm asking because the same principle might apply to other code
>  too.
> >>>
> >> Hi Morten,
> >>
> >> this was something we discovered when working on the distributor
>  library.
> >> If we have cachelines per core where there is heavy access, having
>  some
> >> cachelines as a gap between the content cachelines can help
>  performance. We
> >> believe this helps due to avoiding issues with the HW prefetchers
>  (e.g.
> >> adjacent cacheline prefetcher) bringing in the second cacheline
> >> speculatively when an operation is done on the first line.
> >
> > I guessed that it had something to do with speculative prefetching,
>  but wasn't sure. Good to get confirmation, and that it has a
> >> measurable
>  effect somewhere. Very interesting!
> >
> > NB: More comments in the ring lib about stuff like this would be
> >> nice.
> >
> > So, for the mempool lib, what do you think about applying the same
>  technique to the rte_mempool_debug_stats structure (which is an array
>  indexed per lcore)... Two adjacent lcores heavily accessing their
> >> local
>  mempool caches seems likely to me. But how heavy does the access need
> >> to
>  be for this technique to be relevant?
> >
> 
>  No idea how heavy the accesses need to be for this to have a
> >> noticeable
>  effect. For things like debug stats, I wonder how worthwhile making
> >> such
>  a
>  change would be, but then again, any change would have very low
> >> impact
>  too
>  in that case.
> >>>
> >>> I just tried adding padding to some of the hot structures in our own
> >> application, and observed a significant performance improvement for
> >> those.
> >>>
> >>> So I think this technique should have higher visibility in DPDK by
> >> adding a new cache macro to rte_common.h:
> >>>
> >>> /**
> >>>* Empty cache line, to guard against speculative prefetching.
> >>>*
> >>
> >> "to guard against false sharing-like effects on systems with a
> >> next-N-lines hardware prefetcher"
> >>
> >>>* Use as spacing between data accessed by different lcores,
> >>>* to prevent cache thrashing on CPUs with speculative prefetching.
> >>>*/
> >>> #define RTE_CACHE_GUARD(name) char
> >> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> >>>
> >>
> >> You could have a macro which specified how much guarding there needs to
> >> be, ideally defined on a per-CPU basis. (These things have nothing to do
> >> with the ISA, but everything to do with the implementation.)
> >>
> >> I'm not sure N is always 1.
> >>
> >> So the guard padding should be RTE_CACHE_LINE_SIZE *
> >> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> >> #if RTE_CACHE_GUARD_LINES > 0
> >> #endif
> >>
> >> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> >> you have disabled hardware prefetching, which seems generally to be a
> >> good idea for packet processing type applications.
> >>
> >> ...which leads me to another suggestion: add a note on disabling
> >> hardware prefetching in the optimization guide.
> >>
> >> Seems like a very good idea t

[PATCH] test/crypto: fix return value for GMAC testcase

2023-08-27 Thread Hemant Agrawal
create_gmac_session already converts a -ENOTSUP return value
to TEST_SKIPPED.

Fixes: bdce2564dbf7 ("cryptodev: rework session framework")
Cc: gak...@marvell.com
Cc: sta...@dpdk.org

Signed-off-by: Hemant Agrawal 
---
 app/test/test_cryptodev.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 956268bfcd..068e47a972 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -13538,7 +13538,7 @@ test_AES_GMAC_authentication(const struct 
gmac_test_data *tdata)
retval = create_gmac_session(ts_params->valid_devs[0],
tdata, RTE_CRYPTO_AUTH_OP_GENERATE);
 
-   if (retval == -ENOTSUP)
+   if (retval == TEST_SKIPPED)
return TEST_SKIPPED;
if (retval < 0)
return retval;
@@ -13671,7 +13671,7 @@ test_AES_GMAC_authentication_verify(const struct 
gmac_test_data *tdata)
retval = create_gmac_session(ts_params->valid_devs[0],
tdata, RTE_CRYPTO_AUTH_OP_VERIFY);
 
-   if (retval == -ENOTSUP)
+   if (retval == TEST_SKIPPED)
return TEST_SKIPPED;
if (retval < 0)
return retval;
@@ -13802,7 +13802,7 @@ test_AES_GMAC_authentication_SGL(const struct 
gmac_test_data *tdata,
retval = create_gmac_session(ts_params->valid_devs[0],
tdata, RTE_CRYPTO_AUTH_OP_GENERATE);
 
-   if (retval == -ENOTSUP)
+   if (retval == TEST_SKIPPED)
return TEST_SKIPPED;
if (retval < 0)
return retval;
@@ -14419,7 +14419,7 @@ test_authentication_verify_fail_when_data_corruption(
reference,
RTE_CRYPTO_AUTH_OP_VERIFY);
 
-   if (retval == -ENOTSUP)
+   if (retval == TEST_SKIPPED)
return TEST_SKIPPED;
if (retval < 0)
return retval;
@@ -14508,6 +14508,8 @@ test_authentication_verify_GMAC_fail_when_corruption(
reference,
RTE_CRYPTO_AUTH_OP_VERIFY,
RTE_CRYPTO_CIPHER_OP_DECRYPT);
+   if (retval == TEST_SKIPPED)
+   return TEST_SKIPPED;
if (retval < 0)
return retval;
 
@@ -14600,8 +14602,7 @@ test_authenticated_decryption_fail_when_corruption(
reference,
RTE_CRYPTO_AUTH_OP_VERIFY,
RTE_CRYPTO_CIPHER_OP_DECRYPT);
-
-   if (retval == -ENOTSUP)
+   if (retval == TEST_SKIPPED)
return TEST_SKIPPED;
if (retval < 0)
return retval;
-- 
2.25.1