RE: [PATCH v2] app/eventdev: add crypto producer mode

2021-12-30 Thread Gujjar, Abhinandan S
Hi Shijith,

> -----Original Message-----
> From: Shijith Thotton 
> Sent: Tuesday, December 21, 2021 2:21 PM
> To: dev@dpdk.org; Jerin Jacob 
> Cc: Shijith Thotton ; ano...@marvell.com;
> pbhagavat...@marvell.com; gak...@marvell.com; Gujjar, Abhinandan S
> 
> Subject: [PATCH v2] app/eventdev: add crypto producer mode
> 
> In crypto producer mode, the producer core enqueues software-generated
> crypto ops to the cryptodev and the worker core dequeues crypto completion
> events from the eventdev. The event crypto metadata used for the above
> processing is pre-populated in each crypto session.
> 
> Parameter --prod_type_cryptodev can be used to enable crypto producer mode.
> Parameter --crypto_adptr_mode can be set to select the crypto adapter mode, 0
> for OP_NEW and 1 for OP_FORWARD.
> 
> This mode can be used to measure the performance of the crypto adapter.
> 
> Example:
>   ./dpdk-test-eventdev -l 0-2 -w  -w  -- \
>   --prod_type_cryptodev --crypto_adptr_mode 1 --test=perf_atq \
>   --stlist=a --wlcores 1 --plcores 2
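> 
> As a rough illustration of the two halves described above (an
> illustrative sketch, not the actual test-eventdev code; the device,
> queue-pair and port ids are assumed to be configured beforehand):
> 
>   /* Producer core: enqueue software-generated ops to the cryptodev;
>    * the crypto adapter turns their completions into events. */
>   uint16_t sent = rte_cryptodev_enqueue_burst(cdev_id, qp_id, ops, nb_ops);
> 
>   /* Worker core: completions arrive as events from the eventdev. */
>   struct rte_event ev;
>   if (rte_event_dequeue_burst(evdev_id, port_id, &ev, 1, 0)) {
>           struct rte_crypto_op *op = ev.event_ptr;
>           /* process/measure the completed op */
>   }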

This patch has a performance failure, as shown below. Could you please look
into it?
105300 --> performance testing fail

Test environment and result as below:

Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter 82599ES 10000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/4

Detail performance results: 
+------------+---------+----------+-------------+----------------------------+
| frame_size | txd/rxd | num_cpus | num_threads | throughput difference from |
|            |         |          |             | expected                   |
+============+=========+==========+=============+============================+
| 64         | 512     | 1        | 1           | 0.3%                       |
+------------+---------+----------+-------------+----------------------------+
| 64         | 2048    | 1        | 1           | -0.2%                      |
+------------+---------+----------+-------------+----------------------------+
| 64         | 512     | 1        | 2           | 0.0%                       |
+------------+---------+----------+-------------+----------------------------+
| 64         | 2048    | 1        | 2           | 0.3%                       |
+------------+---------+----------+-------------+----------------------------+

Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 1/4

Detail performance results: 
+------------+---------+----------+-------------+----------------------------+
| frame_size | txd/rxd | num_cpus | num_threads | throughput difference from |
|            |         |          |             | expected                   |
+============+=========+==========+=============+============================+
| 64         | 512     | 1        | 1           | 0.2%                       |
+------------+---------+----------+-------------+----------------------------+
| 64         | 2048    | 1        | 1           | -0.7%                      |
+------------+---------+----------+-------------+----------------------------+
| 64         | 512     | 1        | 2           | -1.5%                      |
+------------+---------+----------+-------------+----------------------------+
| 64         | 2048    | 1        | 2           | -5.3%                      |
+------------+---------+----------+-------------+----------------------------+

Ubuntu 20.04 ARM
Kernel: 4.15.0-132-generic
Compiler: gcc 7.5
NIC: Arm Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/2

Detail performance results: 
+------------+---------+----------+-------------+----------------------------+
| frame_size | txd/rxd | num_cpus | num_threads | throughput difference from |
|            |         |          |             | expected                   |
+============+=========+==========+=============+============================+
| 64         | 512     | 1        | 1           | 0.1%                       |
+------------+---------+----------+-------------+----------------------------+
| 64         | 2048    | 1        | 1           | -0.5%                      |
+------------+---------+----------+-------------+----------------------------+

To view detailed results, visit:
https://lab.dpdk.org/results/dashboard/patchsets/20534/

> 
> Signed-off-by: Shijith Thotton 
> ---
> v2:
> * Fix RHEL compilation warning.
> 
>  app/test-eventdev/evt_common.h   |   3 +
>  app/test-eventdev/evt_main.c |  13 +-
>  app/test-eventdev/evt_options.c  |  27 ++
>  app/test-eventdev/evt_options.h  |  12 +
>  app/test-eventdev/evt_test.h |   6 +
>  app/test-eventdev/test_perf_atq.c|  51 
>  app/test-eventdev/test_perf_common.c | 406 ++-
> app/test-eventdev/test_perf_

RE: 19.11.11 (RC2) patches review and test

2021-12-30 Thread Ali Alnubani
> -----Original Message-----
> From: christian.ehrha...@canonical.com
> 
> Sent: Monday, December 20, 2021 10:00 AM
> To: sta...@dpdk.org
> Cc: dev@dpdk.org; Abhishek Marathe ;
> Akhil Goyal ; Ali Alnubani ;
> benjamin.wal...@intel.com; David Christensen ;
> Hemant Agrawal ; Ian Stokes
> ; Jerin Jacob ; John McNamara
> ; Ju-Hyoung Lee ;
> Kevin Traynor ; Luca Boccassi ;
> Pei Zhang ; pingx...@intel.com;
> qian.q...@intel.com; Raslan Darawsheh ; NBU-
> Contact-Thomas Monjalon (EXTERNAL) ;
> yuan.p...@intel.com; zhaoyan.c...@intel.com
> Subject: 19.11.11 (RC2) patches review and test
> 
> Hi all,
> 
> Here is a list of patches targeted for stable release 19.11.11.
> 
> The planned date for the final release is 7th January 2022.
> 
> Please help with testing and validation of your use cases and report
> any issues/results with reply-all to this mail. For the final release
> the fixes and reported validations will be added to the release notes.
> 
> This -rc2 is supposed to be functionally equivalent to the -rc1 version
> we had 11 days ago. The only fixes added since v19.11.11-rc1 are for
> typos (in comments) and to fix compilation issues on some kernels
> and newer toolchains. We still can't build everything with clang13 (19.11
> never built there, so this is not a regression), but various issues blocking
> that are already resolved. The issues with SLES15 kernels are resolved,
> and building with LTO on gcc 10 is fixed as well.
> 
> Therefore there is no strict need to rerun all tests on -rc2. On the other
> hand, if you have the time and capacity, I'd absolutely appreciate it if
> you could do so.
> What is important and should be tested are various builds, to ensure that
> none of these changes broke a build for those target platforms that worked
> before.
> Furthermore, if you previously had functional tests blocked by those
> build issues, you can now build and run the tests that were formerly
> blocked.
> 
> List of fixed bugs since -rc1:
> - https://bugs.dpdk.org/show_bug.cgi?id=745
> - https://bugs.dpdk.org/show_bug.cgi?id=900
> - https://bugs.dpdk.org/show_bug.cgi?id=901
> - https://bugs.dpdk.org/show_bug.cgi?id=902
> - https://bugs.dpdk.org/show_bug.cgi?id=903
> - https://bugs.dpdk.org/show_bug.cgi?id=907
> - https://bugs.dpdk.org/show_bug.cgi?id=908
> - Build on FreeBSD 13 (had no bug number)
> 
> Known and still remaining are:
> - https://bugs.dpdk.org/show_bug.cgi?id=744
> - https://bugs.dpdk.org/show_bug.cgi?id=747
> - https://bugs.dpdk.org/show_bug.cgi?id=904
> - https://bugs.dpdk.org/show_bug.cgi?id=905
> - https://bugs.dpdk.org/show_bug.cgi?id=912
> 
> For everyone helping to fix so many of them already, thank you a lot!
> Maybe not in 19.11.11, but if the joint efforts continue, 19.11.12
> may be able to resolve all of the issues that are currently left.
> So keep the patches coming. Even after 19.11.11 is released I'll continue
> to enqueue your build fixes and since we just extended the lifetime of 19.11
> to three years there will be a 19.11.12 coming to pick them up.
> 

Hi,

The following covers the functional tests that we ran on Nvidia hardware for 
this release:
- Basic functionality:
  Send and receive multiple types of traffic.
- testpmd xstats counter test.
- testpmd timestamp test.
- Changing/checking link status through testpmd.
- RTE flow tests:
  Items: eth / ipv4 / ipv6 / tcp / udp / icmp / gre / nvgre / geneve / vxlan / 
mplsoudp / mplsogre
  Actions: drop / queue / rss / mark / flag / jump / count / raw_encap / 
raw_decap / vxlan_encap / vxlan_decap / NAT / dec_ttl
- Some RSS tests.
- VLAN filtering, stripping and insertion tests.
- Checksum and TSO tests.
- ptype tests.
- link_status_interrupt example application tests.
- l3fwd-power example application tests.
- Multi-process example applications tests.

Functional tests ran on:
- NIC: ConnectX-4 Lx / OS: Ubuntu 20.04 LTS / Driver: 
MLNX_OFED_LINUX-5.5-1.0.3.2 / Firmware: 14.32.1010
- NIC: ConnectX-4 Lx / OS: Ubuntu 20.04 LTS / Kernel: 5.16.0-rc7 / Driver: 
rdma-core v38.0 / Firmware: 14.32.1010
- NIC: ConnectX-5 / OS: Ubuntu 20.04 LTS / Driver: MLNX_OFED_LINUX-5.5-1.0.3.2 
/ Firmware: 16.32.1010
- NIC: ConnectX-5 / OS: Ubuntu 20.04 LTS / Kernel: 5.16.0-rc7 / Driver: v38.0 / 
Firmware: 16.32.1010

Compilation tests with multiple configurations in the following OS/driver 
combinations are also passing:
- Ubuntu 20.04.3 with MLNX_OFED_LINUX-5.5-1.0.3.2.
- Ubuntu 20.04.3 with rdma-core master (c52b43e).
- Ubuntu 20.04.3 with rdma-core v28.0.
- Ubuntu 18.04.6 with rdma-core v17.1.
- Ubuntu 18.04.6 with rdma-core master (c52b43e) (i386).
- Ubuntu 16.04.7 with rdma-core v22.7.
- Fedora 35 with rdma-core v38.0 (passing except for make builds with clang, 
see: https://bugs.dpdk.org/show_bug.cgi?id=912).
- Fedora 36 (Rawhide) with rdma-core v38.0
- CentOS 7 7.9.2009 with rdma-core master (940f53f).
- CentOS 7 7.9.2009 with MLNX_OFED_LINUX-5.5-1.0.3.2.
- CentOS 8 8.4.2105 with rdma-core master (940f53f).
- OpenSUSE Leap

[PATCH v1 0/1] integrate dmadev in vhost

2021-12-30 Thread Jiayu Hu
Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA
abstraction layer and simplify application logic, this patch integrates
dmadev in vhost.

To enable the flexibility of using DMA devices in different function
modules, not limited to vhost, vhost does not manage DMA devices.
Applications, like OVS, need to manage and configure DMA devices and
tell vhost which DMA device to use in every dataplane function call.

In addition, vhost supports M:N mapping between vrings and DMA virtual
channels. Specifically, one vring can use multiple different DMA channels
and one DMA channel can be shared by multiple vrings at the same time.
The reason for enabling one vring to use multiple DMA channels is that
it's possible for more than one dataplane thread to enqueue packets to
the same vring, each with its own DMA virtual channel. Besides, the number
of DMA devices is limited. For the purpose of scaling, it's necessary to
support sharing DMA channels among vrings.

As DMA acceleration is only enabled for the enqueue path, the new dataplane
functions look like:
1). rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id,
dma_vchan):
Get descriptors and submit copies to the DMA virtual channel for the
packets that need to be sent to the VM.
 
2). rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id,
dma_vchan):
Check completed DMA copies from the given DMA virtual channel and
write back corresponding descriptors to vring.

OVS needs to call rte_vhost_poll_enqueue_completed to clean in-flight
copies from the previous call, and it can be called inside the rxq_recv
function, so that it doesn't require a big change in the OVS datapath.
For example:
netdev_dpdk_vhost_rxq_recv() {
...
qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_RXQ;
rte_vhost_poll_enqueue_completed(vid, qid, ...);
}
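
Putting the two calls together, the TX side of a dataplane thread could
look roughly like the sketch below (a minimal illustration based on the
signatures above; the burst size, variable names, and required mbuf/vhost
headers are assumptions, not part of the patch):

static void
vhost_tx_burst(int vid, uint16_t queue_id, struct rte_mbuf **pkts,
        uint16_t count, int16_t dma_id, uint16_t dma_vchan)
{
        struct rte_mbuf *done[32];
        uint16_t n;

        /* Reclaim copies completed by the DMA vchannel on earlier calls;
         * only now may the application free the returned mbufs. */
        n = rte_vhost_poll_enqueue_completed(vid, queue_id, done,
                        RTE_DIM(done), dma_id, dma_vchan);
        while (n > 0)
                rte_pktmbuf_free(done[--n]);

        /* Submit new packets; their descriptors are written back to the
         * vring once the DMA copies complete. */
        rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count,
                        dma_id, dma_vchan);
}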

Change log
==
rfc -> v1:
- remove useless code
- support dynamic DMA vchannel ring size (rte_vhost_async_dma_configure)
- fix several bugs
- fix typo and coding style issues
- replace "while" with "for"
- update programmer guide 
- support sharing DMA devices among vhost ports in the vhost example
- remove "--dma-type" in vhost example

Jiayu Hu (1):
  vhost: integrate dmadev in asynchronous datapath

 doc/guides/prog_guide/vhost_lib.rst |  70 -
 examples/vhost/Makefile |   2 +-
 examples/vhost/ioat.c   | 218 --
 examples/vhost/ioat.h   |  63 
 examples/vhost/main.c   | 230 +++-
 examples/vhost/main.h   |  11 ++
 examples/vhost/meson.build  |   6 +-
 lib/vhost/meson.build   |   3 +-
 lib/vhost/rte_vhost_async.h | 121 +--
 lib/vhost/version.map   |   3 +
 lib/vhost/vhost.c   | 130 +++-
 lib/vhost/vhost.h   |  53 ++-
 lib/vhost/virtio_net.c  | 206 +++--
 13 files changed, 587 insertions(+), 529 deletions(-)
 delete mode 100644 examples/vhost/ioat.c
 delete mode 100644 examples/vhost/ioat.h

-- 
2.25.1



[PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath

2021-12-30 Thread Jiayu Hu
Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA
abstraction layer and simplify application logic, this patch integrates
dmadev in the asynchronous data path.

Signed-off-by: Jiayu Hu 
Signed-off-by: Sunil Pai G 
---
 doc/guides/prog_guide/vhost_lib.rst |  70 -
 examples/vhost/Makefile |   2 +-
 examples/vhost/ioat.c   | 218 --
 examples/vhost/ioat.h   |  63 
 examples/vhost/main.c   | 230 +++-
 examples/vhost/main.h   |  11 ++
 examples/vhost/meson.build  |   6 +-
 lib/vhost/meson.build   |   3 +-
 lib/vhost/rte_vhost_async.h | 121 +--
 lib/vhost/version.map   |   3 +
 lib/vhost/vhost.c   | 130 +++-
 lib/vhost/vhost.h   |  53 ++-
 lib/vhost/virtio_net.c  | 206 +++--
 13 files changed, 587 insertions(+), 529 deletions(-)
 delete mode 100644 examples/vhost/ioat.c
 delete mode 100644 examples/vhost/ioat.h

diff --git a/doc/guides/prog_guide/vhost_lib.rst 
b/doc/guides/prog_guide/vhost_lib.rst
index 76f5d303c9..bdce7cbf02 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -218,38 +218,12 @@ The following is an overview of some key Vhost API 
functions:
 
   Enable or disable zero copy feature of the vhost crypto backend.
 
-* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)``
+* ``rte_vhost_async_channel_register(vid, queue_id)``
 
   Register an async copy device channel for a vhost queue after vring
-  is enabled. Following device ``config`` must be specified together
-  with the registration:
+  is enabled.
 
-  * ``features``
-
-This field is used to specify async copy device features.
-
-``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can
-guarantee the order of copy completion is the same as the order
-of copy submission.
-
-Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is
-supported by vhost.
-
-  Applications must provide following ``ops`` callbacks for vhost lib to
-  work with the async copy devices:
-
-  * ``transfer_data(vid, queue_id, descs, opaque_data, count)``
-
-vhost invokes this function to submit copy data to the async devices.
-For non-async_inorder capable devices, ``opaque_data`` could be used
-for identifying the completed packets.
-
-  * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)``
-
-vhost invokes this function to get the copy data completed by async
-devices.
-
-* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, 
ops)``
+* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)``
 
   Register an async copy device channel for a vhost queue without
   performing any locking.
@@ -277,18 +251,13 @@ The following is an overview of some key Vhost API 
functions:
   This function is only safe to call in vhost callback functions
   (i.e., struct rte_vhost_device_ops).
 
-* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, 
comp_count)``
+* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, 
dma_vchan)``
 
   Submit an enqueue request to transmit ``count`` packets from host to guest
-  by async data path. Successfully enqueued packets can be transfer completed
-  or being occupied by DMA engines; transfer completed packets are returned in
-  ``comp_pkts``, but others are not guaranteed to finish, when this API
-  call returns.
+  by async data path. Applications must not free the packets submitted for
+  enqueue until the packets are completed.
 
-  Applications must not free the packets submitted for enqueue until the
-  packets are completed.
-
-* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)``
+* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, 
dma_vchan)``
 
   Poll enqueue completion status from async data path. Completed packets
   are returned to applications through ``pkts``.
@@ -298,7 +267,7 @@ The following is an overview of some key Vhost API 
functions:
   This function returns the amount of in-flight packets for the vhost
   queue using async acceleration.
 
-* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)``
+* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, 
dma_vchan)``
 
   Clear inflight packets which are submitted to DMA engine in vhost async data
   path. Completed packets are returned to applications through ``pkts``.
@@ -442,3 +411,26 @@ Finally, a set of device ops is defined for device 
specific operations:
 * ``get_notify_area``
 
   Called to get the notify area info of the queue.
+
+Vhost asynchronous data path
+----------------------------
+
+Vhost asynchronous data path leverages DMA devices to offload memory
+copies from the CPU and it is implemented in an asynchronous way. It
+enables 

[RFC PATCH 0/6] Fast restart with many hugepages

2021-12-30 Thread Dmitry Kozlyuk
This patchset is a new design and implementation of [1].

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quickly as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of the mmap(2) time spent in the kernel
goes to clearing the memory, i.e. filling it with zeros.
This is done when a file in hugetlbfs is mapped
for the first time system-wide, i.e. when a hugepage is committed,
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security put aside, e.g. when the environment is controlled,
this effort is wasted for the memory intended for DMA,
because its content will be overwritten anyway.

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel to clear the memory.
This allows the memory allocator to clean memory only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is an opt-in and a warning is logged.

The feature is Linux-only as it is related
to mapping hugepages from files which only Linux does.
It is inherently incompatible with --in-memory,
for --huge-unlink see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but still there may be users relying on the traditional behavior.
Such users will need to fix their code to use the new mode.
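
For example, code that relied on the traditional behavior could switch
to requesting clean memory explicitly (a sketch; struct app_ctx is a
hypothetical application type):

/* Before: relied on rte_malloc() returning zero-filled memory,
 * which the contract never guaranteed. */
struct app_ctx *ctx = rte_malloc(NULL, sizeof(*ctx), 0);

/* After: rte_zmalloc() guarantees zeroed memory; in the new mode
 * the allocator clears dirty elements on this path. */
struct app_ctx *ctx = rte_zmalloc(NULL, sizeof(*ctx), 0);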

# Implementation

## User Interface

There is a --huge-unlink switch in the same area to remove hugepage files
before mapping them. It is infeasible to use with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend the --huge-unlink option to represent only the valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode, do not unlink hugepages files, reuse them.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)
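
For instance, an application supplying its own argument vector could opt
in as in this hypothetical sketch (argument values are illustrative):

#include <rte_common.h>
#include <rte_eal.h>

int
init_eal_with_reuse(void)
{
        char *argv[] = {
                "app",                 /* placeholder program name */
                "--huge-unlink=never", /* reuse existing hugepage files */
                "-m", "1024",          /* initial allocation to remap */
        };

        /* EAL logs a warning in this mode, because reused hugepages
         * may hold data from a previous process. */
        return rte_eal_init(RTE_DIM(argv), argv);
}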

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See patch 4/6 description for details.

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handing it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the free memory
if the joint element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation. See patch 2/6 for details.
Patch 6/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.

Besides clearing memory, each mmap() call takes some time which adds up.
EAL does one call per hugepage, 1024 calls for 1 TB may take ~300 ms.
It does so in order to be able to unmap the segments one by one.
However, segments from initial allocation (-m) are never unmapped.
Ideally, initial allocation should take one mmap() call per memory type
(i.e. per NUMA node per page size) if --single-file-segments is used.
This further optimization is not implemented in current version.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozl...@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: allow hugepage file reuse with --huge-unlink
  app/test: add allocator performance benchmark

 app/test/meson.build  |   2 +
 app/test/test_malloc_perf.c   | 174 ++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  21 ++-
 .../prog_guide/env_abstraction_layer.rst  |  94 +-
 doc/guides/rel_notes/release_22_03.rst|   7 +
 lib/eal/common/eal_comm

[RFC PATCH 1/6] doc: add hugepage mapping details

2021-12-30 Thread Dmitry Kozlyuk
Hugepage mapping is a layer that EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of the mechanics used on each supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk 
---
 .../prog_guide/env_abstraction_layer.rst  | 85 +--
 1 file changed, 76 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst 
b/doc/guides/prog_guide/env_abstraction_layer.rst
index 29f6fefc48..6cddb86467 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs 
kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous 
memory.
 The physical address of the reserved memory for that memory zone is also 
returned to the user by the memory zone reservation API.
 
@@ -95,11 +95,12 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-Memory reservations done using the APIs provided by rte_malloc are also 
backed by pages from the hugetlbfs filesystem.
+Memory reservations done using the APIs provided by rte_malloc are also 
backed by hugepages.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +156,8 @@ of memory that can be used by DPDK application.
 :ref:`Multi-process Support ` for more details about
 DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +170,8 @@ not allow acquiring or releasing hugepages from the system 
at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +185,8 @@ matching can be used by these types of applications to 
satisfy both of these
 requirements. This can result in some increased memory usage which is
 very dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode. In dynamic
 memory mode, by default maximum of 2 gigabytes of VA space will be 
preallocated,
@@ -192,7 +196,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in
 a DPDK process is preallocated at startup, thereby placing an upper limit on 
how
@@ -222,7 +227,68 @@ Normally, these options do not need to be changed.
 can later be mapped into that preallocated VA space (if dynamic memory mode
 is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of methods used for each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using Win32 API,
+so available amount depends on the system load.
+It uses ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux implements a variety of methods:
+
+* mapping each hugepage from its own file in hugetlbfs;
+* mapping multiple hugepages from a shared file in hugetlbfs;
+* anonymous mapping.
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with ``--huge-dir`` option
+(or in the mount point for a speci

[RFC PATCH 2/6] mem: add dirty malloc element support

2021-12-30 Thread Dmitry Kozlyuk
The EAL malloc layer assumed all free elements' content
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:
1. EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s.
To save doing so, memalloc layer is allowed to return dirty memory.
Such segments are marked with RTE_MEMSEG_FLAG_DIRTY.
The allocator tracks elements that contain dirty memory
using the new flag in the element header.
When clean memory is requested via rte_zmalloc*()
and the suitable element is dirty, it is cleared on allocation.
When memory is deallocated, the freed element is joined
with adjacent free elements, and the dirty flag is updated:

dirty + freed + dirty = dirty  =>  no need to clean
        freed + dirty = dirty      the freed memory

clean + freed + clean = clean  =>  freed memory
clean + freed         = clean      must be cleared
        freed + clean = clean
        freed         = clean

As a result, memory is either cleared on free, as before,
or it will be cleared on allocation if need be, but never twice.
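
Condensed, the join rule is a logical OR over the merged neighbors,
because the freed element itself counts as clean (a sketch, not the
actual code):

/* The joint element is dirty iff any merged neighbor was dirty;
 * the freed memory is cleared on free only if the result is clean. */
static bool
joint_element_dirty(bool prev_dirty, bool next_dirty)
{
        return prev_dirty || next_dirty;
}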

Signed-off-by: Dmitry Kozlyuk 
---
 lib/eal/common/malloc_elem.c | 22 +++---
 lib/eal/common/malloc_elem.h | 11 +--
 lib/eal/common/malloc_heap.c | 18 --
 lib/eal/common/rte_malloc.c  | 21 ++---
 lib/eal/include/rte_memory.h |  8 ++--
 5 files changed, 60 insertions(+), 20 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, 
size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
struct rte_memseg_list *msl, size_t size,
-   struct malloc_elem *orig_elem, size_t orig_size)
+   struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
elem->heap = heap;
elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct 
malloc_heap *heap,
elem->next = NULL;
memset(&elem->free_list, 0, sizeof(elem->free_list));
elem->state = ELEM_FREE;
+   elem->dirty = dirty;
elem->size = size;
elem->pad = 0;
elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem 
*split_pt)
const size_t new_elem_size = elem->size - old_elem_size;
 
malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-elem->orig_elem, elem->orig_size);
+   elem->orig_elem, elem->orig_size, elem->dirty);
split_pt->prev = elem;
split_pt->next = next_elem;
if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem 
*elem2)
else
elem1->heap->last = elem1;
elem1->next = next;
+   elem1->dirty |= elem2->dirty;
if (elem1->pad) {
struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+   /*
+* Consider the element clean for the purposes of joining.
+* If both neighbors are clean or non-existent,
+* the joint element will be clean,
+* which means the memory should be cleared.
+* There is no need to clear the memory if the joint element is dirty.
+*/
+   elem->dirty = false;
elem = malloc_elem_join_adjacent_free(elem);
 
malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
/* decrease heap's count of allocated elements */
elem->heap->alloc_count--;
 
-   /* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+   /* Normally clear the memory when needed. */
+   if (!elem->dirty)
+   memset(ptr, 0, data_len);
+#else
+   /* Always poison the memory in debug mode. */
memset(ptr, MALLOC_POISON, data_len);
+#endif
 
return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
LIST_ENTRY(malloc_elem) free_list;
/**< list of free elements in heap */
struct rte_memseg_list *msl;
-   volatile enum elem_state state;
+   /** Element state, @c dirty and @c pad validity depends on it. */
+   /* An extra bit is needed to represent enum elem_state as signed int. */
+   enum elem_state state : 3;
+   /** If state == ELEM_FR

[RFC PATCH 3/6] eal: refactor --huge-unlink storage

2021-12-30 Thread Dmitry Kozlyuk
In preparation to extend --huge-unlink option semantics
refactor how it is stored in the internal configuration.
It makes future changes more isolated.

Signed-off-by: Dmitry Kozlyuk 
---
 lib/eal/common/eal_common_options.c | 9 +
 lib/eal/common/eal_internal_cfg.h   | 8 +++-
 lib/eal/linux/eal_memalloc.c| 7 ---
 lib/eal/linux/eal_memory.c  | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c 
b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
/* long options */
case OPT_HUGE_UNLINK_NUM:
-   conf->hugepage_unlink = 1;
+   conf->hugepage_file.unlink_before_mapping = true;
break;
 
case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
conf->in_memory = 1;
/* in-memory is a superset of noshconf and huge-unlink */
conf->no_shconf = 1;
-   conf->hugepage_unlink = 1;
+   conf->hugepage_file.unlink_before_mapping = true;
break;
 
case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config 
*internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
-   if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+   if (internal_cfg->no_hugetlbfs &&
+   internal_cfg->hugepage_file.unlink_before_mapping &&
!internal_cfg->in_memory) {
RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config 
*internal_cfg)
" is only supported in non-legacy memory mode\n");
}
if (internal_cfg->single_file_segments &&
-   internal_cfg->hugepage_unlink &&
+   internal_cfg->hugepage_file.unlink_before_mapping &&
!internal_cfg->in_memory) {
RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h 
b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+   /** Unlink files before mapping them to leave no trace in hugetlbfs. */
+   bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
volatile unsigned force_nchannel; /**< force number of channels */
volatile unsigned force_nrank;/**< force number of ranks */
volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-   unsigned hugepage_unlink; /**< true to unlink backing files */
+   struct hugepage_file_discipline hugepage_file;
volatile unsigned no_pci; /**< true to disable PCI */
volatile unsigned no_hpet;/**< true to disable HPET */
volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 337f2bc739..abbe605e49 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
__func__, strerror(errno));
goto resized;
}
-   if (internal_conf->hugepage_unlink &&
+   if (internal_conf->hugepage_file.unlink_before_mapping 
&&
!internal_conf->in_memory) {
if (unlink(path)) {
RTE_LOG(DEBUG, EAL, "%s(): unlink() 
failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
close_hugefile(fd, path, list_idx);
} else {
/* only remove file if we can take out a write lock */
-   if (internal_conf->hugepage_unlink == 0 &&
+   if (!internal_conf->hugepage_file.unlink_before_mapping &&
internal_conf->in_memory == 0 &&
lock(fd, LOCK_EX) == 1)
unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
/

[RFC PATCH 4/6] eal/linux: allow hugepage file reuse

2021-12-30 Thread Dmitry Kozlyuk
Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed.
In this case, the kernel clears the page to prevent data leaks,
because the mapped memory may contain leftover data
from the previous process that was using this memory.
Clearing takes the bulk of the time spent in mmap(2),
increasing EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:
1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.
2. When ftruncate(3) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.
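
Condensed, the dirtiness rule for memory mapped from a reused file might
read as follows (an illustrative sketch, not the patch code):

/* Memory is clean only if the kernel had to provide fresh, zeroed
 * storage for the mapping; everything else is assumed dirty. */
static bool
mapped_memory_dirty(bool file_reused, bool use_fallocate, bool file_extended)
{
        if (!file_reused)
                return false; /* newly created file: kernel zeroes the pages */
        if (use_fallocate)
                return true;  /* file holes unknown: assume dirty */
        return !file_extended; /* ftruncate: extension implies clean */
}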

Signed-off-by: Dmitry Kozlyuk 
---
 lib/eal/common/eal_internal_cfg.h |   2 +
 lib/eal/linux/eal_hugepage_info.c |  59 +++
 lib/eal/linux/eal_memalloc.c  | 157 ++
 3 files changed, 140 insertions(+), 78 deletions(-)

diff --git a/lib/eal/common/eal_internal_cfg.h 
b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..3685aa7c52 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
/** Unlink files before mapping them to leave no trace in hugetlbfs. */
bool unlink_before_mapping;
+   /** Reuse existing files, never delete or re-create them. */
+   bool keep_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal_hugepage_info.c 
b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..55debdedf0 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char 
*file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
unsigned long resv_pages, num_pages, over_pages, surplus_pages;
const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
else
over_pages = 0;
 
-   if (num_pages == 0 && over_pages == 0)
+   if (num_pages == 0 && over_pages == 0 && reusable_pages == 0)
RTE_LOG(WARNING, EAL, "No available %zu kB hugepages 
reported\n",
sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
if (num_pages < over_pages) /* overflow */
num_pages = UINT32_MAX;
 
+   num_pages += reusable_pages;
+   if (num_pages < reusable_pages) /* overflow */
+   num_pages = UINT32_MAX;
+
/* we want to return a uint32_t and more than this looks suspicious
 * anyway ... */
if (num_pages > UINT32_MAX)
@@ -298,12 +302,12 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int 
len)
 }
 
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, either remove it, or keep and count the page as reusable.
  */
 static int
-clear_hugedir(const char * hugedir)
+clear_hugedir(const char *hugedir, bool keep, unsigned int *reusable_pages)
 {
DIR *dir;
struct dirent *dirent;
@@ -346,8 +350,12 @@ clear_hugedir(const char * hugedir)
lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
/* if lock succeeds, remove the file */
-   if (lck_result != -1)
-   unlinkat(dir_fd, dirent->d_name, 0);
+   if (lck_result != -1) {
+   if (keep)
+   (*reusable_pages)++;
+   else
+   unlinkat(dir_fd, dirent->d_name, 0);
+   }
close (fd);
dirent = readdir(dir);
}
@@ -375,7 +383,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+   unsigned int reusable_pages)
 {
uint64_t total_pages = 0;
unsigned int i;
@@ -388,8 +397,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent 
*dirent)
 * in one socket and sorting them later
 */
total_pages = 0;
- 

[RFC PATCH 5/6] eal: allow hugepage file reuse with --huge-unlink

2021-12-30 Thread Dmitry Kozlyuk
Expose Linux EAL ability to reuse existing hugepage files
via --huge-unlink=never switch.
Default behavior is unchanged, it can also be specified
using --huge-unlink=existing for consistency.
Old --huge-unlink switch is kept,
it is an alias for --huge-unlink=always.

Signed-off-by: Dmitry Kozlyuk 
---
 doc/guides/linux_gsg/linux_eal_parameters.rst | 21 --
 .../prog_guide/env_abstraction_layer.rst  |  9 +
 doc/guides/rel_notes/release_22_03.rst|  7 
 lib/eal/common/eal_common_options.c   | 39 +--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst 
b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..64cd73b497 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,23 @@ Memory-related options
 Use specified hugetlbfs directory instead of autodetected ones. This can be
 a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-Unlink hugepage files after creating them (implies no secondary process
-support).
+*   ``--huge-unlink[=existing|always|never]``
+
+No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+existing hugepage files are removed and re-created
+to ensure the kernel clears the memory and prevents any data leaks.
+
+With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+hugepage files are also removed after creating them,
+so that the application leaves no files in hugetlbfs.
+This mode implies no multi-process support.
+
+When ``--huge-unlink=never`` is specified, existing hugepage files
+are not removed, either before or after mapping them.
+This makes restart faster by saving time to clear memory at initialization,
+but it may slow down zeroed allocations later.
+Reused hugepages can contain data from previous processes that used them,
+which may be a security concern.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst 
b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6cddb86467..d8940f5e2e 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -277,6 +277,15 @@ to prevent data leaks from previous users of the same 
hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One exception is ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+Memory allocator detects dirty segments and handles them accordingly,
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+
 Anonymous mapping does not allow multi-process architecture,
 but it is free of filename conflicts and leftover files on hugetlbfs.
 If memfd_create(2) is supported both at build and run time,
diff --git a/doc/guides/rel_notes/release_22_03.rst 
b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..0b882362cf 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,13 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications, refer to the user and programmer guides.
+
 
 Removed Items
 -
diff --git a/lib/eal/common/eal_common_options.c 
b/lib/eal/common/eal_common_options.c
index 7520ebda8e..905a7769bd 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
{OPT_FILE_PREFIX,   1, NULL, OPT_FILE_PREFIX_NUM  },
{OPT_HELP,  0, NULL, OPT_HELP_NUM },
{OPT_HUGE_DIR,  1, NULL, OPT_HUGE_DIR_NUM },
-   {OPT_HUGE_UNLINK,   0, NULL, OPT_HUGE_UNLINK_NUM  },
+   {OPT_HUGE_UNLINK,   2, NULL, OPT_HUGE_UNLINK_NUM  },
{OPT_IOVA_MODE, 1, NULL, OPT_IOVA_MODE_NUM},
{OPT_LCORES,1, NULL, OPT_LCORES_NUM   },
{OPT_LOG_LEVEL, 1, NULL, OPT_LOG_LEVEL_NUM},
@@ -1596,6 +1596,28 @@ available_cores(void)
return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+   if (arg 

[RFC PATCH 6/6] app/test: add allocator performance benchmark

2021-12-30 Thread Dmitry Kozlyuk
Memory allocator performance is crucial to applications that deal
with large amount of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, API used and, at least,
allocation size. New autotest is intended to be run with different
EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and total time to ease comparison.

Memory can be filled with zeroes at different points of allocation path,
but it always takes considerable fraction of overall timing. This is why
the test measures filling speed and prints how long clearing takes
for each size as a reference (for rte_memzone_reserve estimations
are printed).

Signed-off-by: Dmitry Kozlyuk 
Reviewed-by: Viacheslav Ovsiienko 
---
 app/test/meson.build|   2 +
 app/test/test_malloc_perf.c | 174 
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 2b480adfba..899034fc2a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -88,6 +88,7 @@ test_sources = files(
 'test_lpm6_perf.c',
 'test_lpm_perf.c',
 'test_malloc.c',
+'test_malloc_perf.c',
 'test_mbuf.c',
 'test_member.c',
 'test_member_perf.c',
@@ -295,6 +296,7 @@ extra_test_names = [
 
 perf_test_names = [
 'ring_perf_autotest',
+'malloc_perf_autotest',
 'mempool_perf_autotest',
 'memcpy_perf_autotest',
 'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 00..9686fc8af5
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+   return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+   static const size_t RUNS = 20;
+
+   void *ptr;
+   size_t i;
+   uint64_t tsc;
+
+   TEST_LOG(INFO, "Reference: memset\n");
+
+   ptr = rte_malloc(NULL, GB, 0);
+   if (ptr == NULL) {
+   TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+   return -1;
+   }
+
+   tsc = rte_rdtsc_precise();
+   for (i = 0; i < RUNS; i++)
+   memset(ptr, 0, GB);
+   tsc = rte_rdtsc_precise() - tsc;
+
+   *us_per_gb = tsc_to_us(tsc, RUNS);
+   TEST_LOG(INFO, "Result: %f.3 GiB/s <=> %.2f us/MiB\n",
+   US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+   rte_free(ptr);
+   TEST_LOG(INFO, "\n");
+   return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+   memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+   static const size_t SIZES[] = {
+   1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+   1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+   size_t i, j;
+   void **ptrs;
+
+   TEST_LOG(INFO, "Performance: %s\n", name);
+
+   ptrs = calloc(max_runs, sizeof(ptrs[0]));
+   if (ptrs == NULL) {
+   TEST_LOG(ERR, "Cannot allocate memory for pointers");
+   return -1;
+   }
+
+   TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+   "Alloc (us)", "Free (us)", "Total (us)",
+   memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+   for (i = 0; i < RTE_DIM(SIZES); i++) {
+   size_t size = SIZES[i];
+   size_t runs_done;
+   uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+   double alloc_time, free_time, memset_time;
+
+   tsc_start = rte_rdtsc_precise();
+   for (j = 0; j < max_runs; j++) {
+   ptrs[j] = alloc_fn(NULL, size, 0);
+   if (ptrs[j] == NULL)
+   break;
+   }
+   tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+   if (j == 0) {
+   TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+   size);
+   break;
+   }
+   runs_done = j;
+
+   if (memset_fn != 

Re: [PATCH v1 1/1] vhost: integrate dmadev in asynchronous datapath

2021-12-30 Thread Liang Ma
On Thu, Dec 30, 2021 at 04:55:05PM -0500, Jiayu Hu wrote:
> Since dmadev was introduced in 21.11, to avoid the overhead of the vhost DMA
> abstraction layer and simplify application logic, this patch integrates
> dmadev in the asynchronous data path.
> 
> Signed-off-by: Jiayu Hu 
> Signed-off-by: Sunil Pai G 
> ---
>  doc/guides/prog_guide/vhost_lib.rst |  70 -
>  examples/vhost/Makefile |   2 +-
>  examples/vhost/ioat.c   | 218 --
>  examples/vhost/ioat.h   |  63 
>  examples/vhost/main.c   | 230 +++-
>  examples/vhost/main.h   |  11 ++
>  examples/vhost/meson.build  |   6 +-
>  lib/vhost/meson.build   |   3 +-
>  lib/vhost/rte_vhost_async.h | 121 +--
>  lib/vhost/version.map   |   3 +
>  lib/vhost/vhost.c   | 130 +++-
>  lib/vhost/vhost.h   |  53 ++-
>  lib/vhost/virtio_net.c  | 206 +++--
>  13 files changed, 587 insertions(+), 529 deletions(-)
>  delete mode 100644 examples/vhost/ioat.c
>  delete mode 100644 examples/vhost/ioat.h
> 

> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index 33d023aa39..44073499bc 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -24,8 +24,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
> -#include "ioat.h"
>  #include "main.h"
>  
>  #ifndef MAX_QUEUES
> @@ -56,6 +57,14 @@
>  #define RTE_TEST_TX_DESC_DEFAULT 512
>  
>  #define INVALID_PORT_ID 0xFF
> +#define INVALID_DMA_ID -1
> +
> +#define MAX_VHOST_DEVICE 1024
> +#define DMA_RING_SIZE 4096
> +
> +struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE];
> +struct rte_vhost_async_dma_info dma_config[RTE_DMADEV_DEFAULT_MAX];
> +static int dma_count;
>  
>  /* mask of enabled ports */
>  static uint32_t enabled_port_mask = 0;
> @@ -96,8 +105,6 @@ static int builtin_net_driver;
>  
>  static int async_vhost_driver;
>  
> -static char *dma_type;
> -
>  /* Specify timeout (in useconds) between retries on RX. */
>  static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
>  /* Specify the number of retries on RX. */
> @@ -196,13 +203,134 @@ struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * 
> MAX_VHOST_DEVICE];
>  #define MBUF_TABLE_DRAIN_TSC ((rte_get_tsc_hz() + US_PER_S - 1) \
>/ US_PER_S * BURST_TX_DRAIN_US)
>  
> +static inline bool
> +is_dma_configured(int16_t dev_id)
> +{
> + int i;
> +
> + for (i = 0; i < dma_count; i++) {
> + if (dma_config[i].dev_id == dev_id) {
> + return true;
> + }
> + }
> + return false;
> +}
> +
>  static inline int
>  open_dma(const char *value)
>  {
> - if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0)
> - return open_ioat(value);
> + struct dma_for_vhost *dma_info = dma_bind;
> + char *input = strndup(value, strlen(value) + 1);
> + char *addrs = input;
> + char *ptrs[2];
> + char *start, *end, *substr;
> + int64_t vid, vring_id;
> +
> + struct rte_dma_info info;
> + struct rte_dma_conf dev_config = { .nb_vchans = 1 };
> + struct rte_dma_vchan_conf qconf = {
> + .direction = RTE_DMA_DIR_MEM_TO_MEM,
> + .nb_desc = DMA_RING_SIZE
> + };
> +
> + int dev_id;
> + int ret = 0;
> + uint16_t i = 0;
> + char *dma_arg[MAX_VHOST_DEVICE];
> + int args_nr;
> +
> + while (isblank(*addrs))
> + addrs++;
> + if (*addrs == '\0') {
> + ret = -1;
> + goto out;
> + }
> +
> + /* process DMA devices within bracket. */
> + addrs++;
> + substr = strtok(addrs, ";]");
> + if (!substr) {
> + ret = -1;
> + goto out;
> + }
> +
> + args_nr = rte_strsplit(substr, strlen(substr),
> + dma_arg, MAX_VHOST_DEVICE, ',');
> + if (args_nr <= 0) {
> + ret = -1;
> + goto out;
> + }
> +
> + while (i < args_nr) {
> + char *arg_temp = dma_arg[i];
> + uint8_t sub_nr;
> +
> + sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@');
> + if (sub_nr != 2) {
> + ret = -1;
> + goto out;
> + }
> +
> + start = strstr(ptrs[0], "txd");
Hi JiaYu,
it looks like the parameter checking ignores the "rxd" case? I think if
the patch enables enqueue/dequeue at the same time, "rxd" is needed for
the DMA parameters.
Regards,
Liang
 


RE: [PATCH v1] config/arm: add armv7 native config

2021-12-30 Thread Ruifeng Wang
> -----Original Message-----
> From: Juraj Linkeš 
> Sent: Thursday, November 18, 2021 6:46 PM
> To: tho...@monjalon.net; david.march...@redhat.com;
> bruce.richard...@intel.com; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; ferruh.yi...@intel.com;
> christian.ehrha...@canonical.com
> Cc: dev@dpdk.org; Juraj Linkeš 
> Subject: [PATCH v1] config/arm: add armv7 native config
> 
> Arvm7 native build fails with this error:

Typo, 'Armv7'

> ../config/meson.build:364:1: ERROR: Problem encountered:
> Number of CPU cores not specified.
> 
> This is because RTE_MAX_LCORE is not set. We also need to set
> RTE_MAX_NUMA_NODES in armv7 native builds.
> 
> Fixes: 8ef09fdc506b ("build: add optional NUMA and CPU counts detection")
> 
> Signed-off-by: Juraj Linkeš 
> ---
>  config/arm/meson.build | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build index
> 213324d262..57980661b2 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -432,6 +432,8 @@ if dpdk_conf.get('RTE_ARCH_32')
>  else
>  # armv7 build
>  dpdk_conf.set('RTE_ARCH_ARMv7', true)
> +dpdk_conf.set('RTE_MAX_LCORE', 128)
> +dpdk_conf.set('RTE_MAX_NUMA_NODES', 8)
>  # the minimum architecture supported, armv7-a, needs the following,
>  machine_args += '-mfpu=neon'
>  endif
> --
> 2.20.1
Acked-by: Ruifeng Wang