> -----Original Message-----
> From: Chenbo Xia <chen...@nvidia.com>
> Sent: Monday, October 21, 2024 2:35 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithan...@arm.com>
> Cc: dev@dpdk.org; nd <n...@arm.com>
> Subject: Re: [RFC v3 0/2] An API for Stashing Packets into CPU caches
>
> Hi,
>
> > On Oct 21, 2024, at 09:52, Wathsala Vithanage
> > <wathsala.vithan...@arm.com> wrote:
> >
> > DPDK applications benefit from Direct Cache Access (DCA) features like
> > Intel DDIO and Arm's write-allocate-to-SLC. However, those features do
> > not allow fine-grained control of direct cache access, such as
> > stashing packets into upper-level caches (L2 caches) of a processor or
> > the shared cache of a chiplet. PCIe TLP Processing Hints (TPH)
> > addresses this need in a vendor-agnostic manner. TPH capability has
> > existed since PCI Express Base Specification revision 3.0; today,
> > numerous Network Interface Cards and interconnects from different
> > vendors support TPH capability. TPH comprises a steering tag (ST) and
> > a processing hint (PH). ST specifies the cache level of a CPU at which
> > the data should be written to (or DCAed into), while PH is a hint
> > provided by the PCIe requester to the completer on an upcoming traffic
> > pattern. Some NIC vendors bundle TPH capability with fine-grained
> > control over the type of objects that can be stashed into CPU caches,
> > such as
> >
> > - Rx/Tx queue descriptors
> > - Packet-headers
> > - Packet-payloads
> > - Data from a given offset from the start of a packet
> >
> > Note that stashable object types are outside the scope of PCIe
> > standard; therefore, vendors could support any combination of the
> > above items as they see fit.
> >
> > To enable TPH and fine-grained packet stashing, this API extends the
> > ethdev library, PCI library, and the PCI driver. In this design, the
> > application via the ethdev stashing API provides hints to the PMD to
> > indicate the underlying hardware at which processor and cache level it
> > prefers a packet to end up. Once the PMD receives a CPU and a
> > cache-level combination, it must extract the matching ST from the TPH
> > ACPI _DSM of the PCIe root port to which the NIC is connected. To
> > facilitate the extraction of STs, the PCI library and the PCI driver
> > APIs are extended.
> >
> > PMD's implementation of eth_dev_ops stashing_rx_hints_set and
> > stashing_tx_hints_set function pointers are responsible for extracting
> > the ST. The PCI bus driver provides the generic TPH ST extraction API
> > that can be used by any PMD that drives a PCIe device. The extraction
> > process begins by calling rte_pci_extract_tph_st() function in
> > drivers/bus/pci/rte_bus_pci.h, which takes an initialized input object
> > rte_tph_acpi__dsm_args and a pointer to rte_tph_acpi__dsm_return to
> > store the ST returned by the TPH _DSM. rte_tph_acpi__dsm_arg and
> > rte_tph_acpi__dsm_return objects are defined in lib/pci/rte_pci_tph.h
> > as defined by the PCIe firmware specification and the associated ECN
> > titled "Revised _DSM for Cache Locality TPH Features". The helper
> > function rte_init_tph_acpi__dsm_args is used by the
> > rte_pci_extract_tph_st() to convert lcore_id and cache_level provided
> > by the PMD into well-formatted rte_tph_acpi__dsm_args. The processor
> > or, in some cases, a container ID (which is synonymous with a core
> > complex of a chiplet die) and the cache level in the
> > rte_tph_acpi__dsm_args structure are not the same as the lcore_id and
> > the cache_level provided by the application to the ethdev library,
> > which PMD passes down to the rte_pci_extract_st() function. The
> > rte_init_tph_acpi__dsm_args helper converts lcore_id to an APIC
> > processor-id or a PPTT processor-container-id if the container of the
> > lcore_id was requested as the target by the application. Similarly, it
> > must convert cache_level to a PPTT cache-reference-id. These
> > conversions are possible with the hwloc library or some other library
> > DPDK may eventually provide. However, DPDK cannot execute the TPH _DSM
> > directly, as it can only be done with kernel privileges. Therefore,
> > appropriate mechanisms must be established in supported Operating
> > Systems (Linux, FreeBSD, and Windows) to expose the _DSM return for a
> > given argument.
> > For instance, on Linux, this mechanism could be sysfs. Therefore, the
> > implementation of rte_pci_extract_tph_st() is done in OS-specific
> > files drivers/bus/pci/{bsd, linux, windows}/pci.c.
> >
> > Once the ST is acquired from the OS-specific method described earlier,
> > the stashing_rx_hints_set/stashing_tx_hints_set PMD implementations
> > are ready to set the ST. As per PCIe specification, hints can be put
> > on the MSI-X tables or using a device-specific method. Considering
> > this, many NICs that support TPH allow setting steering tags and
> > processing hints on the device's MSI-X table and queue contexts. For
> > PMDs, setting the ST on queue contexts is the only viable method of
> > using TPH. Therefore, the DPDK can only support setting ST in queue
> > contexts. An application uses the cache stashing ethdev API by first
> > calling the
> > rte_eth_dev_stashing_capabilities_get() function to find out what
> > object types can be stashed into a processor cache by the NIC out of
> > the object types in the bulleted list above. This function takes a
> > port_id and a pointer to a uint16_t to report back the object type
> > flags. PMD implements the stashing_capabilities_get function pointer
> > in eth_dev_ops. If the underlying platform or the NIC does not support
> > TPH, this function returns -ENOTSUP and the application should
> > consider any values stored in the objects pointer invalid.
> >
> > Once the application knows the supported object types that can be
> > stashed, the next step is to set the steering tags for the packets
> > associated with Rx and Tx queues via
> > rte_eth_dev_stashing_rx_config_set() and
> > rte_eth_dev_stashing_tx_config_set() ethdev library function
> > respectively. These functions execute the rte_pci_extract_tph_st()
> > via eth_dev_ops pointers stashing_rx_hints_set and stashing_tx_hints_set.
> > Both the functions have an identical signature, a port_id, a queue_id,
> > and a config object. The port_id and the queue-id are used to locate
> > the device and the queue. The config object is of type struct
> > rte_eth_stashing_config, which specifies the lcore_id and the
> > cache_level, indicating where objects from this queue should be stashed.
> > It also has the field 'container' to indicate if the target should be
> > the container of the processor specified by the lcore_id in a
> > chiplet-based SoC. The 'objects' field in the config sets the types of
> > objects the application wishes to stash based on the capabilities
> > found earlier. If the objects field includes the flag
> > RTE_ETH_DEV_STASH_OBJECT_OFFSET, the 'offset' field must be used to
> > set the desired offset. These functions invoke PMD implementations of
> > the stashing functionality via stashing_rx_hints_set and
> > stashing_tx_hints_set, function pointers in eth_dev_ops, respectively.
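
For readers following the flow in the cover letter above, here is a minimal sketch of the application side: query the capabilities, then configure an Rx queue. It is not the RFC code. The two mock_* functions are stand-ins for rte_eth_dev_stashing_capabilities_get() and rte_eth_dev_stashing_rx_config_set(); the struct only mirrors the fields the cover letter names, and the field types, flag values, the STASH_OBJECT_HEADER name, and the -EINVAL check are all assumptions made so the example compiles without DPDK.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag bits. Only RTE_ETH_DEV_STASH_OBJECT_OFFSET is named in
 * the cover letter; the bit positions and the HEADER flag are assumptions. */
#define STASH_OBJECT_HEADER (UINT16_C(1) << 1)
#define STASH_OBJECT_OFFSET (UINT16_C(1) << 3)

/* Mirrors the fields described for struct rte_eth_stashing_config.
 * Field types are assumptions. */
struct stashing_config {
    uint32_t lcore_id;    /* lcore whose cache is the target */
    uint8_t  cache_level; /* cache level to stash into */
    uint8_t  container;   /* non-zero: target the lcore's container */
    uint16_t objects;     /* object-type flags to stash */
    uint32_t offset;      /* used only with STASH_OBJECT_OFFSET */
};

/* Mock of rte_eth_dev_stashing_capabilities_get(): pretends that only
 * port 0 sits on a TPH-capable platform. */
static int mock_capabilities_get(uint16_t port_id, uint16_t *objects)
{
    if (port_id != 0)
        return -ENOTSUP; /* *objects must be treated as invalid */
    *objects = STASH_OBJECT_HEADER | STASH_OBJECT_OFFSET;
    return 0;
}

/* Mock of rte_eth_dev_stashing_rx_config_set(). Rejecting unsupported
 * object types with -EINVAL is an assumption, not stated in the RFC. */
static int mock_rx_config_set(uint16_t port_id, uint16_t queue_id,
                              const struct stashing_config *cfg)
{
    uint16_t caps = 0;
    int ret = mock_capabilities_get(port_id, &caps);

    if (ret != 0)
        return ret;
    if ((cfg->objects & ~caps) != 0)
        return -EINVAL;
    printf("port %u rxq %u: stash objects 0x%x at lcore %u, cache level %u\n",
           (unsigned)port_id, (unsigned)queue_id, (unsigned)cfg->objects,
           (unsigned)cfg->lcore_id, (unsigned)cfg->cache_level);
    return 0;
}

/* Application-side flow: check capabilities first, then request only
 * object types the device reported as stashable. */
int stash_rx_queue(uint16_t port_id, uint16_t queue_id, uint32_t lcore_id)
{
    uint16_t caps = 0;
    int ret = mock_capabilities_get(port_id, &caps);

    if (ret != 0)
        return ret; /* e.g. -ENOTSUP: do not read caps */

    struct stashing_config cfg = {
        .lcore_id = lcore_id,
        .cache_level = 2, /* L2, as in the cover letter's example */
        .container = 0,
        .objects = (uint16_t)(caps & STASH_OBJECT_HEADER),
        .offset = 0,      /* unused: OFFSET flag not requested */
    };
    return mock_rx_config_set(port_id, queue_id, &cfg);
}
```

Calling stash_rx_queue(0, 0, 1) succeeds against the mock, while any other port returns -ENOTSUP, which is the case where the application should ignore the reported flags.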
> >
> >
> > Wathsala Vithanage (2):
> >   pci: introduce the PCIe TLP Processing Hints API
> >   ethdev: introduce the cache stashing hints API
> >
> >  drivers/bus/pci/bsd/pci.c     |  12 +++
> >  drivers/bus/pci/linux/pci.c   |  12 +++
> >  drivers/bus/pci/rte_bus_pci.h |  22 +++++
> >  drivers/bus/pci/version.map   |   3 +
> >  drivers/bus/pci/windows/pci.c |  14 +++
> >  lib/ethdev/ethdev_driver.h    |  66 ++++++++++++++
> >  lib/ethdev/rte_ethdev.c       | 120 ++++++++++++++++++++++++++
> >  lib/ethdev/rte_ethdev.h       | 156 ++++++++++++++++++++++++++++++++++
> >  lib/ethdev/version.map        |   4 +
> >  lib/pci/meson.build           |   2 +
> >  lib/pci/rte_pci.h             |   2 +
> >  lib/pci/rte_pci_tph.c         |  20 +++++
> >  lib/pci/rte_pci_tph.h         | 111 ++++++++++++++++++++++++
> >  13 files changed, 544 insertions(+)
> >  create mode 100644 lib/pci/rte_pci_tph.c
> >  create mode 100644 lib/pci/rte_pci_tph.h
> >
> > --
> > 2.34.1
> >
>
> Do you have some numbers about how much performance this feature can
> improve?
>
This patch needs additional work in the Linux kernel before it can function. I plan to test it soon on a supported HW platform by hardcoding some STs. The TPH enablement patch for the kernel reports a significant improvement:
https://patchew.org/linux/20240927215653.1552411-1-wei.hua...@amd.com/
I hope it will improve performance in DPDK too.

Please join the call scheduled for 10/23/24 to discuss what we need in the OS to support this feature:
https://inbox.dpdk.org/dev/pawpr08mb890901574a3113840e7d7ccc9f...@pawpr08mb8909.eurprd08.prod.outlook.com/

Thanks.

--wathsala