On Thu, Feb 14, 2019 at 12:22 PM Alejandro Lucero <alejandro.luc...@netronome.com> wrote:
> On Wed, Feb 13, 2019 at 7:24 PM Shahaf Shuler <shah...@mellanox.com> wrote:
>
>> Wednesday, February 13, 2019 1:43 PM, Alejandro Lucero:
>> > Subject: Re: [dpdk-dev] [PATCH 0/6] introduce DMA memory mapping for
>> > external memory
>> >
>> > On Wed, Feb 13, 2019 at 9:11 AM Shahaf Shuler <shah...@mellanox.com> wrote:
>> >
>> > > This series is a continuation of RFC[1].
>> > >
>> > > The DPDK APIs expose 3 different modes to work with memory used for DMA:
>> > >
>> > > 1. Use the DPDK-owned memory (backed by the DPDK-provided hugepages).
>> > > This memory is allocated by the DPDK libraries, included in the DPDK
>> > > memory system (memseg lists) and automatically DMA mapped by the DPDK
>> > > layers.
>> > >
>> > > 2. Use memory allocated by the user and registered to the DPDK memory
>> > > system. This is also referred to as external memory. Upon registration
>> > > of the external memory, the DPDK layers will DMA map it to all needed
>> > > devices.
>> > >
>> > > 3. Use memory allocated by the user and not registered to the DPDK
>> > > memory system. This is for users who want tight control over this
>> > > memory. The user will need to explicitly call the DMA map function in
>> > > order to register such memory to the different devices.
>> > >
>> > > The scope of this patch series focuses on #3 above.
>> > >
>> >
>> > Why can we not have case 2 cover case 3?
>>
>> Because it is not our choice, rather the DPDK application's.
>> We could disallow it and force the application to register their
>> external memory to the DPDK memory management system. However, IMO that
>> would be wrong.
>> The use case exists - some applications want to manage their memory by
>> themselves: without the extra overhead of rte_malloc, without creating a
>> special socket to populate the memory, and without redundant API calls to
>> rte_extmem_*.
>>
>> Simply allocate a chunk of memory, DMA map it to the device, and that's it.
>>
>
> Usability is a strong point, but only up to some extent. DPDK is all about
> performance, and adding options the user can choose from adds pressure
> and complexity when it comes to keeping that performance. Your proposal
> makes sense from a user's point of view, but will it avoid having to
> modify things in the DPDK core to support this case more broadly in the
> future? Multiprocess will be hard to get, if not impossible, without
> adding more complexity, and although you likely do not expect that use
> case to require multiprocess support, once we have DPDK apps using this
> model, sooner or later those companies with products based on such an
> option will demand broader support. I can foresee that not just
> multiprocess support will require changes in the future.
>
> This reminds me of the case of obtaining real time: the more complexity,
> the less determinism can be obtained. It is not impossible, it is simply
> far more complex. Pure real-time operating systems can add new
> functionality, but it is hard to do it properly without jeopardising the
> main goal. General-purpose operating systems can try to improve
> determinism, but only up to some extent and with significant complexity
> costs. DPDK is the real-time operating system in this comparison.
>
>> >
>> > > Currently the only way to map external memory is through VFIO
>> > > (rte_vfio_dma_map). While VFIO is common, there are other vendors
>> > > which use different ways to map memory (e.g. Mellanox and NXP).
>> > >
>> >
>> > As you say, VFIO is common, and when allowing DMAs programmed in user
>> > space, the right thing to do.
>>
>> It is common indeed. Why is it the right thing to do?
>>
>
> Compared with UIO, for sure. VFIO does have the right view of the system
> in terms of which devices can properly be isolated. Can you confirm a
> specific implementation by a vendor can ensure the same behaviour? If so,
> do you have duplicated code then? If the answer is that you are using
> VFIO data, why not use VFIO as the interface and add the required
> connection between VFIO and the drivers?
>
> What about mapping validation? Is the driver doing that part or relying
> on kernel code? Or is it just assuming the mapping is safe?
>
>> > I'm assuming there is IOMMU hardware and
>> > this is what Mellanox and NXP rely on in some way or another.
>>
>> For Mellanox, the device works with virtual memory, not physical. If you
>> think about it, it is more secure for a user space application. The
>> Mellanox device has an internal memory translation unit between virtual
>> memory and physical memory.
>> IOMMU can be added on top of it, in case the host doesn't trust the
>> device or the device is given to an untrusted entity like a VM.
>>
>
> Any current NIC or device will work with virtual addresses if an IOMMU is
> in place, no matter whether the device is IOMMU-aware or not. Any vendor
> with that capability in their devices should follow generic paths and a
> common interface, with the vendor drivers being the executors. The
> drivers know how to tell the device, but they should be told what to
> tell, and not by the user but by the kernel.
>
> I think reading your comment "in case the host doesn't trust the device"
> makes it easier to understand what you are trying to obtain, and at the
> same time makes my concerns not a problem at all. This is a case for DPDK
> being used in certain scenarios where the full system is trusted, which I
> think is a completely rightful option. My only concern then is the
> complexity it could imply sooner or later, although I have to admit it is
> not a strong one :-)
>
> I forgot to mention the problem of leaving that option open in not fully
> trusted systems. I do not know how it could be avoided, maybe with some
> checks in EAL initialization, but maybe this is not possible at all.
> Anyway, I think this is worth discussing further.
>
>> > Having each driver doing things in its own way will end up in a system
>> > that is harder to validate.
>>
>> Different vendors will have different HW implementations. We cannot
>> force everybody to align with the IOMMU.
>> What we can do is ease the user's life and provide vendor-agnostic APIs
>> which just provide the needed functionality. In our case, DMA map and
>> unmap.
>> The user should not care whether it is IOMMU, Mellanox memory
>> registration through verbs, or NXP special mapping.
>>
>> The sys admin should set/unset the IOMMU as a general means of
>> protection, and this of course will also work with Mellanox devices.
>>
>> > If there is IOMMU hardware, the same mechanism should always be used,
>> > leaving the IOMMU HW-specific implementation to deal with the details.
>> > If a NIC is IOMMU-able, that should not be supported by specific vendor
>> > drivers but through a generic solution like VFIO, which will validate a
>> > device with such capability and perform the required actions for that
>> > case. VFIO and IOMMU should be modified as needed to support this
>> > requirement instead of leaving vendor drivers to implement their own
>> > solution.
>>
>> Again - I am against forcing every PCI device to use VFIO, and I don't
>> think the IOMMU as a HW device should control other PCI devices.
>> I see nothing wrong with a device which also has extra memory
>> translation capabilities and adds another level of security for the user
>> application.
>>
>
> In a system with untrusted components using the device, a generic way of
> properly configuring the system with the right protections should be used
> instead of relying on a specific vendor implementation.
>
>> > In any case, I think this support should be in a different patchset
>> > than the private user space mappings.
>> >
>> > > The work in this patch moves the DMA mapping to vendor-agnostic APIs.
>> > > New map and unmap ops were added to the rte_bus structure.
>> > > Implementation of those was currently done only for the PCI bus. The
>> > > implementation takes the driver map and unmap implementation as a
>> > > bypass to the VFIO mapping.
>> > > That is, in case there is no specific map/unmap from the PCI driver,
>> > > VFIO mapping, if possible, will be used.
>> > >
>> > > Application use with those APIs is quite simple:
>> > > * allocate memory
>> > > * take a device, and query its rte_device.
>> > > * call the bus map function for this device.
>> > >
>> > > Future work will deprecate the rte_vfio_dma_map and rte_vfio_dma_unmap
>> > > APIs, leaving the PCI device APIs as the preferred option for the user.
>> > >
>> > > [1] https://patches.dpdk.org/patch/47796/
>> > >
>> > > Shahaf Shuler (6):
>> > >   vfio: allow DMA map of memory for the default vfio fd
>> > >   vfio: don't fail to DMA map if memory is already mapped
>> > >   bus: introduce DMA memory mapping for external memory
>> > >   net/mlx5: refactor external memory registration
>> > >   net/mlx5: support PCI device DMA map and unmap
>> > >   doc: deprecate VFIO DMA map APIs
>> > >
>> > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>> > >  doc/guides/rel_notes/deprecation.rst             |   4 +
>> > >  drivers/bus/pci/pci_common.c                     |  78 +++++++
>> > >  drivers/bus/pci/rte_bus_pci.h                    |  14 ++
>> > >  drivers/net/mlx5/mlx5.c                          |   2 +
>> > >  drivers/net/mlx5/mlx5_mr.c                       | 232 ++++++++++++++++---
>> > >  drivers/net/mlx5/mlx5_rxtx.h                     |   5 +
>> > >  lib/librte_eal/common/eal_common_bus.c           |  22 ++
>> > >  lib/librte_eal/common/include/rte_bus.h          |  57 +++++
>> > >  lib/librte_eal/common/include/rte_vfio.h         |  12 +-
>> > >  lib/librte_eal/linuxapp/eal/eal_vfio.c           |  26 ++-
>> > >  lib/librte_eal/rte_eal_version.map               |   2 +
>> > >  12 files changed, 418 insertions(+), 38 deletions(-)
>> > >
>> > > --
>> > > 2.12.0
>> > >
>> >
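A minimal sketch of the three-step usage described in the cover letter (allocate memory, take a device and query its rte_device, call the map function). It assumes IOVA-as-VA mode and a device-level helper named rte_dev_dma_map(); that name and signature stand in for the "bus map function" mentioned above and are not quoted from the patches themselves.

#include <stdint.h>
#include <sys/mman.h>

#include <rte_dev.h>
#include <rte_ethdev.h>

/*
 * Mode #3 flow: memory owned entirely by the application, DMA mapped to a
 * single device. rte_dev_dma_map() is assumed to wrap the new bus map op.
 */
static void *
alloc_and_dma_map(uint16_t port_id, size_t len)
{
	struct rte_eth_dev_info dev_info;
	void *buf;

	/* 1. Allocate memory outside the DPDK allocator (a real application
	 *    would typically use hugepage-backed memory here).
	 */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* 2. Take a device and query its generic rte_device handle. */
	rte_eth_dev_info_get(port_id, &dev_info);

	/* 3. DMA map the buffer for this device only. With IOVA-as-VA the
	 *    virtual address doubles as the IOVA.
	 */
	if (rte_dev_dma_map(dev_info.device, buf,
			    (uint64_t)(uintptr_t)buf, len) != 0) {
		munmap(buf, len);
		return NULL;
	}

	return buf;
}

The teardown path would mirror this: a corresponding unmap call for the same device, followed by munmap() of the buffer.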