Re: [PATCH 4/7] alpha: provide ioread64 and iowrite64 implementations
> +#define iowrite64be(v,p) iowrite32(cpu_to_be64(v), (p))

Logan, thanks for taking this cleanup on. I think this should be
iowrite64, not iowrite32?

Stephen
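For reference, a minimal sketch of the definition Stephen seems to be
suggesting, swapping the 32-bit accessor for the 64-bit one. This is only
an illustration; the helpers used (iowrite64, cpu_to_be64) are assumed to
come from the rest of the patch series and the usual byteorder headers.

    #include <linux/io.h>          /* iowrite64() on arches that provide it */
    #include <asm/byteorder.h>     /* cpu_to_be64() */

    /* Sketch only: the big-endian 64-bit write built on the 64-bit
     * accessor, mirroring the 32-bit pattern used elsewhere in the patch. */
    #define iowrite64be(v, p)      iowrite64(cpu_to_be64(v), (p))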
Enabling peer to peer device transactions for PCIe devices
>> The NVMe fabrics stuff could probably make use of this. It's an
>> in-kernel system to allow remote access to an NVMe device over RDMA.
>> So they ought to be able to optimize their transfers by DMAing
>> directly to the NVMe's CMB -- no userspace interface would be
>> required, but there would need to be some kernel infrastructure.
>
> Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB
> for RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it
> would be relatively easy to do.

Haggai, yes, that was one of the use cases we considered when we put
together the patchset.
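To make the mapping being discussed a little more concrete, here is a
hedged sketch of how a driver might expose a CMB-style BAR window. The
helper name, BAR index, offset and size are hypothetical, and the sketch
uses a plain ioremap_wc() mapping; the ZONE_DEVICE route mentioned above
(as in the iopmem patches) would instead register the same physical range
so it gains struct pages and can be handed out as a DMA target.

    #include <linux/pci.h>
    #include <linux/io.h>

    /*
     * Hypothetical helper: map a window of a controller BAR (e.g. a CMB)
     * as write-combining I/O memory. A ZONE_DEVICE-based design would
     * instead pass the same physical range to devm_memremap_pages() so
     * the region is backed by struct pages and usable as a peer-to-peer
     * DMA target; that call's signature varies between kernel versions,
     * so it is left out of this sketch.
     */
    static void __iomem *map_cmb_window(struct pci_dev *pdev, int bar,
                                        resource_size_t offset,
                                        resource_size_t size)
    {
            if (offset + size > pci_resource_len(pdev, bar))
                    return NULL;

            return ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
    }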
Enabling peer to peer device transactions for PCIe devices
Hi All

This has been a great thread (thanks to Alex for kicking it off) and I
wanted to jump in and maybe try to put some summary around the
discussion. I also wanted to propose we include this as a topic for
LSF/MM, because I think we need more discussion on the best way to add
this functionality to the kernel.

As far as I can tell, the people looking for P2P support in the kernel
fall into two main camps:

1. Those who simply want to expose static BARs on PCIe devices that can
   be used as the source/destination for DMAs from another PCIe device.
   This group has no need for memory invalidation and is happy to use
   physical/bus addresses rather than virtual addresses.

2. Those who want to support devices that suffer from occasional memory
   pressure and need to invalidate memory regions from time to time.
   This camp would also like to use virtual addresses rather than
   physical ones to allow for things like migration.

I am wondering if people agree with this assessment?

I think something like the iopmem patches Logan and I submitted recently
comes close to addressing use case 1. There are some issues around
routability, but based on feedback to date that does not seem to be a
show-stopper for an initial inclusion. For use case 2 it looks like
there are several options, and some of them (like HMM) have been around
for quite some time without gaining acceptance. I think there needs to
be more discussion on this use case, and it could be some time before we
get something upstreamable.

I, for one, would really like to see use case 1 get addressed soon
because we have consumers for it coming soon in the form of CMBs for
NVMe devices.

Long term, I think Jason summed it up really well: CPU vendors will put
high-speed, open, switchable, coherent buses on their processors and all
these problems will vanish. But I ain't holding my breath for that to
happen ;-).

Cheers

Stephen
Enabling peer to peer device transactions for PCIe devices
>>> I've already recommended that iopmem not be a block device and
>>> instead be a device-dax instance. I also don't think it should claim
>>> the PCI ID, rather the driver that wants to map one of its bars this
>>> way can register the memory region with the device-dax core.
>>>
>>> I'm not sure there are enough device drivers that want to do this to
>>> have it be a generic /sys/.../resource_dmableX capability. It still
>>> seems to be an exotic one-off type of configuration.
>>
>> Yes, this is essentially my thinking. Except I think the userspace
>> interface should really depend on the device itself. Device dax is a
>> good choice for many and I agree the block device approach wouldn't be
>> ideal.

I tend to agree here. The block device interface has seen quite a bit of
resistance and /dev/dax looks like a better approach for most. We can
look at doing it that way in v2.

>> Specifically for NVMe CMB: I think it would make a lot of sense to just
>> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
>> buffers would be volatile and thus you wouldn't need to keep track of
>> where in the BAR the region came from. Thus, the mmap call would just
>> be an allocator from BAR memory. If device-dax were used, userspace
>> would need to look up which device-dax instance corresponds to which
>> nvme drive.
>
> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path of
> the device-dax instance under the nvme device, or if you already have
> the nvme sysfs path the dax instance(s) will appear under the "dax"
> sub-directory.

Personally I think mapping the dax resource in the sysfs tree is a nice
way to do this and a bit more intuitive than mapping /dev/nvmeX.
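As a userspace illustration of the /sys/dev/char lookup and device-dax
mmap being discussed, here is a hedged sketch. The device node name
(/dev/dax0.0) and the 2 MiB mapping size are only examples; real
device-dax instances impose their own alignment and size constraints.

    /* Hypothetical userspace sketch: resolve a device-dax node's sysfs
     * path via /sys/dev/char/<major>:<minor>, then mmap the dax
     * character device. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <limits.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/sysmacros.h>

    int main(void)
    {
            const char *devpath = "/dev/dax0.0"; /* example device-dax node */
            struct stat st;
            char link[64], target[PATH_MAX];
            ssize_t n;

            if (stat(devpath, &st) < 0) { perror("stat"); return 1; }

            /* /sys/dev/char/<maj>:<min> is a symlink (possibly relative)
             * to the device's sysfs directory. */
            snprintf(link, sizeof(link), "/sys/dev/char/%u:%u",
                     major(st.st_rdev), minor(st.st_rdev));
            n = readlink(link, target, sizeof(target) - 1);
            if (n > 0) {
                    target[n] = '\0';
                    printf("sysfs path: %s\n", target);
            }

            int fd = open(devpath, O_RDWR);
            if (fd < 0) { perror("open"); return 1; }

            /* Map 2 MiB of the BAR-backed region (size is illustrative). */
            void *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            munmap(p, 2UL << 20);
            close(fd);
            return 0;
    }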
Re: Enabling peer to peer device transactions for PCIe devices
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>> Make a generic API for all of this and you'd have my vote..
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware
>> can clearly do that if it can support ODP.
>
> I agree completely.
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do
> what it pleases with them. This makes _so_ much sense and actually
> largely already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like
to see something upstream that enables P2P, even if it is only the
minimum viable useful functionality to begin with. I think aiming for
the moon (which is what HMM and things like it are doing) is simply
going to take more time, if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs, and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above, and while it is
not an in-kernel API it could have been modified into one reasonably
easily. As Logan states, the driver can then choose to pass the VMAs to
user-space in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic, so if you are interested please respond
to the email letting the committee know this topic is of interest to
you [1].

Also earlier in the thread someone discussed the issues around the
IOMMU. Given the known issues around P2P transfers in certain CPU root
complexes [2], it might just be a case of only allowing P2P when a PCIe
switch connects the two EPs. Another option is to use CONFIG_EXPERT and
make sure people are aware of the pitfalls if they invoke the P2P
option.

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However, given that these
interfaces are the remit of the CPU vendors, I think it behooves us to
solve this problem before then. Also, some of the above-mentioned
protocols are not even switchable and may not be amenable to a P2P
topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119
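On the point about restricting P2P to endpoints that sit behind a PCIe
switch, a rough sketch of what such a check might look like follows. It
is illustrative only: it compares the first switch upstream port found
above each device and ignores nested switches, root-complex quirks and
ACS configuration, all of which a real policy would need to handle; the
function names are made up for the example.

    #include <linux/pci.h>

    /* Walk up from an endpoint until a PCIe switch upstream port is
     * found (or we run out of bridges). */
    static struct pci_dev *find_switch_upstream_port(struct pci_dev *pdev)
    {
            struct pci_dev *up = pci_upstream_bridge(pdev);

            while (up) {
                    if (pci_is_pcie(up) &&
                        pci_pcie_type(up) == PCI_EXP_TYPE_UPSTREAM)
                            return up;
                    up = pci_upstream_bridge(up);
            }
            return NULL;
    }

    /* True only if both endpoints hang off the same switch upstream
     * port, i.e. traffic between them need not traverse the root
     * complex. */
    static bool p2p_behind_same_switch(struct pci_dev *a, struct pci_dev *b)
    {
            struct pci_dev *ua = find_switch_upstream_port(a);
            struct pci_dev *ub = find_switch_upstream_port(b);

            return ua && ua == ub;
    }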