On 01/02/2018 16:24, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>> [...]
>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>
>>>>> It's a side effect of the kernel/userspace API, which always wants
>>>>> a single HVA/len pair to map memory for the application.
>>>>
>>>> Hi Eduardo and Michael,
>>>>
>>>>>> Can this be fixed?
>>>>>
>>>>> I think yes. It'd need to be a kernel patch for the RDMA subsystem
>>>>> mapping an s/g list with actual memory. The HVA/len pair would then
>>>>> just be used to refer to the region, without creating the two mappings.
>>>>>
>>>>> Something like splitting the register mr into
>>>>>
>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>
>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>
>>>>> register mr - pass it to HW
>>>>>
>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>
>>>> We would still need a contiguous virtual address space range (for
>>>> post-send), which we don't have, since a contiguous guest virtual
>>>> address range will always end up as non-contiguous host virtual
>>>> address space.
>>>>
>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>
>>> I'm confused. Why would the hardware see and care about virtual
>>> addresses?
>>
>> The post-send operations bypass the kernel, and the process puts
>> GVA addresses in the work request.
>
> To be more precise, it's the guest-supplied IOVA that is sent to the card.
>
>>> How exactly does the hardware translate VAs to PAs?
>>
>> The HW maintains its own page-directory-like structure, separate from
>> the MMU, mapping VAs to physical pages.
>>
>>> What if the process page tables change?
>>
>> Since the page tables the HW uses are its own, we just need the
>> physical pages to be pinned.
>>
>>>> An alternative would be a 0-based MR: QEMU intercepts the post-send
>>>> operations and can subtract the guest VA base address.
>>>> However, I didn't see an implementation of 0-based MRs in the kernel,
>>>> and the RDMA maintainer also said it would work for local keys
>>>> but not for remote keys.
>>>
>>> This is also unexpected: are GVAs visible to the virtual RDMA
>>> hardware?
>>
>> Yes, as explained above.
>>
>>> Where does the QEMU pvrdma code translate GVAs to GPAs?
>>
>> During reg_mr (the memory registration commands).
>> It then registers the same addresses with the real HW
>> (as host virtual addresses).
>>
>> Thanks,
>> Marcel
>
> The full fix would be to allow QEMU to map a list of
> pages to a guest supplied IOVA.
Agreed, we are trying to influence the RDMA discussion on the new API in
this direction; rough sketches of the current limitation and of the
proposed split are at the end of this mail.

Thanks,
Marcel

>>>>> This will fix rdma with hugetlbfs as well, which is currently broken.
>>>>
>>>> There is already a discussion on the linux-rdma list:
>>>>   https://www.spinics.net/lists/linux-rdma/msg60079.html
>>>> But it will take some (actually a lot of) time; we are currently
>>>> talking about a possible API. And it does not solve the re-mapping...
>>>>
>>>> Thanks,
>>>> Marcel
>>>>
>>>>>> --
>>>>>> Eduardo
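
For context, a minimal, simplified sketch (not the actual QEMU pvrdma code)
of the constraint discussed above: ibv_reg_mr() takes a single
(addr, length) pair, so registering a guest range today requires one
contiguous host-virtual mapping covering all of it.

/*
 * Simplified illustration of why the verbs API forces a single contiguous
 * HVA/len pair per registration: ibv_reg_mr() only accepts one
 * (addr, length) pair, so a guest range backed by several discontiguous
 * host-virtual chunks cannot be registered directly.
 */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdio.h>

static struct ibv_mr *register_guest_range(struct ibv_pd *pd,
                                           void *hva, size_t len)
{
    /* One contiguous host-virtual range is mandatory here. */
    struct ibv_mr *mr = ibv_reg_mr(pd, hva, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
    }
    return mr;
}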
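
And a hypothetical sketch of the create/add/register split suggested above.
None of these functions exist in today's RDMA stack; hyp_create_mr(),
hyp_mr_add_memory() and hyp_register_mr() are made-up names used only to
illustrate how discontiguous host-virtual chunks could be bound to a
guest-supplied IOVA.

/*
 * Hypothetical API sketch; not an existing kernel or verbs interface.
 */
#include <stddef.h>
#include <stdint.h>

struct hyp_mr;  /* opaque handle for a memory region under construction */

/* Allocate a handle and record the guest-supplied IOVA and total length. */
struct hyp_mr *hyp_create_mr(uint64_t iova, size_t len);

/* Pin one host-virtual chunk and attach it at an offset inside the region,
 * so the backing memory never has to be contiguous in host VA space. */
int hyp_mr_add_memory(struct hyp_mr *mr, size_t offset, void *hva, size_t len);

/* Hand the assembled page list to the HCA. */
int hyp_register_mr(struct hyp_mr *mr);

struct chunk {
    void *hva;
    size_t len;
};

/* Register a guest range backed by 'n' discontiguous host-virtual chunks. */
static struct hyp_mr *register_scattered(uint64_t guest_iova,
                                         const struct chunk *chunks, int n)
{
    size_t total = 0, off = 0;

    for (int i = 0; i < n; i++) {
        total += chunks[i].len;
    }

    struct hyp_mr *mr = hyp_create_mr(guest_iova, total);
    if (!mr) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        if (hyp_mr_add_memory(mr, off, chunks[i].hva, chunks[i].len) < 0) {
            return NULL;
        }
        off += chunks[i].len;
    }
    return hyp_register_mr(mr) == 0 ? mr : NULL;
}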