On Thu, Feb 01, 2018 at 08:58:32PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 20:51, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 20:21, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>>>>>> [...]
> >>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>>>>>
> >>>>>>>>> It's a side effect of the kernel/userspace API, which always wants
> >>>>>>>>> a single HVA/len pair to map memory for the application.
> >>>>>>>>
> >>>>>>>> Hi Eduardo and Michael,
> >>>>>>>>
> >>>>>>>>>> Can this be fixed?
> >>>>>>>>>
> >>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA
> >>>>>>>>> subsystem mapping an s/g list with actual memory.  The HVA/len
> >>>>>>>>> pair would then just be used to refer to the region, without
> >>>>>>>>> creating the two mappings.
> >>>>>>>>>
> >>>>>>>>> Something like splitting the register mr into:
> >>>>>>>>>
> >>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>>>>>
> >>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>>>>>
> >>>>>>>>> register mr - pass it to HW
> >>>>>>>>>
> >>>>>>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>>>>
> >>>>>>>> We would still need a contiguous virtual address space range (for
> >>>>>>>> post-send), which we don't have, since a contiguous guest virtual
> >>>>>>>> address space will always end up as a non-contiguous host virtual
> >>>>>>>> address space.
> >>>>>>>>
> >>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>>>>>
> >>>>>>> I'm confused.  Why would the hardware see and care about virtual
> >>>>>>> addresses?
> >>>>>>
> >>>>>> The post-send operation bypasses the kernel, and the process puts
> >>>>>> GVA addresses in the work requests.
> >>>>>>
> >>>>>>> How exactly does the hardware translate VAs to PAs?
> >>>>>>
> >>>>>> The HW maintains a page-directory-like structure, separate from the
> >>>>>> MMU, mapping VA -> phys pages.
> >>>>>>
> >>>>>>> What if the process page tables change?
> >>>>>>
> >>>>>> Since the page tables the HW uses are its own, we just need the
> >>>>>> phys pages to be pinned.
> >>>>>
> >>>>> So there's no hardware-imposed requirement that the hardware VAs
> >>>>> (mapped by the HW page directory) match the VAs in the QEMU
> >>>>> address-space, right?
> >>>>
> >>>> Actually there is.  Today it works exactly as you described.
> >>>
> >>> Are you sure there's such a hardware-imposed requirement?
> >>
> >> Yes.
> >>
> >>> Why would the hardware require VAs to match the ones in the
> >>> userspace address-space, if it doesn't use the CPU MMU at all?
> >>
> >> It works like this:
> >>
> >> 1. We register a buffer from the process address space,
> >>    giving its base address and length.
> >>    This call goes to the kernel, which in turn pins the phys pages
> >>    and registers them with the device *together* with the base
> >>    address (a virtual address!).
> >> 2. The device builds its own page tables so it can translate
> >>    those virtual addresses to actual phys pages.
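For concreteness, this is how I picture steps 1-2 in libibverbs terms.
Just a minimal sketch (most error handling omitted); the comments about
what the kernel and device do internally are my reading of your
description, not something the verbs API itself exposes:

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Step 1: register a buffer by (base VA, length).  ibv_reg_mr() goes
 * to the kernel, which pins the physical pages backing [buf, buf+len)
 * and hands the device both the page list and the base VA, so the
 * device can build its private VA -> phys-page tables (step 2).
 */
int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA device\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;
    void *buf = malloc(len);   /* one contiguous VA range */
    memset(buf, 0, len);       /* fault the pages in */

    /* Kernel pins the pages; device records (base VA, len, pages). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}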
> > How would the device be able to do that?  It would require the
> > device to look at the process page tables, wouldn't it?  Isn't
> > the HW IOVA->PA translation table built by the OS?
>
> As stated above, these are tables private to the device.
> (They even have a HW-vendor-specific layout, I think, since the
> device holds some cache.)
>
> The device looks at its own private page tables, not at the
> OS ones.
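To make sure we mean the same thing by "private page tables", here is
the toy model I have in mind.  All names here are made up purely for
illustration; this is not any real driver or HW structure:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* Hypothetical per-MR table kept by the device: it maps offsets
 * within the registered (base VA, len) range to pinned physical
 * pages, so translation never consults the CPU MMU. */
struct hw_mr_xlat {
    uint64_t base_va;   /* VA given at registration time */
    size_t   npages;
    uint64_t pfn[];     /* pinned physical page frame numbers */
};

/* What the device conceptually does with a VA from a work request. */
static uint64_t hw_translate(const struct hw_mr_xlat *t, uint64_t va)
{
    uint64_t off = va - t->base_va;   /* MR-relative offset */
    return (t->pfn[off >> PAGE_SHIFT] << PAGE_SHIFT) |
           (off & ((1u << PAGE_SHIFT) - 1));
}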
I'm still confused by your statement that the device builds its own
[IOVA->PA] page table.  How would the device do that if it doesn't
have access to the CPU MMU state?  Isn't the IOVA->PA translation
table built by the OS?

> >> 3. The process executes post-send requests directly to HW,
> >>    bypassing the kernel, giving process virtual addresses in the
> >>    work requests.
> >> 4. The device uses its own page tables to translate the virtual
> >>    addresses to phys pages and sends them.
> >>
> >> Theoretically it is possible to use any contiguous IOVA instead of
> >> the process's VA, but that is not how it works today.
> >>
> >> Makes sense?
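Restating steps 3-4 in verbs terms as I understand them (again only a
sketch: post_send_by_va is a made-up helper name, and it assumes an
already-connected QP plus the buf/len/mr from the snippet above):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

/* Step 3: the work request carries the *process VA* (buf) plus the
 * lkey that ties it to the registered MR; no kernel involvement. */
static int post_send_by_va(struct ibv_qp *qp, void *buf, uint32_t len,
                           struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,  /* process virtual address, as-is */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Step 4: the device resolves sge.addr through its own private
     * VA -> phys tables (no CPU MMU involved) and DMAs the pages. */
    int ret = ibv_post_send(qp, &wr, &bad_wr);
    if (ret)
        fprintf(stderr, "ibv_post_send failed: %d\n", ret);
    return ret;
}

-- 
Eduardo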