On 15.10.21 11:10, david.dai wrote:
> On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (da...@redhat.com) wrote:
>>
>> On 13.10.21 10:13, david.dai wrote:
>>> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (da...@redhat.com) wrote:
>>>>
>>>>>> virtio-mem currently relies on having a single sparse memory region (anon mmap, mmapped file, mmapped huge pages, mmapped shmem) per VM. Although we can share memory with other processes, sharing with other VMs is not intended. Instead of actually mmapping parts dynamically (which can be quite expensive), virtio-mem relies on punching holes into the backend and dynamically allocating memory/file blocks/... on access.
>>>>>>
>>>>>> So the easy way to make it work is:
>>>>>>
>>>>>> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device memory getting managed by the buddy on a separate NUMA node.
>>>>>>
>>>>> The Linux kernel buddy system? How do we guarantee that other applications don't allocate memory from it?
>>>>
>>>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE, such that even if some other allocation ended up there, it could get migrated somewhere else.
>>>>
>>>> For example, "daxctl reconfigure-device" tries doing that by default:
>>>>
>>>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>>>>
>>>> However, I agree that we might actually want to tell the system not to use this CPU-less node as a fallback for other allocations, and that we might not want to swap out such memory, etc.
>>>>
>>>> But in the end, all that virtio-mem needs to work in the hypervisor is
>>>>
>>>> a) A sparse memmap (anonymous RAM, memfd, file)
>>>> b) A way to populate memory within that sparse memmap (e.g., on fault, using madvise(MADV_POPULATE_WRITE), fallocate())
>>>> c) A way to discard memory (madvise(MADV_DONTNEED), fallocate(FALLOC_FL_PUNCH_HOLE))
>>>>
>>>> So instead of using anonymous memory+mbind, you can also mmap a sparse file and rely on populate-on-demand. One alternative for your use case would be to create a DAX filesystem on that CXL memory (IIRC that should work) and simply provide virtio-mem with a sparse file located on that filesystem.
>>>>
>>>> Of course, you can also use some other mechanism as you might have in your approach, as long as it supports a, b, c.
>>>>
>>>>>>
>>>>>> b) (optional) allocate huge pages on that separate NUMA node.
>>>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages), *binding* the memory backend to that special NUMA node.
>>>>>>
>>>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
>>>>> How do I bind the backend memory to a NUMA node?
>>>>
>>>> I think the syntax is "policy=bind,host-nodes=X",
>>>>
>>>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5" "host-nodes=0x20", etc.
>>>>
>>>>>>
>>>>>> This will dynamically allocate memory from that special NUMA node, resulting in the virtio-mem device completely being backed by that device memory, being able to dynamically resize the memory allocation.
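To make the a), b), c) requirements quoted above concrete, here is a minimal, self-contained sketch of the three primitives against a plain memfd. It is only an illustration of the syscalls named above, not QEMU code, and it assumes Linux 5.14+ for MADV_POPULATE_WRITE (with a fallback define for older headers):

  /* Minimal sketch of the a/b/c contract: one sparse mapping, explicit
   * populate, explicit discard. Illustration only, not QEMU code. */
  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <stdio.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23              /* Linux 5.14+ */
  #endif

  int main(void)
  {
      const size_t region = 1024UL * 1024 * 1024;   /* 1 GiB sparse region */
      const size_t block  = 2UL * 1024 * 1024;      /* one 2 MiB "block"   */

      /* a) sparse memmap: nothing is backed by real RAM yet */
      int fd = memfd_create("sparse-backend", 0);
      ftruncate(fd, region);
      char *base = mmap(NULL, region, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED)
          return 1;

      /* b) populate one block up front instead of waiting for page faults
       *    (fallocate(fd, 0, 0, block) would work as well) */
      if (madvise(base, block, MADV_POPULATE_WRITE))
          perror("populate");

      /* c) discard the block again by punching a hole into the backend
       *    (madvise(MADV_DONTNEED) is the equivalent for anonymous memory) */
      if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, block))
          perror("discard");

      munmap(base, region);
      close(fd);
      return 0;
  }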
>>>>>>
>>>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs, isn't really what we want and won't work without major design changes. Also, I'm not so sure it's a very clean design: exposing memory belonging to other VMs to unrelated QEMU processes. This sounds like a serious security hole: if you managed to escalate to the QEMU process from inside the VM, you can access unrelated VM memory quite happily. You want an abstraction in-between that makes sure each VM/QEMU process only sees private memory: for example, the buddy via dax/kmem.
>>>>>>
>>>>> Hi David,
>>>>> Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
>>>>> How does the current virtio-mem dynamically attach memory to the guest, via page fault?
>>>>
>>>> Essentially you have a large sparse mmap. Within that mmap, memory is populated on demand. Instead of mmap/munmap, you perform a single large mmap and then dynamically populate/discard memory.
>>>>
>>>> Right now, memory is populated via page faults on access. This is sub-optimal when dealing with limited resources (i.e., hugetlbfs, file blocks) and you might run out of backend memory.
>>>>
>>>> I'm working on a "prealloc" mode, which will preallocate/populate the memory necessary for exposing the next block of memory to the VM, and which fails gracefully if preallocation/population fails in the case of such limited resources.
>>>>
>>>> The patch resides at:
>>>> https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
>>>>
>>>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
>>>> Author: David Hildenbrand <da...@redhat.com>
>>>> Date:   Mon Aug 2 19:51:36 2021 +0200
>>>>
>>>>     virtio-mem: support "prealloc=on" option
>>>>
>>>>     Especially for hugetlb, but also for file-based memory backends, we'd like to be able to prealloc memory, especially to make user errors less severe: crashing the VM when there are not sufficient huge pages around.
>>>>
>>>>     A common option for hugetlb will be using "reserve=off,prealloc=off" for the memory backend and "prealloc=on" for the virtio-mem device. This way, no huge pages will be reserved for the process, but we can recover if there are no actual huge pages when plugging memory.
>>>>
>>>>     Signed-off-by: David Hildenbrand <da...@redhat.com>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> David / dhildenb
>>>>
>>> Hi David,
>>>
>>> After reading the virtio-mem code, I understand what you have expressed. Please allow me to describe my understanding of virtio-mem, so that we have an aligned view.
>>>
>>> Virtio-mem:
>>> The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory growing/shrinking will not exceed this scope. memory-backend-ram has mapped anonymous memory to the whole area, but no RAM is attached because Linux has a policy to delay allocation.
>>
>> Right, but it can also be any sparse file (memory-backend-memfd, memory-backend-file).
>>
>>> When the virtio-mem driver wants to dynamically add memory to the guest, it first requests a region from the reserved memory area, then notifies the virtio-mem device to record the information (the virtio-mem device doesn't make a real memory allocation).
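As a side note on the "prealloc=on" mode quoted above: conceptually it boils down to something like the following sketch. The helper below is hypothetical and not the actual QEMU implementation; the point is that the hypervisor tries to back a block with real pages before acknowledging the guest's plug request, so an exhausted backend (e.g., hugetlbfs) surfaces as a recoverable error instead of a crash on first access.

  /* Hedged sketch of the "prealloc" idea, not QEMU's implementation: back the
   * block with real pages *before* acknowledging the guest's plug request. */
  #include <stddef.h>
  #include <sys/mman.h>
  #include <errno.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23              /* Linux 5.14+ */
  #endif

  /* Returns 0 on success, or -errno (e.g. -ENOMEM when hugetlbfs is exhausted),
   * so the device can fail the request instead of letting the guest crash later. */
  static int prealloc_block(void *block_addr, size_t block_size)
  {
      if (madvise(block_addr, block_size, MADV_POPULATE_WRITE) != 0)
          return -errno;
      return 0;
  }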
>>
>> In the upcoming prealloc=on mode I referenced, the allocation will happen before the guest is notified about success and starts using the memory.
>>
>> With vfio/mdev support, the allocation will happen nowadays already, when vfio/mdev is notified about the populated memory ranges (see RamDiscardManager). That's essentially what makes virtio-mem device passthrough work.
>>
>>> After receiving the response from the virtio-mem device, the virtio-mem driver will online the requested region and add it to the Linux page allocator. Real RAM allocation will happen via page fault when the guest CPU accesses it. Memory shrinking will be achieved by madvise().
>>
>> Right, but you could write a custom virtio-mem driver that pools this memory differently.
>>
>> Memory shrinking in the hypervisor is done using either madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
>>
>>> Questions:
>>> 1. Heterogeneous computing: memory may be accessed by CPUs on the host side and the device side. Delayed memory allocation is not suitable here. Host software (for instance, OpenCL) may allocate a buffer for the computing device to place the computing result in.
>>
>> That already works with virtio-mem plus vfio/mdev via the RamDiscardManager infrastructure introduced recently. With "prealloc=on", the delayed memory allocation can also be avoided without vfio/mdev.
>>
>>> 2. We hope to build our own page allocator in the host kernel, so it can offer a customized mmap() method to build the va->pa mapping in the MMU and IOMMU.
>>
>> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev via the RamDiscardManager. From there, you can issue whatever syscall you need to populate memory when plugging new memory blocks. All you need to support is a sparse mmap and a way to populate/discard memory. Populate/discard could be wired up in QEMU virtio-mem code as you need it.
>>
>>> 3. Some potential requirements also require our driver to manage memory, so that the page size granularity can be controlled to fit the small device IOTLB cache. CXL has a bias mode for HDM (host-managed device memory), which needs the physical address to switch bias mode between host access and device access. These tell us that having the driver manage memory is mandatory.
>>
>> I think if you write your driver in a certain way and wire it up in QEMU virtio-mem accordingly (e.g., using a new memory-backend-whatever), that shouldn't be an issue.
>>
> Thanks a lot, so let me have a try.
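To illustrate the "sparse mmap plus populate/discard" contract described above for a custom in-kernel allocator, here is a purely hypothetical sketch of what the QEMU-facing side could look like. The device node, ioctl numbers and struct are made-up placeholders (no such API exists); the only point is that the backend must let QEMU map one large sparse region once and then populate or discard sub-ranges of it on demand.

  /* Purely hypothetical sketch: the ioctls and struct below do not exist; they
   * stand in for whatever interface a custom host driver would expose so that
   * QEMU can populate/discard sub-ranges of a sparse region. */
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  struct cxl_pool_range {                  /* made-up ABI for illustration */
      uint64_t offset;                     /* offset into the sparse region */
      uint64_t size;                       /* size of the block */
  };
  #define CXL_POOL_POPULATE _IOW('x', 1, struct cxl_pool_range)   /* fictitious */
  #define CXL_POOL_DISCARD  _IOW('x', 2, struct cxl_pool_range)   /* fictitious */

  /* One large sparse mapping, set up once for the whole virtio-mem region. */
  static void *map_region(int fd, size_t region_size)
  {
      return mmap(NULL, region_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }

  /* Populate a block when the guest plugs it ... */
  static int plug_block(int fd, uint64_t offset, uint64_t size)
  {
      struct cxl_pool_range r = { .offset = offset, .size = size };
      return ioctl(fd, CXL_POOL_POPULATE, &r);
  }

  /* ... and discard it again when the guest unplugs it. */
  static int unplug_block(int fd, uint64_t offset, uint64_t size)
  {
      struct cxl_pool_range r = { .offset = offset, .size = size };
      return ioctl(fd, CXL_POOL_DISCARD, &r);
  }

Whether that interface ends up being an ioctl, a fallocate-style file operation, or something CXL-specific that also handles the bias-mode switch is up to the driver; virtio-mem only cares that the three operations (sparse map, populate, discard) exist.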
Let me know if you need some help or run into issues! Further, if we need spec extensions to handle some additional requirements, that's also not really an issue.

I certainly don't want to force you to use virtio-mem by any means. However, a "virtual pci device to dynamically attach memory to QEMU" is essentially what virtio-mem does. :) As it's already compatible with vfio/mdev and will soon have full support for dealing with limited resources (preallocation support, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE), it feels like a good fit for your use case as well, although some details are left to be figured out.

(Also, virtio-mem already solved a lot of the issues related to guest memory dumping, VM snapshotting/migration, and how to make it consumable by upper layers like libvirt -- so you would get that for almost free as well.)

--
Thanks,

David / dhildenb