On 15.10.21 11:10, david.dai wrote:
> On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (da...@redhat.com) wrote:
>>
>> On 13.10.21 10:13, david.dai wrote:
>>> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (da...@redhat.com) wrote:
>>>>
>>>>>> virtio-mem currently relies on having a single sparse memory region (anon mmap, mmapped file, mmapped huge pages, mmapped shmem) per VM. Although we can share memory with other processes, sharing with other VMs is not intended. Instead of actually mmapping parts dynamically (which can be quite expensive), virtio-mem relies on punching holes into the backend and dynamically allocating memory/file blocks/... on access.
>>>>>>
>>>>>> So the easy way to make it work is:
>>>>>>
>>>>>> a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device memory getting managed by the buddy on a separate NUMA node.
>>>>>>
>>>>> The Linux kernel buddy system? How do we guarantee that other applications don't allocate memory from it?
>>>>
>>>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE, such that even if some other allocation ended up there, it could get migrated somewhere else.
>>>>
>>>> For example, "daxctl reconfigure-device" tries doing that by default:
>>>>
>>>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>>>>
>>>> However, I agree that we might actually want to tell the system not to use this CPU-less node as a fallback for other allocations, and that we might not want to swap out such memory, etc.
>>>>
>>>> But in the end, all that virtio-mem needs to work in the hypervisor is
>>>>
>>>> a) A sparse memmap (anonymous RAM, memfd, file)
>>>> b) A way to populate memory within that sparse memmap (e.g., on fault, using madvise(MADV_POPULATE_WRITE), fallocate())
>>>> c) A way to discard memory (madvise(MADV_DONTNEED), fallocate(FALLOC_FL_PUNCH_HOLE))
>>>>
>>>> So instead of using anonymous memory+mbind, you can also mmap a sparse file and rely on populate-on-demand. One alternative for your use case would be to create a DAX filesystem on that CXL memory (IIRC that should work) and simply provide virtio-mem with a sparse file located on that filesystem.
>>>>
>>>> Of course, you can also use some other mechanism as you might have in your approach, as long as it supports a, b, c.
>>>>
>>>>>>
>>>>>> b) (optional) allocate huge pages on that separate NUMA node.
>>>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages), *binding* the memory backend to that special NUMA node.
>>>>>>
>>>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
>>>>> How do I bind the backend memory to a NUMA node?
>>>>
>>>> I think the syntax is "policy=bind,host-nodes=X",
>>>>
>>>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5" "host-nodes=0x20", etc.
>>>>
>>>>>>
>>>>>> This will dynamically allocate memory from that special NUMA node, resulting in the virtio-mem device completely being backed by that device memory, being able to dynamically resize the memory allocation.
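To make the a), b), c) requirements quoted above concrete, here is a minimal, self-contained sketch of the three primitives against a plain memfd. It is only an illustration of the syscalls named above, not QEMU code, and it assumes Linux 5.14+ for MADV_POPULATE_WRITE (with a fallback define for older headers):

  /* Minimal sketch of the a/b/c contract: one sparse mapping, explicit
   * populate, explicit discard. Illustration only, not QEMU code. */
  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <stdio.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23              /* Linux 5.14+ */
  #endif

  int main(void)
  {
      const size_t region = 1024UL * 1024 * 1024;   /* 1 GiB sparse region */
      const size_t block  = 2UL * 1024 * 1024;      /* one 2 MiB "block"   */

      /* a) sparse memmap: nothing is backed by real RAM yet */
      int fd = memfd_create("sparse-backend", 0);
      ftruncate(fd, region);
      char *base = mmap(NULL, region, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED)
          return 1;

      /* b) populate one block up front instead of waiting for page faults
       *    (fallocate(fd, 0, 0, block) would work as well) */
      if (madvise(base, block, MADV_POPULATE_WRITE))
          perror("populate");

      /* c) discard the block again by punching a hole into the backend
       *    (madvise(MADV_DONTNEED) is the equivalent for anonymous memory) */
      if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, block))
          perror("discard");

      munmap(base, region);
      close(fd);
      return 0;
  }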
>>>>>>
>>>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs, isn't really what we want and won't work without major design changes. Also, I'm not so sure it's a very clean design: exposing memory belonging to other VMs to unrelated QEMU processes. This sounds like a serious security hole: if you managed to escalate to the QEMU process from inside the VM, you can access unrelated VM memory quite happily. You want an abstraction in-between that makes sure each VM/QEMU process only sees private memory: for example, the buddy via dax/kmem.
>>>>>>
>>>>> Hi David,
>>>>> Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
>>>>> How does the current virtio-mem dynamically attach memory to the guest, via page fault?
>>>>
>>>> Essentially you have a large sparse mmap. Within that mmap, memory is populated on demand. Instead of mmap/munmap, you perform a single large mmap and then dynamically populate/discard memory.
>>>>
>>>> Right now, memory is populated via page faults on access. This is sub-optimal when dealing with limited resources (i.e., hugetlbfs, file blocks) and you might run out of backend memory.
>>>>
>>>> I'm working on a "prealloc" mode, which will preallocate/populate the memory necessary for exposing the next block of memory to the VM, and which fails gracefully if preallocation/population fails in the case of such limited resources.
>>>>
>>>> The patch resides at:
>>>> https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
>>>>
>>>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
>>>> Author: David Hildenbrand <da...@redhat.com>
>>>> Date:   Mon Aug 2 19:51:36 2021 +0200
>>>>
>>>>     virtio-mem: support "prealloc=on" option
>>>>
>>>>     Especially for hugetlb, but also for file-based memory backends, we'd like to be able to prealloc memory, especially to make user errors less severe: crashing the VM when there are not sufficient huge pages around.
>>>>
>>>>     A common option for hugetlb will be using "reserve=off,prealloc=off" for the memory backend and "prealloc=on" for the virtio-mem device. This way, no huge pages will be reserved for the process, but we can recover if there are no actual huge pages when plugging memory.
>>>>
>>>>     Signed-off-by: David Hildenbrand <da...@redhat.com>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> David / dhildenb
>>>>
>>> Hi David,
>>>
>>> After reading the virtio-mem code, I understand what you have expressed. Please allow me to describe my understanding of virtio-mem, so that we have an aligned view.
>>>
>>> Virtio-mem:
>>> The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory growing/shrinking will not exceed this scope. memory-backend-ram has mapped anonymous memory to the whole area, but no RAM is attached because Linux has a policy to delay allocation.
>>
>> Right, but it can also be any sparse file (memory-backend-memfd, memory-backend-file).
>>
>>> When the virtio-mem driver wants to dynamically add memory to the guest, it first requests a region from the reserved memory area, then notifies the virtio-mem device to record the information (the virtio-mem device doesn't make a real memory allocation).
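As a side note on the "prealloc=on" mode quoted above: conceptually it boils down to something like the following sketch. The helper below is hypothetical and not the actual QEMU implementation; the point is that the hypervisor tries to back a block with real pages before acknowledging the guest's plug request, so an exhausted backend (e.g., hugetlbfs) surfaces as a recoverable error instead of a crash on first access.

  /* Hedged sketch of the "prealloc" idea, not QEMU's implementation: back the
   * block with real pages *before* acknowledging the guest's plug request. */
  #include <stddef.h>
  #include <sys/mman.h>
  #include <errno.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23              /* Linux 5.14+ */
  #endif

  /* Returns 0 on success, or -errno (e.g. -ENOMEM when hugetlbfs is exhausted),
   * so the device can fail the request instead of letting the guest crash later. */
  static int prealloc_block(void *block_addr, size_t block_size)
  {
      if (madvise(block_addr, block_size, MADV_POPULATE_WRITE) != 0)
          return -errno;
      return 0;
  }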
>>
>> In the upcoming prealloc=on mode I referenced, the allocation will happen before the guest is notified about success and starts using the memory.
>>
>> With vfio/mdev support, the allocation will happen nowadays already, when vfio/mdev is notified about the populated memory ranges (see RamDiscardManager). That's essentially what makes virtio-mem device passthrough work.
>>
>>> After receiving the response from the virtio-mem device, the virtio-mem driver will online the requested region and add it to the Linux page allocator. Real RAM allocation will happen via page fault when the guest CPU accesses it. Memory shrinking will be achieved by madvise().
>>
>> Right, but you could write a custom virtio-mem driver that pools this memory differently.
>>
>> Memory shrinking in the hypervisor is done using either madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
>>
>>> Questions:
>>> 1. Heterogeneous computing: memory may be accessed by CPUs on the host side and the device side. Delayed memory allocation is not suitable here. Host software (for instance, OpenCL) may allocate a buffer for the computing device to place the computing result in.
>>
>> That already works with virtio-mem plus vfio/mdev via the RamDiscardManager infrastructure introduced recently. With "prealloc=on", the delayed memory allocation can also be avoided without vfio/mdev.
>>
>>> 2. We hope to build our own page allocator in the host kernel, so it can offer a customized mmap() method to build the va->pa mapping in the MMU and IOMMU.
>>
>> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev via the RamDiscardManager. From there, you can issue whatever syscall you need to populate memory when plugging new memory blocks. All you need to support is a sparse mmap and a way to populate/discard memory. Populate/discard could be wired up in QEMU virtio-mem code as you need it.
>>
>>> 3. Some potential requirements also require our driver to manage memory, so that the page size granularity can be controlled to fit the small device IOTLB cache. CXL has a bias mode for HDM (host-managed device memory), which needs the physical address to switch bias mode between host access and device access. These tell us that having the driver manage memory is mandatory.
>>
>> I think if you write your driver in a certain way and wire it up in QEMU virtio-mem accordingly (e.g., using a new memory-backend-whatever), that shouldn't be an issue.
>>
> Thanks a lot, so let me have a try.
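To illustrate the "sparse mmap plus populate/discard" contract described above for a custom in-kernel allocator, here is a purely hypothetical sketch of what the QEMU-facing side could look like. The device node, ioctl numbers and struct are made-up placeholders (no such API exists); the only point is that the backend must let QEMU map one large sparse region once and then populate or discard sub-ranges of it on demand.

  /* Purely hypothetical sketch: the ioctls and struct below do not exist; they
   * stand in for whatever interface a custom host driver would expose so that
   * QEMU can populate/discard sub-ranges of a sparse region. */
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  struct cxl_pool_range {                  /* made-up ABI for illustration */
      uint64_t offset;                     /* offset into the sparse region */
      uint64_t size;                       /* size of the block */
  };
  #define CXL_POOL_POPULATE _IOW('x', 1, struct cxl_pool_range)   /* fictitious */
  #define CXL_POOL_DISCARD  _IOW('x', 2, struct cxl_pool_range)   /* fictitious */

  /* One large sparse mapping, set up once for the whole virtio-mem region. */
  static void *map_region(int fd, size_t region_size)
  {
      return mmap(NULL, region_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }

  /* Populate a block when the guest plugs it ... */
  static int plug_block(int fd, uint64_t offset, uint64_t size)
  {
      struct cxl_pool_range r = { .offset = offset, .size = size };
      return ioctl(fd, CXL_POOL_POPULATE, &r);
  }

  /* ... and discard it again when the guest unplugs it. */
  static int unplug_block(int fd, uint64_t offset, uint64_t size)
  {
      struct cxl_pool_range r = { .offset = offset, .size = size };
      return ioctl(fd, CXL_POOL_DISCARD, &r);
  }

Whether that interface ends up being an ioctl, a fallocate-style file operation, or something CXL-specific that also handles the bias-mode switch is up to the driver; virtio-mem only cares that the three operations (sparse map, populate, discard) exist.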
Let me know if you need some help or run into issues! Further, if we need spec extensions to handle some additional requirements, that's also not really an issue.

I certainly don't want to force you to use virtio-mem by any means. However, a "virtual pci device to dynamically attach memory to QEMU" is essentially what virtio-mem does. :) As it's already compatible with vfio/mdev and will soon have full support for dealing with limited resources (preallocation support, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE), it feels like a good fit for your use case as well, although some details are left to be figured out.

(Also, virtio-mem already solved a lot of the issues related to guest memory dumping, VM snapshotting/migration, and how to make it consumable by upper layers like libvirt -- so you would get that for almost free as well.)

--
Thanks,

David / dhildenb