On Tue, Jan 2, 2024 at 5:04 AM David Hildenbrand <da...@redhat.com> wrote:
>
> On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> > -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
>
> You can achieve the exact same thing already simply by using memory
> policies and detecting the node(s) before calling QEMU, no?
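For context, the host-side alternative David refers to could be sketched as follows. The node number and the use of the memory backend's host-nodes/policy properties are illustrative assumptions, not part of the proposed patch:

```shell
# Sketch: detect the CXL-backed host NUMA node before launching QEMU,
# then bind the backend's allocation to it with a memory policy.
# Node 2 is an assumed CXL RAM node on this host; adjust per machine.
CXL_NODE=2

qemu-system-x86_64 \
  -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${CXL_NODE},policy=bind \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
  ...
```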
Yes, I agree this can be done with a memory policy that binds to the CXL memory NUMA nodes on the host.

> There has to be a good reason to add such a shortcut into QEMU, and it
> should be spelled out here.

Our end goal here is to enable CXL memory in the guest VM and have the guest kernel place the CXL memory in the correct memory tier (the slow tier) of the Linux kernel's tiered memory system. Here is what we observed:

* The kernel tiered memory system relies on the memory attributes (read/write latency and bandwidth, from ACPI) to distinguish the fast tier from the slow tier.
* The kernel tiered memory system has two paths for assigning memory to a tier: 1) during mm subsystem init, in memory_tier_init(); 2) during kmem driver device probe, in dev_dax_kmem_probe(). Since the ACPI subsystem is initialized after mm, reading the memory attributes from ACPI can only be done in 2). CXL memory therefore has to be presented as a devdax device, which the kmem driver in the guest can then probe and place in the slow tier.

We do see that QEMU has the "-numa hmat-lb" option to set the memory attributes per guest NUMA node. The problem is that setting the memory attributes per NUMA node assumes the node is created during guest kernel initialization, whereas a CXL devdax device can only be created after kernel initialization, and new NUMA nodes are created for the CXL devdax devices at that point. The guest kernel does not read the memory attributes from "-numa hmat-lb" for the CXL devdax devices.

So we thought that if we create an explicit CXL memory backend and associate it with the virtual CXL type-3 frontend, we can pass the CXL memory attributes from the host into the guest VM and avoid using a memory policy and "-numa hmat-lb", thus simplifying the configuration. We are still figuring out exactly how to pass the memory attributes from the CXL backend into the VM. There is probably a "-numa hmat-lb"-based solution for the CXL devdax devices as well, and we are also looking into it.
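For reference, a minimal sketch of how "-numa hmat-lb" is normally used to attach latency/bandwidth attributes to guest NUMA nodes at boot; the node layout and all numbers are illustrative assumptions, and as noted above this only covers nodes that exist at guest boot, not nodes created later for CXL devdax devices:

```shell
# Illustrative values only; HMAT generation requires -machine hmat=on.
# Node 0 is the "fast" node, node 1 the "slow" one, both seen from
# initiator node 0 (the node holding the CPUs).
qemu-system-x86_64 \
  -machine hmat=on \
  -smp 2 -m 2G \
  -object memory-backend-ram,size=1G,id=m0 \
  -object memory-backend-ram,size=1G,id=m1 \
  -numa node,nodeid=0,memdev=m0,cpus=0-1,initiator=0 \
  -numa node,nodeid=1,memdev=m1,initiator=0 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=100G \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=200 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=10G \
  ...
```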