On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xi...@bytedance.com> wrote:
>
> On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com> wrote:
> >
> > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.pr...@memverge.com> wrote:
> > > >
> > > > For a variety of performance reasons, this will not work the way you
> > > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > > virtual cxl device, and now any memory accesses to that memory region
> > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > path from the perspective of QEMU.
> > >
> > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > thought it would go through the same "guest virtual address -> guest
> > > physical address -> host physical address" translation done entirely
> > > by the CPU. But if it goes through an emulation path handled by the
> > > virtual cxl-type3 device, I agree the performance would be bad. Do you
> > > know why accessing memory on a virtual cxl-type3 device can't go
> > > through the nested page table translation?
> > >
> >
> > Because a byte-access on CXL memory can have checks on it that must be
> > emulated by the virtual device, and because there are caching
> > implications that have to be emulated as well.
>
> Interesting. Now I see cxl_type3_read/cxl_type3_write: if the CXL
> memory data path goes through them, the performance would be pretty
> problematic. We have actually run Intel's Memory Latency Checker
> benchmark from inside a guest VM with both system DRAM and a virtual
> CXL type3 device configured. The idle latency on the virtual CXL
> memory is 2x that of system DRAM, which is on par with the benchmark
> running on a physical host. I need to debug this more to understand
> why the latency is actually much better than I would expect.
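To make the concern above concrete: when a MemoryRegion is registered
as io rather than as ram, QEMU dispatches every guest load/store to
device callbacks. Below is a simplified, generic sketch of that
pattern - the names are made up and this is not the actual
hw/mem/cxl_type3.c code:

#include "qemu/osdep.h"
#include "qemu/bswap.h"
#include "exec/memory.h"

/* Hypothetical device state, for illustration only. */
typedef struct MyDevState {
    MemoryRegion mr;
    uint8_t *backing;   /* host buffer standing in for device memory */
} MyDevState;

/* Every guest load to the region traps out of KVM and lands here. */
static uint64_t my_dev_read(void *opaque, hwaddr addr, unsigned size)
{
    MyDevState *s = opaque;
    return ldn_le_p(s->backing + addr, size);
}

/* Likewise, every guest store is emulated instead of hitting RAM. */
static void my_dev_write(void *opaque, hwaddr addr, uint64_t val,
                         unsigned size)
{
    MyDevState *s = opaque;
    stn_le_p(s->backing + addr, size, val);
}

static const MemoryRegionOps my_dev_ops = {
    .read = my_dev_read,
    .write = my_dev_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* At realize time the region is registered as MMIO, not as RAM. */
static void my_dev_init_mr(MyDevState *s, Object *owner, uint64_t size)
{
    memory_region_init_io(&s->mr, owner, &my_dev_ops, s,
                          "my-dev-mem", size);
}

A ram-backed MemoryRegion, by contrast, is mapped straight into the
guest and accessed through the nested page tables with no VM exit,
which would explain DRAM-like latencies whenever the data path never
reaches callbacks like these.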
So we double-checked the benchmark testing. What we see is that
running Intel Memory Latency Checker from a guest VM with virtual CXL
memory vs. from a physical host with a CXL 1.1 memory expander gives
the same latency.

From guest VM: local socket system-DRAM latency is 117.0ns, local
socket CXL-DRAM latency is 269.4ns.
From physical host: local socket system-DRAM latency is 113.6ns,
local socket CXL-DRAM latency is 267.5ns.

I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
while running the benchmark, but those two functions are never hit.
We used the virtual CXL configuration while launching QEMU, but the
CXL memory is presented as a separate NUMA node and we are not
creating devdax devices. Does that make any difference?

> >
> > The cxl device you are using is an emulated CXL device - not a
> > virtualization interface. Nuanced difference: the emulated device has
> > to emulate *everything* that CXL device does.
> >
> > What you want is passthrough / managed access to a real device -
> > virtualization. This is not the way to accomplish that. A better way
> > to accomplish that is to simply pass the memory through as a static
> > numa node as I described.
>
> That would work, too. But I think a kernel change is required to
> establish the correct memory tiering if we go this route.
>
> >
> > >
> > > When we had a discussion with Intel, they told us not to use the KVM
> > > option in QEMU while using the virtual cxl type3 device. That's
> > > probably related to the issue you described here? We enabled KVM
> > > though but haven't seen the crash yet.
> > >
> >
> > The crash really only happens, IIRC, if code ends up hosted in that
> > memory. I forget the exact scenario, but the working theory is it has
> > to do with the way instruction caches are managed with KVM and this
> > device.
> >
> > > >
> > > > You're better off just using the `host-nodes` field of host-memory
> > > > and passing bandwidth/latency attributes through via `-numa hmat-lb`
> > >
> > > We tried this but it doesn't work end to end right now. I described
> > > the issue in another fork of this thread.
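For concreteness, the static-NUMA-node configuration being suggested
would look roughly like this - a sketch following QEMU's `-numa
hmat-lb` documentation, where the latency/bandwidth values are purely
illustrative and `host-nodes=1` assumes the host exposes the CXL
expander as host NUMA node 1:

qemu-system-x86_64 -machine q35,hmat=on \
  -m 8G -smp 2,sockets=2,maxcpus=2 \
  -object memory-backend-ram,size=4G,id=m0 \
  -object memory-backend-ram,size=4G,id=m1,host-nodes=1,policy=bind \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1,initiator=0 \
  -numa cpu,node-id=0,socket-id=0 \
  -numa cpu,node-id=0,socket-id=1 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=100G \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=270 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=30G

With a layout like that, the guest simply sees a CPU-less, slower NUMA
node with ACPI HMAT attributes and nothing CXL-specific, which matches
the "guest doesn't need to know CXL exists" point quoted next.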
> > > >
> > > > In that scenario, the guest software doesn't even need to know CXL
> > > > exists at all, it can just read the attributes of the numa node
> > > > that QEMU created for it.
> > >
> > > We thought about this before. But the current kernel implementation
> > > requires a devdax device to be probed and recognized as a slow tier
> > > (by reading the memory attributes). I don't think this can be done
> > > via the path you described. Have you tried this before?
> > >
> >
> > Right, because the memory tiering component lumps the nodes together.
> >
> > Better idea: fix the memory tiering component.
> >
> > I cc'd you on another patch thread that is discussing something
> > relevant to this:
> >
> > https://lore.kernel.org/linux-mm/87fs00njft....@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
> >
> > The point is: there's no need for this to be a dax device at all;
> > there is no need for the guest to even know what is providing the
> > memory, or for the guest to have any management access to the memory.
> > It just wants the memory and the ability to tier it.
> >
> > So we should fix the memory tiering component to work with this
> > workflow.
>
> Agreed. We really don't need the devdax device at all. I thought that
> choice was made due to the memory tiering concept being started with
> pmem ... Let's continue this part of the discussion on the above
> thread.
> >
> > ~Gregory