On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.pr...@memverge.com> wrote:
>
> On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xi...@bytedance.com> wrote:
> > >
> > > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com>
> > > wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price
> > > > > <gregory.pr...@memverge.com> wrote:
> > > > > >
> > > > > > For a variety of performance reasons, this will not work the way
> > > > > > you want it to. You are essentially telling QEMU to map the vmem0
> > > > > > into a virtual cxl device, and now any memory accesses to that
> > > > > > memory region will end up going through the cxl-type3 device
> > > > > > logic - which is an IO path from the perspective of QEMU.
> > > > >
> > > > > I didn't understand exactly how the virtual cxl-type3 device works.
> > > > > I thought it would go through the same "guest virtual address ->
> > > > > guest physical address -> host physical address" translation done
> > > > > entirely by the CPU. But if it goes through an emulation path
> > > > > handled by the virtual cxl-type3 device, I agree the performance
> > > > > would be bad. Do you know why accessing memory on a virtual
> > > > > cxl-type3 device can't use the nested page table translation?
> > > > >
> > > > Because a byte-access on CXL memory can have checks on it that must be
> > > > emulated by the virtual device, and because there are caching
> > > > implications that have to be emulated as well.
> > >
> > > Interesting. Now that I see cxl_type3_read/cxl_type3_write: if the CXL
> > > memory data path goes through them, the performance would be pretty
> > > problematic. We have actually run Intel's Memory Latency Checker
> > > benchmark from inside a guest VM with both system DRAM and virtual
> > > CXL type3 memory configured. The idle latency on the virtual CXL
> > > memory is 2X that of system DRAM, which is on par with the benchmark
> > > run on a physical host. I need to debug this more to understand why
> > > the latency is actually much better than I would expect.
> >
> > So we double-checked the benchmark results. What we see is that running
> > Intel Memory Latency Checker from a guest VM with virtual CXL memory
> > and from a physical host with a CXL 1.1 memory expander gives the same
> > latency.
> >
> > From the guest VM: local-socket system-DRAM latency is 117.0ns,
> > local-socket CXL-DRAM latency is 269.4ns.
> > From the physical host: local-socket system-DRAM latency is 113.6ns,
> > local-socket CXL-DRAM latency is 267.5ns.
> >
> > I also set debugger breakpoints on cxl_type3_read/cxl_type3_write while
> > running the benchmark, but those two functions are never hit. We used
> > the virtual CXL configuration while launching QEMU, but the CXL memory
> > is presented as a separate NUMA node and we are not creating devdax
> > devices. Does that make any difference?
> >
>
> Could you possibly share your full QEMU configuration and what OS/kernel
> you are running inside the guest?
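Before the config: the breakpoint check I mentioned above was roughly the
following - just a sketch, assuming QEMU is built with debug symbols, is run
under gdb, and still has the cxl_type3_read/cxl_type3_write symbols from
hw/mem/cxl_type3.c (the taskset/numactl wrappers are dropped here for brevity):

gdb --args ${QEMU} <same arguments as in the command below>
(gdb) break cxl_type3_read    # MMIO read handler of the emulated type3 device
(gdb) break cxl_type3_write   # MMIO write handler of the emulated type3 device
(gdb) run

Neither breakpoint fires while Memory Latency Checker runs inside the guest,
which matches the latency numbers quoted above.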
Sounds like the technical details are explained on the other thread. From
what I understand now, if we don't go through a complex CXL setup, it
wouldn't go through the emulation path. Here is our exact setup. The guest
runs Linux kernel 6.6-rc2.

taskset --cpu-list 0-47,96-143 \
numactl -N 0 -m 0 ${QEMU} \
-M q35,cxl=on,hmat=on \
-m 64G \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-ram,id=ram0,size=45G \
-numa node,memdev=ram0,cpus=0-7,nodeid=0 \
-msg timestamp=on -L /usr/share/seabios \
-enable-kvm \
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-numa node,memdev=vmem0,nodeid=1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=14 \
-numa dist,src=1,dst=0,val=14 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-bandwidth,bandwidth=262100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-bandwidth,bandwidth=30000M \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-bandwidth,bandwidth=176100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-bandwidth,bandwidth=30000M \
-drive file="${DISK_IMG}",format=qcow2 \
-device pci-bridge,chassis_nr=3,id=pci.3,bus=pcie.0,addr=0xd \
-netdev tap,id=vm-sk-tap22,ifname=tap22,script=/usr/local/etc/qemu-ifup,downscript=no \
-device virtio-net-pci,netdev=vm-sk-tap22,id=net0,mac=02:11:17:01:7e:33,bus=pci.3,addr=0x1,bootindex=3 \
-serial mon:stdio

>
> The only thing I'm surprised by is that the numa node appears without
> requiring the driver to generate the NUMA node. It's possible I missed
> a QEMU update that allows this.
>
> ~Gregory
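On the node appearing without the driver: as far as I can tell, it is because
we also pass the same backend to "-numa node,memdev=vmem0,nodeid=1", so QEMU
describes that memory to the guest as an ordinary NUMA node and the kernel
onlines it at boot without any cxl/dax enumeration. A quick way to check this
from inside the guest - a rough sketch, assuming numactl, daxctl, and Intel
MLC happen to be installed in the guest image:

numactl --hardware     # node 1 should show the 19G range with no CPUs attached
daxctl list            # no devdax devices have been created
mlc --latency_matrix   # idle latency from the node 0 CPUs to each memory node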