On Tue, 9 Jan 2024 15:55:46 -0800
Hao Xiang <hao.xi...@bytedance.com> wrote:

> On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.pr...@memverge.com> 
> wrote:
> >
> > On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:  
> > > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> > > <gregory.pr...@memverge.com> wrote:  
> > > >
> > > > If you drop this line:
> > > >
> > > > -numa node,memdev=vmem0,nodeid=1  
> > >
> > > We tried this as well, and it works after going through the cxl-cli
> > > process and creating the devdax device. The problem is that without
> > > the "nodeid=1" configuration, we cannot connect it with the explicit
> > > per-NUMA-node latency/bandwidth configuration "-numa hmat-lb". I
> > > glanced at the code in hw/numa.c; parse_numa_hmat_lb() looks like
> > > the function that passes the lb information to the VM's HMAT.
> > >  
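> > > For reference, a sketch of the relevant options (node IDs and the
> > > latency/bandwidth numbers are illustrative; syntax per QEMU's -numa
> > > documentation, and hmat-lb requires -machine hmat=on):
> > >
> > >   -machine hmat=on \
> > >   -numa node,memdev=vmem0,nodeid=1 \
> > >   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=200 \
> > >   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=8G
> > >
> > > Both hmat-lb entries name a static nodeid, which is why dropping
> > > "nodeid=1" loses the association.
> > >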
> >
> > Yeah, this is what Jonathan was saying - right now there isn't a good
> > way (in QEMU) to pass the hmat/cdat stuff down through the device.
> > Needs to be plumbed out.
> >
> > In the meantime: You should just straight up drop the cxl device from
> > your QEMU config.  It doesn't actually get you anything.
> >  
> > > From what I understand so far: 1. the guest kernel will dynamically
> > > create a numa node after a cxl devdax device is created, which means
> > > we don't know the numa node until after VM boot; 2. QEMU can only
> > > statically pass the lb information to the VM at boot time. How do we
> > > connect these two things?  
> >
> > During boot, the kernel discovers all the memory regions exposed by
> > the BIOS. In this qemu configuration you have defined:
> >
> > region 0: CPU + DRAM node
> > region 1: DRAM only node
> > region 2: CXL Fixed Memory Window (the last line of the cxl stuff)
> >
> > The kernel reads this information on boot and reserves 1 numa node for
> > each of these regions.
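> >
> > A sketch of how the config lines map to those regions (memdev IDs and
> > sizes are illustrative; the cxl-fmw syntax is from QEMU's CXL docs):
> >
> >   -numa node,memdev=mem0,cpus=0-3,nodeid=0          # region 0
> >   -numa node,memdev=mem1,nodeid=1                   # region 1
> >   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G    # region 2 (CFMW)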
> >
> > The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
> > respectively.
> >
> > Node 2 sits dormant until you go through the cxl-cli startup sequence.
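> >
> > Roughly (a sketch; exact flags vary across ndctl/cxl versions, and
> > the decoder/memdev names are system-specific):
> >
> >   cxl create-region -m -d decoder0.0 -w 1 -t ram mem0
> >   daxctl reconfigure-device --mode=system-ram dax0.0
> >
> > at which point the memory comes online under the reserved node.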
> >
> >
> > What you're asking for is for the QEMU team to plumb hmat/cdat
> > information down through the type3 device.  I *think* that is presently
> > possible with a custom CDAT file - but Jonathan probably has more
> > details on that.  You'll have to go digging for answers on that one.  
> 
> I think this is exactly what I was looking for. When we started with
> the idea of having an explicit CXL memory backend, we wanted to:
> 1) Bind a virtual CXL device to an actual CXL memory node on the host.
> 2) Pass the latency/bandwidth information from the CXL backend into
> the virtual CXL device.
> I didn't have a concrete idea of how to do 2). From the discussion
> here, I learned that the information is passed via CDAT. I just
> looked into the virtual CXL code and found that
> ct3_build_cdat_entries_for_mr() is the function that builds this
> information, but the latency and bandwidth there are currently
> hard-coded. I think it makes sense to have an explicit CXL memory
> backend where QEMU can query the CXL memory attributes from the host
> and pass that information from the CXL backend into the virtual CXL
> type-3 device.
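>
> For what it's worth, the host kernel already exposes those attributes
> through the node "access" class in sysfs (paths below assume the host
> has HMAT data for node1), which is one place QEMU could read them
> from:
>
>   cat /sys/devices/system/node/node1/access0/initiators/read_latency
>   cat /sys/devices/system/node/node1/access0/initiators/read_bandwidth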

There is probably an argument for a memory backend being able to take
perf numbers in general (I don't see this as CXL specific), or for
adding more parameters to the cxl device entry, but for now you can
inject a cdat file that presents whatever you like.
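
For example (a sketch; assumes your QEMU build has the cxl-type3 cdat
property, and the file must be a valid raw CDAT table):

  -device cxl-type3,bus=root_port0,volatile-memdev=vmem0,id=cxl-vmem0,cdat=mycdat.dat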

What we are missing, though, is generic port creation, so even with
everything else in place it won't quite work yet. There was a hacky
patch for generic ports, but it's not upstream yet (nor in my tree).

Usefully, there is work under review to add generic initiators to
QEMU, most of which we can repurpose for GPs.

Jonathan
