On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.pr...@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> > <gregory.pr...@memverge.com> wrote:
> > >
> > > If you drop this line:
> > >
> > > -numa node,memdev=vmem0,nodeid=1
> >
> > We tried this as well, and it works after going through the cxl-cli
> > process and creating the devdax device. The problem is that without
> > the "nodeid=1" configuration, we cannot connect it with the explicit
> > per-NUMA-node latency/bandwidth configuration "-numa hmat-lb". I
> > glanced at the code in hw/core/numa.c; parse_numa_hmat_lb() looks
> > like the function that passes the latency/bandwidth information to
> > the VM's HMAT.
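> >
> > For reference, this is roughly the shape of the configuration in
> > question (values are illustrative; hmat-lb also needs -machine
> > hmat=on):
> >
> >   -machine hmat=on \
> >   -numa node,memdev=vmem0,nodeid=1 \
> >   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=200 \
> >   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=8G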
> >
>
> Yeah, this is what Jonathan was saying - right now there isn't a good
> way (in QEMU) to pass the hmat/cdat stuff down through the device.
> Needs to be plumbed out.
>
> In the meantime: You should just straight up drop the cxl device from
> your QEMU config.  It doesn't actually get you anything.
>
> > From what I understand so far: 1. The guest kernel will dynamically
> > create a NUMA node after a CXL devdax device is created, which means
> > we don't know the NUMA node until after VM boot. 2. QEMU can only
> > statically pass the lb information to the VM at boot time. How do we
> > connect these two things?
>
> During boot, the kernel discovers all the memory regions exposed by
> the BIOS. In this QEMU configuration you have defined:
>
> region 0: CPU + DRAM node
> region 1: DRAM only node
> region 2: CXL Fixed Memory Window (the last line of the cxl stuff)
>
> The kernel reads this information at boot and reserves one NUMA node
> for each of these regions.
>
> The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
> respectively.
>
> Node 2 sits dormant until you go through the cxl-cli startup sequence.
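>
> Roughly (a sketch - decoder, memdev, and dax names depend on your
> topology):
>
>   cxl create-region -m -d decoder0.0 -t ram mem0
>   daxctl reconfigure-device --mode=system-ram dax0.0
>
> After that, node 2 comes online with memory.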
>
>
> What you're asking for is for the QEMU team to plumb hmat/cdat
> information down through the type3 device.  I *think* that is presently
> possible with a custom CDAT file - but Jonathan probably has more
> details on that.  You'll have to go digging for answers on that one.
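>
> If you do go digging: the type3 device accepts a raw CDAT blob via
> its cdat property, so something along these lines should work (rp0
> stands in for your root port id; custom-cdat.bin is a table you build
> yourself with the latency/bandwidth values you want):
>
>   -device cxl-type3,bus=rp0,volatile-memdev=vmem0,id=cxl-vmem0,cdat=custom-cdat.bin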

I think this is exactly what I was looking for. When we started with
the idea of having an explicit CXL memory backend, we wanted to:
1) Bind a virtual CXL device to an actual CXL memory node on the host.
2) Pass the latency/bandwidth information from the CXL backend into
the virtual CXL device.
I didn't have a concrete idea of how to do 2).
With the discussion here, I learned that the information is passed via
CDAT. I just looked into the virtual CXL code and found that
ct3_build_cdat_entries_for_mr() is the function that builds this
information, but the latency and bandwidth there are currently
hard-coded. I think it makes sense to have an explicit CXL memory
backend where QEMU can query the CXL memory attributes from the host
and pass that information from the CXL backend into the virtual CXL
type-3 device.
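
For example, the host kernel already exposes HMAT-derived attributes
in sysfs that QEMU could read (node1 here stands for whichever host
node backs the CXL memory):

  cat /sys/devices/system/node/node1/access0/initiators/read_latency    # ns
  cat /sys/devices/system/node/node1/access0/initiators/read_bandwidth  # MB/s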

>
>
> Now - even if you did that - the current state of the cxl-type3 device
> is *not what you want* because your memory accesses will be routed
> through the read/write functions in the emulated device.
>
> What Jonathan and I discussed on the other thread is how you might go
> about slimming this down to allow pass-through of the memory without the
> need for all the fluff.  This is a non-trivial refactor of the existing
> device, so I would not expect that any time soon.
>
> At the end of the day, quickest way to get-there-from-here is to just
> drop the cxl related lines from your QEMU config, and keep everything
> else.

Agreed. We need the kernel to be capable of reading the memory
attributes from HMAT and establishing the correct memory tier for
system DRAM on a CPU-less NUMA node. Currently, system DRAM is always
assumed to be in the fast tier.
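
(On a recent kernel, the tier assignment can be inspected with
something like:

  cat /sys/devices/virtual/memory_tiering/memory_tier*/nodelist

and a CPU-less system-DRAM node currently lands in the same tier as
regular DRAM.)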

>
> >
> > Assuming the same issue applies to a physical server with CXL: were
> > you able to see the host kernel get the correct latency/bandwidth
> > information for a CXL devdax device?
> >
>
> Yes, if you bring up a CXL device via cxl-cli on real hardware, the
> subsequent numa node ends up in the "lower tier" of the memory-tiering
> infrastructure.
>
> ~Gregory
