On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote:
> On Tue, 21 Mar 2023 21:50:33 -0400
> Gregory Price <gregory.pr...@memverge.com> wrote:
> 
> > 
> > Ambiguity #1:
> > 
> > * An SLD contains 1 Logical Device.
> > * An MH-SLD presents multiple SLDs, one per head.
> > 
> > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> > definition of LD, but not according to the definition of MLD, or MH-MLD.
> 
> I'd go with 'sort of'.  SLD is a presentation of a device to a host.
> It can be a normal single headed MLD that has been plugged directly into a 
> host.
> 
> So for extra fun points you can have one MH-MLD that has some ports connected
> to switches and other directly to hosts. Thus it can present as SLD on some
> upstream ports and as MLD on others.
>

I suppose this section of the email was really just to point out that
what constitutes a "multi-headed", "logical", and "multi-logical"
device is rather confusing from just reading the spec.  Since writing
this, I've kind of settled on:

MH-* - anything with multiple heads, regardless of how it works
SLD - one LD per head, but LD does not imply any particular command set
MLD - multiple LDs per head, but those LDs may only attach to one head
DCD - anything can technically be a DCD if it implements the commands

Trying to figure out, from the spec, what commands an MH-SLD should
implement to be "spec compliant" was my frustration.  It's somewhat
clear now that the answer is "technically nothing... unless it's an MLD".

> > I want to make very close note of this.  SLDs are managed like SLDs,
> > MLDs are managed like MLDs.  MH-SLDs, according to this, should be
> > managed like SLDs from the perspective of each host.
> 
> True, but an MH-MLD device connected directly to a host will also 
> be managed (at some level anyway) as an SLD on that particular port.
>

The ambiguous part is... what commands relate specifically to an SLD?
The spec isn't really written that way, and the answer is that an SLD is
defined more by the absence of other functionality (specifically MLD
functionality) than by any command set of its own.

i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
MLD, and DCD all do (at least in theory).

> > 
> > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> > that MH-SLDs (may) exist.  That's frustrating to say the least, but I
> > suppose we can gather from context that MH-SLD's *MAY NOT* have LD
> > management controls.
> 
> Hmm. In theory you could have an MH-SLD that used a config from flash or 
> similar
> but that would be odd.  We need some level of dynamic control to make these
> devices useful.  Doesn't mean the spec should exclude dumb devices, but
> we shouldn't concentrate on them for emulation.
> 
> One possible usecase would be a device that always shares all its memory on
> all ports. Yuk.
> 

I can say that the earliest forms of MH-SLD, and certainly pre-DCD devices,
are likely to present all memory on all ports, and potentially provide some
custom commands to help hosts enforce exclusivity.

It's beyond the spec, but this can actually be emulated today with the
MH-SLD setup I describe below.  I certainly expected a yuk factor in
proposing it, but I think the reality is that, on the path to 3.0 and DCD
devices, we should at least entertain the idea that someone will probably
do this with real hardware.

> > For the simplest MH-SLD device, these fields would be immutable, and
> > there would be a single LD for each head, where head_id == ld_id.
> 
> Agreed.
> 
> > 
> > So summarizing, what I took away from this was the following:
> > 
> > In the simplest form of MH-SLD, there is neither a switch, nor is
> > there LD management.  So, presumably, we don't HAVE to implement the
> > MHD commands to say we "have MH-SLD support".
> 
> Whilst theoretically possible - I don't think such a device is interesting.
> Minimum I'd want to see is something with multiple upstream SLD ports
> and a management LD with appropriate interface to poke it.
> 
>
> The MLD side of things is interesting only once we support MLDs in general
> in QEMU CXL emulation and even then they are near invisible to a host
> and are more interesting for emulating fabric management.
> 
> What you may want to do is take Fan's work on DCD and look at doing
> a simple MH-SLD device that uses the same cheat of just using QMP commands
> to do the configuration.  That's an intermediate step to us getting
> the FM-API and similar commands implemented.
> 

I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while
not having to worry about the complexity of MLD and switches.

(I have not gotten the chance to review the DCD patch set yet - it's on
my list for after ISC'23 - but I presume this is what has been done.)

My thoughts would be that you would have something like the following:

-device ct3d,... etc etc
-device cxl-dcd,type3-backend=mem0,manager=true

The manager would be the owner of the FM-Owned LD, and therefore the
system responsible for managing requests for memory.

How we pass those messages between instances is then an exercise for the
reader.


What I have been doing is just creating a shared memory region with
ipcmk and using a separate program to initialize that shared state before
launching the guests.  I'll talk about this a little further down.


> > 
> > ... snip ...
> > 
> > 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
> 
> I'd do this + DCD.
> 

I concur, and it's what I was looking into next.

I think your other notes on MH-* with switches are really where I was
left scratching my head.

When I look at Switch/MLD functionality vs DCD, I have a gut feeling the
vast majority of early device vendors are going to skip right over
switches and MLD setups and go directly to MH-SLD+DCD.

> > =================================
> > 2. MH-SLD No Switch, No Pool CCI.
> > =================================
> > 
> > But it's also not very useful.  You can only use the memory in devdax
> > mode, since it's a shared memory region. You could already do this via
> > the /dev/shm interface, so it's not even new functionality.
> > 
> > In theory you could build a pooling service in software-only on top of
> > memory blocks. That's an exercise left to the reader.
> 
> Yeah. Let's not do this step.
> 

Too late :].  It was useful as a learning exercise, but it's definitely
not upstream quality.  I may post it for the sake of the playground, but
I too would recommend against this method of pooling in the long term.

I made a proto-DCD command set that was reachable from each memdev
character device, and exposed it to every QEMU instance as part of ct3d
(I'm still learning the QEMU ecosystem, so it was easier to bodge it in
than to make a new device and link it up).

Then I created a shared memory region with ipcmk, and implemented a
simple mutex in the space, as well as all the record keeping needed to
manage sections/extents.

> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1
> > 
> > ./cxl_mhd_init would simply setup the nr_heads/lds field and such
> > and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
> > are static (head_id==ld_id).
> > ... snip ...
> >
> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid
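
Roughly, the init program and the shared state it seeds look something like
the sketch below.  nr_heads/lds and the static ldmap are from the description
above; everything else (the struct name, the section_owner bookkeeping, the
hard-coded nr_sections) is illustrative only, not from the spec or from any
existing QEMU code:

/* cxl_mhd_init.c - attach to the ipcmk-created segment and seed the shared
 * MHD state before any of the QEMU instances are launched. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MHD_SHM_SIZE 4096            /* matches ipcmk -M 4096 */
#define MHD_MAX_LDS  32

struct cxl_mhd_state {
    pthread_mutex_t lock;            /* process-shared, serializes everything */
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[MHD_MAX_LDS];      /* head -> LD, static head_id == ld_id */
    uint32_t nr_sections;
    uint8_t section_owner[];         /* 0 = free, otherwise head_id + 1 */
};

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <nr_heads> <shmid>\n", argv[0]);
        return 1;
    }
    int nr_heads = atoi(argv[1]);
    int shmid = atoi(argv[2]);

    struct cxl_mhd_state *st = shmat(shmid, NULL, 0);
    if (st == (void *)-1) {
        perror("shmat");
        return 1;
    }
    memset(st, 0, MHD_SHM_SIZE);

    st->nr_heads = nr_heads;
    st->nr_lds = nr_heads;
    for (int i = 0; i < nr_heads; i++) {
        st->ldmap[i] = i;            /* static head-to-LD mapping */
    }
    /* In the prototype this would come from memdev size / section size;
     * hard-coded here just to keep the sketch self-contained. */
    st->nr_sections = 64;

    /* The mutex lives in the segment, so it must be process-shared for
     * every QEMU instance to be able to take it. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&st->lock, &attr);

    return shmdt(st) ? 1 : 0;
}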

The last step was a few extra lines in the read/write functions to
ensure accesses to "valid addresses" that "aren't allocated" produce
errors.
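
The guard itself amounts to something like this (names are mine again, and
it assumes the cxl_mhd_state layout sketched above; in the prototype it sits
inline in the ct3d read/write paths rather than standing alone):

#include <stdbool.h>
#include <stdint.h>

/* Called from the read/write paths: the DPA is "valid" (inside the device),
 * but access is refused unless the section it falls in has been allocated
 * to this head. */
static bool mhd_access_ok(struct cxl_mhd_state *st, uint64_t dpa,
                          uint64_t section_size, uint8_t head_id)
{
    uint64_t section = dpa / section_size;

    if (section >= st->nr_sections) {
        return false;                        /* past the tracked sections */
    }
    return st->section_owner[section] == head_id + 1;   /* 0 means free */
}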

At this point, each guest is basically capable of using the device to do
the coordination for you by simply calling the allocate/deallocate
functions.

And that's it, you've got pooling.  Each guest sees the full extent of
the entire device, but must ask the device for access to a given
section, and the section can be translated into a memory block number
under the given NUMA node.
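
The translation is just arithmetic once you know where the region lands in
the host physical address space - something like the following, assuming no
interleave and a section size that is a multiple of the kernel's memory
block size (function name and arguments are made up):

#include <stdint.h>

/* Map an allocated device section to the Linux memory block number the
 * host would online, i.e. /sys/devices/system/memory/memoryN. */
static uint64_t section_to_memblock(uint64_t region_base, uint64_t section,
                                    uint64_t section_size,
                                    uint64_t memblock_size)
{
    uint64_t hpa = region_base + section * section_size;

    return hpa / memblock_size;
}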


OK, now let's talk about why this is a bad idea and why you shouldn't do
it this way:

* Technically, a number of BIOS/hardware interleave features can bite
  you pretty hard when you assume that memory blocks map to physically
  contiguous hardware addresses.  However, that assumption holds if you
  simply don't turn those options on, so it might still be useful as an
  early-adopter platform.


* The security posture of a device like this is bad.  It requires each
  attached host to clear the memory before releasing it.  There isn't
  really a good way to do this in NUMA mode, so you would have to
  implement custom firmware commands to ensure it happens, and that
  means custom drivers, blah blah blah - not great.

  Basically you're trusting each host to play nice.  Not great.
  But potentially useful for early adopters regardless.


* General compatibility and staying in-spec - this design requires a
  number of non-spec extensions, so it's just generally not recommended,
  certainly not here in QEMU.

> 
> A few different moving parts are needed and I think we'd end up with
> something that looks like
> 
> -device cxl-mhd,volatile-memdev=mem0,id=backend
> -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
> 
> dev1 provides the tunneling interface, but the actual implementation of
> the pool CCI and actual memory mappings is in the backend. Note that backend
> might be proxy to an external process, or a client/server approach between 
> multiple
> QEMU instances.

I've hummed and hawed over an external process vs another QEMU instance
and I still haven't come to a satisfying answer here.  It feels extremely
heavy-handed to use an entirely separate QEMU instance just for this,
but there's nothing to say you can't just host it in one of the
head-attached instances.

I basically skipped this and allowed each instance to send the commands
itself, but serialized them with a mutex.  That way each instance can
operate cleanly without directly coordinating with the others.  I could
see a vendor implementing it this way on early devices.
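
Concretely, each instance just does something like the following against the
shared state sketched earlier (hypothetical names, error handling omitted):

#include <pthread.h>
#include <stdint.h>

/* Allocate command serviced locally by each instance: take the
 * process-shared mutex, claim the first free section for this head,
 * release.  No cross-instance messaging required. */
static int mhd_alloc_section(struct cxl_mhd_state *st, uint8_t head_id)
{
    int ret = -1;

    pthread_mutex_lock(&st->lock);
    for (uint32_t i = 0; i < st->nr_sections; i++) {
        if (st->section_owner[i] == 0) {         /* 0 means free */
            st->section_owner[i] = head_id + 1;
            ret = i;
            break;
        }
    }
    pthread_mutex_unlock(&st->lock);

    return ret;                                  /* section index, or -1 */
}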

I don't have a good answer for this yet, but maybe once I review the DCD
patch set I'll have more opinions.

> 
> or squish some parts and make a more extensible type3 device and have.
> 
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
> 

I originally went this route, but the downside of this is "What happens
when the main dies and has to restart?"  There's all kinds of badness
associated with that.  It's why I moved the shared state into a
separately created ipcmk region.

> 
> To my mind there are a series of steps and questions here.
> 
> Which 'hotplug model'.
> 1) LD model for moving capacity
>   - If doing LD model, do MLDs and configurable switches first. Needed as a
>     step along the path anyway.  Deal with all the mess that brings and come
>     back to MHD - as you note it only makes sense with a switch in the path,
>     so MLDs are a subset of the functionality anyway.
> 
> 2) DCD model for moving capacity
>   - MH-SLD with a pool CCI used to do DCD operations on the LDs.  Extension of
>     what Fan Ni is looking at.  He's making an SLD pretend to be a device
>     where DCD makes sense - whilst still using the CXL type 3 device. We
>     probably shouldn't do that without figuring out how to do an MH-SLD - or
>     at least a head that we intend to hang this new stuff off - potentially
>     just using the existing type 3 device with more parameters as one of the
>     MH-SLD heads that doesn't have the control interface and different
>     parameters if it does have the tunnel to the Pool CCI.
> 

Personally I think we should focus on the DCD model.  In fact, I think
we're already very close to this, as my personal prototype showed this
can work fairly cleanly, and I imagine I'll have a quick MHD patch set
once I get the chance to review the DCD patch set.

If I'm being honest, the more I look at the LD model, the less I
like it, but I understand that's how scale is going to be achieved.  I
don't know if focusing on that design right now is going to produce
adoption in the short term, since we're not likely to see those devices
for a few years.

MH-SLD+DCD is likely to show up much sooner, so I will target that.

~Gregory
