On Thu, 12 Dec 2024 14:52:02 -0500 Gregory Price <gou...@gourry.net> wrote:
> On Thu, Dec 12, 2024 at 05:40:16PM +0000, Jonathan Cameron via wrote:
> > On Fri, 18 Oct 2024 12:12:52 -0400
> > Gregory Price <gou...@gourry.net> wrote:
> > 
> > > From: Svetly Todorov <svetly.todo...@memverge.com>
> > > 
> > > The shared state file only needs to be initialized once. Even if a guest
> > > dies without clearing the ownership bits associated with its head-ID,
> > > future guests with that ID will clear those bits in cxl_mhsld_realize(),
> > > regardless of whether mhd_init is true or false.
> > 
> > That sounds like a race condition if not all hosts are brought
> > up before the first add.
> 
> We weighed this against having to do an external setup like
> 
> # SHMID = ./create_sharedmem.sh
> # ./launch_qemu --shmid=$SHMID
> 
> which is what the original non-generalized prototype did.
> 
> So yeah, there's a race condition AND a footgun (setting init AFTER
> qemu instances are already using the memory will blow the state away).
> 
> This was intended.
> 
> As you allude to in the next chunk - the only real way to get around
> this is to have an entirely external process serialize access.
> 
> > > The following command line options create an MHSLD with 4GB of
> > > backing memory, whose state is tracked in /dev/shm/mhd_metadata.
> > > --mhd-init=true tells this instance to initialize the state as
> > > described above.
> > > 
> > > ./qemu-system-x86_64 \
> > >   [... other options ...] \
> > >   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > >   -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> > >   -object memory-backend-ram,id=mem0,size=4G \
> > >   -device cxl-mhsld,bus=rp0,num-dc-regions=1,volatile-dc-memdev=mem0,id=cxl-mem0,sn=66667,mhd-head=0,mhd-state_file=mhd_metadata,mhd-init=true \
> > >   -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G \
> > >   -qmp unix:/tmp/qmp-sock-1,server,nowait
> > > 
> > > Once this guest completes setup, other guests looking to access the
> > > device can be booted with the same configuration options, but with
> > >   --mhd-head != 0,
> > >   --mhd-init=false,
> > >   and a different QMP socket.
> > > 
> > > Signed-off-by: Gregory Price <gou...@gourry.net>
> > > Signed-off-by: Svetly Todorov <svetly.todo...@memverge.com>
> > 
> > A few trivial things inline.
> > 
> > In general the scheme looks workable, but I'm not sure the constraints at
> > setup time etc. are suitable for an upstream solution. Certainly a useful
> > tool to have for kernel development though, so I'll try and find time in
> > the next few days to apply this on my gitlab tree.
> > 
> > Longer term I think we need a more complex external program or a main /
> > proxy type arrangement so that ordering requirements can be enforced
> 
> I marginally disagree. We have to check ownership during memory use.
> 
> We should try not to have an external process dependency for dereferencing a
> pointer backed by an emulated DCD device. The current overhead is bad enough.
> 
> The shared memory use here mostly limits that overhead to cache invalidation,
> and keeps the entire system fairly simple.
> 
> All this is to say - we erred on the side of keeping it simple, even if it
> has a few stupid footguns. Obviously open to ideas, though.

Need to mess around with it, but I don't in general think each host needs
to see the global protection map - they just need to know what they can
access. I don't think anyone is going to do 2MiB granularity on a real
device (or at least not for bulk capacity), so the necessary data may not
be that large. Obviously the cost of updating it will be higher, but meh,
that's not that common an operation.
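To put some rough numbers on that (purely illustrative assumptions on my
part - one presence bit per block, per head, for a 4TB device - not figures
from the patch):

  4TB / 2MiB   = 2M blocks  -> 256KiB bitmap per head
  4TB / 256MiB = 16K blocks -> 2KiB bitmap per head
  4TB / 1GiB   = 4K blocks  -> 512 byte bitmap per head

so a per-head "what can I access" view stays small even at fairly fine
granularity.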
> > and we can have
> > richer info. Having to chat to each qmp interface independently works fine,
> > but is also a bit more complex than I think we would eventually want.
> 
> This is a small component in someone's fabric manager that translates their
> requests into QMP commands. Whatever we ultimately decide on, the complexity
> here is about the same.

It's the qmp commands to a bunch of different qemu instances pretending to be
one Multi Head device. Sure we can insert a proxy to act as that one, but then
we more or less have an external process mediating that.

> > Having a solution in place though will make it much easier to move towards
> > an eventual upstreamable solution. This is a great place to start from!
> > 
> > Jonathan
> > 
> > > +/*
> > > + * We limit the number of heads to prevent the shared state
> > > + * region from becoming a major memory hog. We need 512MB of
> > > + * memory space to track 8-host ownership of 4GB of memory in
> > > + * blocks of 2MB. This can change if the block size is increased.
> > 
> > I'm lost what makes up that size?
> 
> I think the math is wrong here, we may have calculated based on
> a larger capacity. I need to go back and look at how we came to 512MB.

4TB maybe?

Jonathan

> ~Gregory
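(Back-of-the-envelope on the 512MB figure, assuming one ownership byte per
2MB block - i.e. one bit per head for up to 8 heads - which may not match
what the state struct actually stores:

  4GB / 2MB = 2K blocks -> ~2KB of state
  4TB / 2MB = 2M blocks -> ~2MB of state
  512MB of state at one byte per 2MB block would imply ~1PB of tracked capacity

so even 4TB doesn't get near 512MB unless each block carries much more than
a byte.)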