On Thu, 10 Mar 2022 16:02:22 +0800 Peter Xu <pet...@redhat.com> wrote:
> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> > Hi Peter,
> 
> Hi, Jonathan,
> 
> > > 
> > > https://lore.kernel.org/qemu-devel/20220306174137.5707-35-jonathan.came...@huawei.com/
> > > 
> > > Having mr->ops set but with memory_access_is_direct() returning true
> > > sounds weird to me.
> > > 
> > > Sorry to have no understanding of the whole picture, but.. could you
> > > share more on what's the interleaving requirement on the proxying,
> > > and why it can't be done with adding some IO memory regions as
> > > sub-regions upon the file one?
> > 
> > The proxying requirement is simply a means to read/write to a computed
> > address within a memory region.  There may well be a better way to do
> > that.
> > 
> > If I understand your suggestion correctly you would need a very high
> > number of IO memory regions to be created dynamically when particular
> > sets of registers across multiple devices in the topology are all
> > programmed.
> > 
> > The interleave can be 256 bytes across up to 16x, many terabyte,
> > devices.  So assuming a simple set of 16 1TB devices I think you'd
> > need about 4x10^9 IO regions.  Even for a minimal useful test case of
> > the largest interleave set of 16x 256MB devices (256MB is the minimum
> > size the specification allows per decoded region at the device) and
> > 16-way interleave we'd need 10^6 IO regions.  Any idea if that
> > approach would scale sensibly to this number of regions?
> > 
> > There are also complexities in getting all the information in one
> > place to work out which IO memory regions map where in PA space.  The
> > current solution is to do that mapping in the same way the hardware
> > does, which is hierarchical, so we walk the path to the device,
> > picking directions based on each interleave decoder that we meet.
> > Obviously this is a bit slow, but I only really care about correctness
> > at the moment.  I can think of various approaches to speeding it up
> > but I'm not sure if we will ever care about performance.
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > has the logic for that, and as you can see it's fairly simple because
> > we are always going down the topology following the decoders.
> > 
> > Below I have mapped out an algorithm I think would work for doing it
> > with IO memory regions as subregions.
> > 
> > We could fake the whole thing by limiting ourselves to small host
> > memory windows which are always directly backed, but then I wouldn't
> > achieve the main aim of this, which is to provide a test base for the
> > OS code.  To do that I need real interleave so I can seed the files
> > with test patterns and verify the accesses hit the correct locations.
> > Emulating what the hardware is actually doing on a device-by-device
> > basis is the easiest way I have come up with to do that.
> > 
> > Let me try to provide some more background so you hopefully don't have
> > to have read the specs to follow what is going on!
> > There is an example of a directly connected (no switches) topology in
> > the docs:
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> > 
> > The overall picture is we have a large number of CXL Type 3 memory
> > devices, which at runtime (by the OS at boot/on hotplug) are
> > configured into various interleaving sets with hierarchical decoding
> > at the host + host bridge + switch levels.
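
Interjecting on my own text here to make "hierarchical decoding" a bit
more concrete before the details below: for the power-of-2 cases each
decoder level is essentially a bit-slice on the incoming address.  A
standalone sketch of that arithmetic (not the QEMU code - the function
name and the worked numbers are purely illustrative):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Power-of-2 interleave decode done (conceptually) at each level: with
 * 2^iw ways and 2^ig byte granularity, the way-select bits are
 * hpa_offset[ig + iw - 1 : ig]; dropping them gives the offset the next
 * level (or the device) sees.  The real decoders then add their DPA
 * base/skip on top of this.
 */
static void decode(uint64_t hpa_offset, unsigned ig, unsigned iw,
                   unsigned *way, uint64_t *next_offset)
{
    *way = (hpa_offset >> ig) & ((1u << iw) - 1);
    *next_offset = (hpa_offset & ((1ull << ig) - 1)) |    /* keep low bits */
                   ((hpa_offset >> (ig + iw)) << ig);     /* drop way bits */
}

int main(void)
{
    unsigned way;
    uint64_t off;

    /* 16-way, 256 byte granularity: offset 0x1234 -> way 2, offset 0x134 */
    decode(0x1234, 8, 4, &way, &off);
    printf("way %u, offset 0x%" PRIx64 "\n", way, off);
    return 0;
}

Chaining that decode at the CFMW, host bridge, (optional) switch and end
point levels is exactly the walk described below.
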
> > For test setups I probably need to go to around 32 devices so I can
> > hit various configurations simultaneously.
> > No individual device has visibility of the full interleave setup -
> > hence the walk in the existing code through the various decoders to
> > find the final Device Physical Address.
> > 
> > At the host level the host provides a set of Physical Address windows
> > with a fixed interleave decoding across the different host bridges in
> > the system (CXL Fixed Memory windows, CFMWs).
> > On a real system these have to be large enough to allow for any memory
> > devices that might be hotplugged and all possible configurations (so
> > with 2 host bridges you need at least 3 windows in the many-TB range,
> > and it gets much worse as the number of host bridges goes up).  It'll
> > be worse than this when we have QoS groups, but the current QEMU code
> > just puts all the windows in group 0.  Hence my first thought of just
> > putting memory behind those doesn't scale (a similar approach to this
> > was in the earliest versions of this patch set - though the full
> > access path wasn't wired up).
> > 
> > The granularity can be in powers of 2 from 256 bytes to 16 kbytes.
> > 
> > Next, each host bridge has programmable address decoders which take
> > the incoming (often already interleaved) memory accesses and direct
> > them to the appropriate root ports.  The root ports can be connected
> > to a switch which has additional address decoders in the upstream port
> > to decide which downstream port to route to.  Note we currently only
> > support 1 level of switches, but it's easy to make this algorithm
> > recursive to support multiple switch levels (currently the kernel
> > proposals only support 1 level).
> > 
> > Finally the end point with the actual memory receives the interleaved
> > request, takes the full address, and (for power-of-2 decoding - we
> > don't yet support 3, 6 and 12 way, which is more complex and there is
> > no kernel support yet) drops a few address bits and adds an offset for
> > the decoder used, to calculate its own device physical address.  Note
> > the device will support multiple interleave sets for different parts
> > of its file once we add multiple decoder support (on the todo list).
> > 
> > So the current solution is straightforward (with the exception of that
> > proxying) because it follows the same decoding as used in real
> > hardware to route the memory accesses.  As a result we get a
> > read/write to a device physical address and hence proxy that.  If any
> > of the decoders along the path are not configured then we error out at
> > that stage.
> > 
> > To create the equivalent as IO subregions I think we'd have to do the
> > following (this might be mediated by some central entity that doesn't
> > currently exist, or done on demand from whichever CXL device happens
> > to have its decoder set up last):
> > 
> > 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > 2) Walk the topology (up to host decoder, down to memory device).
> >    If a complete interleaving path has been configured -
> >    i.e. we have committed decoders all the way to the memory
> >    device - goto step 3, otherwise return to step 1 to wait for
> >    more decoders to be committed.
> > 3) For the memory region being supplied by the memory device,
> >    add subregions to map the device physical address (address
> >    in the file) for each interleave stride to the appropriate
> >    host Physical Address.
> > 4) Return to step 1 to wait for more decoders to commit.
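
To make step 3 above concrete (and to show where the region counts
earlier in this mail come from), very rough pseudo-code - not something
I have implemented, and the surrounding plumbing is hand-waved - might
look something like:

/*
 * Hypothetical sketch only: one alias MemoryRegion per interleave
 * granule per device.  fw_mr is the CFMW region, dev the type 3 device,
 * dev_mr its file-backed region, gran the granularity, ways the number
 * of interleave ways and way this device's position in the set.
 */
for (uint64_t dpa = 0; dpa < memory_region_size(dev_mr); dpa += gran) {
    hwaddr hpa_off = ((dpa / gran) * ways + way) * gran;
    MemoryRegion *alias = g_new0(MemoryRegion, 1);

    memory_region_init_alias(alias, OBJECT(dev), "cxl-interleave-stride",
                             dev_mr, dpa, gran);
    memory_region_add_subregion(fw_mr, hpa_off, alias);
}

For a 1TB device at 256 byte granularity that loop runs about 4x10^9
times for a single device (about 10^6 for the 256MB device case), which
is where the numbers above come from.
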
> > 
> > So the summary is we can do it with IO regions, but there are a lot of
> > them and the setup is somewhat complex as we don't have one single
> > point in time where we know all the necessary information is available
> > to compute the right addresses.
> > 
> > Looking forward to your suggestions if I haven't caused more confusion!

Hi Peter,

> 
> Thanks for the write up - I must confess they're a lot! :)
> 
> I merely only learned what is CXL today, and I'm not very experienced on
> device modeling either, so please bear with me with stupid questions..
> 
> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> That's a normal IO region, which looks very reasonable.
> 
> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> memory_region_ram_init_from_file" helped.
> 
> Per my knowledge, all the memory accesses upon this CFMW window trapped
> using this IO region already.  There can be multiple memory file objects
> underneath, and when read/write happens the object will be decoded from
> cxl_cfmws_find_device() as you referenced.

Yes.

> 
> However I see nowhere that these memory objects got mapped as sub-regions
> into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
> be trapped.

As you note they aren't mapped into the parent mr, hence we are trapping.
The parent mem_ops are responsible for decoding 'which device' and 'what
address in device memory space'.  Once we've gotten that info, the
question is how do I actually do the access?

Mapping as subregions seems unwise due to the huge number required.

> 
> To ask in another way: what will happen if you simply revert this RFC
> patch?  What will go wrong?

The call to memory_region_dispatch_read()

https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556

would call memory_region_access_valid(), which calls
mr->ops->valid.accepts(), which is set to unassigned_mem_accepts(), and
hence... you get a MEMTX_DECODE_ERROR back and an exception in the
guest.

That wouldn't happen with a non-proxied access to the RAM, as those
paths never use the ops: memory_access_is_direct() is called and a
simple memcpy is used without any involvement of the ops.

Is there a better way to proxy those writes to the backing files?

I was fishing a bit in the dark here and saw the existing ops defined
for a different purpose for VFIO in

4a2e242bbb ("memory: Don't use memcpy for ram_device regions")

and those allowed the use of memory_region_dispatch_write() to work.

Hence the RFC marking on that patch :)

Thanks,

Jonathan

> 
> Thanks,
> 
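
p.s. For completeness, the shape of the proxied read discussed above,
heavily simplified from the cxl_type3.c code linked earlier (the decode
helper name and the hostmem field access here are placeholders for
whatever the real lookup ends up being, and error handling is trimmed):

static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
                                  unsigned size, MemTxAttrs attrs)
{
    CXLType3Dev *ct3d;
    uint64_t dpa;
    MemoryRegion *mr;

    /* Walk the committed decoders to find the target device and its DPA. */
    if (!cxl_cfmws_find_device_and_dpa(opaque, addr, &ct3d, &dpa)) {
        return MEMTX_DECODE_ERROR;
    }
    mr = host_memory_backend_get_memory(ct3d->hostmem);

    /*
     * Without ops on the file-backed region this is where it falls over:
     * memory_region_access_valid() ends up in unassigned_mem_accepts()
     * and we get MEMTX_DECODE_ERROR back.
     */
    return memory_region_dispatch_read(mr, dpa, data, size_memop(size), attrs);
}

If there is a cleaner way to get from "MemoryRegion + offset within it"
to the backing store than going via the ops, that would presumably
remove the need for the RFC patch entirely.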