On Tue, May 22, 2018 at 03:17:32PM -0600, Alex Williamson wrote:
> On Tue, 22 May 2018 21:51:47 +0200
> Laszlo Ersek <ler...@redhat.com> wrote:
>
> > On 05/22/18 21:01, Marcel Apfelbaum wrote:
> > > Hi Laszlo,
> > >
> > > On 05/22/2018 12:52 PM, Laszlo Ersek wrote:
> > >> On 05/21/18 13:53, Marcel Apfelbaum wrote:
> > >>>
> > >>> On 05/20/2018 10:28 AM, Zihan Yang wrote:
> > >>>> Currently only q35 host bridge us allocated space in MCFG table. To
> > >>>> put pxb host
> > >>>> into sepratate pci domain, each of them should have its own
> > >>>> configuration space
> > >>>> int MCFG table
> > >>>>
> > >>>> Signed-off-by: Zihan Yang <whois.zihan.y...@gmail.com>
> > >>>> ---
> > >>>>  hw/i386/acpi-build.c | 83
> > >>>> +++++++++++++++++++++++++++++++++++++++-------------
> > >>>>  1 file changed, 62 insertions(+), 21 deletions(-)
> > >>>>
> > >>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > >>>> index 9bc6d97..808d815 100644
> > >>>> --- a/hw/i386/acpi-build.c
> > >>>> +++ b/hw/i386/acpi-build.c
> > >>>> @@ -89,6 +89,7 @@
> > >>>>  typedef struct AcpiMcfgInfo {
> > >>>>      uint64_t mcfg_base;
> > >>>>      uint32_t mcfg_size;
> > >>>> +    struct AcpiMcfgInfo *next;
> > >>>>  } AcpiMcfgInfo;
> > >>>>  typedef struct AcpiPmInfo {
> > >>>> @@ -2427,14 +2428,15 @@ build_mcfg_q35(GArray *table_data, BIOSLinker
> > >>>> *linker, AcpiMcfgInfo *info)
> > >>>>  {
> > >>>>      AcpiTableMcfg *mcfg;
> > >>>>      const char *sig;
> > >>>> -    int len = sizeof(*mcfg) + 1 * sizeof(mcfg->allocation[0]);
> > >>>> +    int len, count = 0;
> > >>>> +    AcpiMcfgInfo *cfg = info;
> > >>>> +    while (cfg) {
> > >>>> +        ++count;
> > >>>> +        cfg = cfg->next;
> > >>>> +    }
> > >>>> +    len = sizeof(*mcfg) + count * sizeof(mcfg->allocation[0]);
> > >>>>      mcfg = acpi_data_push(table_data, len);
> > >>>> -    mcfg->allocation[0].address = cpu_to_le64(info->mcfg_base);
> > >>>> -    /* Only a single allocation so no need to play with segments */
> > >>>> -    mcfg->allocation[0].pci_segment = cpu_to_le16(0);
> > >>>> -    mcfg->allocation[0].start_bus_number = 0;
> > >>>> -    mcfg->allocation[0].end_bus_number =
> > >>>> PCIE_MMCFG_BUS(info->mcfg_size - 1);
> > >>>>      /* MCFG is used for ECAM which can be enabled or disabled by
> > >>>> guest.
> > >>>>       * To avoid table size changes (which create migration issues),
> > >>>> @@ -2448,6 +2450,17 @@ build_mcfg_q35(GArray *table_data, BIOSLinker
> > >>>> *linker, AcpiMcfgInfo *info)
> > >>>>      } else {
> > >>>>          sig = "MCFG";
> > >>>>      }
> > >>>> +
> > >>>> +    count = 0;
> > >>>> +    while (info) {
> > >>>> +        mcfg[count].allocation[0].address =
> > >>>> cpu_to_le64(info->mcfg_base);
> > >>>> +        /* Only a single allocation so no need to play with
> > >>>> segments */
> > >>>> +        mcfg[count].allocation[0].pci_segment = cpu_to_le16(count);
> > >>>> +        mcfg[count].allocation[0].start_bus_number = 0;
> > >>>> +        mcfg[count++].allocation[0].end_bus_number =
> > >>>> PCIE_MMCFG_BUS(info->mcfg_size - 1);
> > >>> An interesting point is if we want to limit the MMFCG size for PXBs, as
> > >>> we may not be
> > >>> interested to use all the buses in a specific domain. For each bus we
> > >>> require some
> > >>> address space that remains unused.
> > >>>
> > >>>> +        info = info->next;
> > >>>> +    }
> > >>>> +
> > >>>>      build_header(linker, table_data, (void *)mcfg, sig, len, 1,
> > >>>> NULL, NULL);
> > >>>>  }
> > >>>> @@ -2602,26 +2615,52 @@ struct AcpiBuildState {
> > >>>>      MemoryRegion *linker_mr;
> > >>>>  } AcpiBuildState;
> > >>>> -static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
> > >>>> +static inline void cleanup_mcfg(AcpiMcfgInfo *mcfg)
> > >>>> +{
> > >>>> +    AcpiMcfgInfo *tmp;
> > >>>> +    while (mcfg) {
> > >>>> +        tmp = mcfg->next;
> > >>>> +        g_free(mcfg);
> > >>>> +        mcfg = tmp;
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +static AcpiMcfgInfo *acpi_get_mcfg(void)
> > >>>>  {
> > >>>>      Object *pci_host;
> > >>>>      QObject *o;
> > >>>> +    AcpiMcfgInfo *head = NULL, *tail, *mcfg;
> > >>>>      pci_host = acpi_get_i386_pci_host();
> > >>>>      g_assert(pci_host);
> > >>>> -    o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_BASE,
> > >>>> NULL);
> > >>>> -    if (!o) {
> > >>>> -        return false;
> > >>>> +    while (pci_host) {
> > >>>> +        mcfg = g_new0(AcpiMcfgInfo, 1);
> > >>>> +        mcfg->next = NULL;
> > >>>> +        if (!head) {
> > >>>> +            tail = head = mcfg;
> > >>>> +        } else {
> > >>>> +            tail->next = mcfg;
> > >>>> +            tail = mcfg;
> > >>>> +        }
> > >>>> +
> > >>>> +        o = object_property_get_qobject(pci_host,
> > >>>> PCIE_HOST_MCFG_BASE, NULL);
> > >>>> +        if (!o) {
> > >>>> +            cleanup_mcfg(head);
> > >>>> +            g_free(mcfg);
> > >>>> +            return NULL;
> > >>>> +        }
> > >>>> +        mcfg->mcfg_base = qnum_get_uint(qobject_to(QNum, o));
> > >>>> +        qobject_unref(o);
> > >>>> +
> > >>>> +        o = object_property_get_qobject(pci_host,
> > >>>> PCIE_HOST_MCFG_SIZE, NULL);
> > >>> I'll let Igor to comment on the APCI bits, but the idea of querying each
> > >>> PCI host
> > >>> for the MMFCG details and put it into a different table sounds good
> > >>> to me.
> > >>>
> > >>> [Adding Laszlo for his insights]
> > >> Thanks for the CC -- I don't have many (positive) insights here to
> > >> offer, I'm afraid. First, the patch set (including the blurb) doesn't
> > >> seem to explain *why* multiple host bridges / ECAM areas are a good
> > >> idea.
> > >
> > > The issue we want to solve is the hard limitation of 256 PCIe devices
> > > that can be used in a Q35 machine.
>
> Isn't it interesting that conventional PCI can easily support so many
> more devices? Sorta makes you wonder why we care that virtual devices
> are express rather than conventional for a high density configuration...
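
For reference, the back-of-the-envelope ECAM arithmetic behind that 256-device
limit and the per-bus waste mentioned above -- an illustrative sketch only,
not code from the patch:

    #include <stdio.h>

    int main(void)
    {
        const unsigned long cfg_per_fn  = 4096;                /* 4 KiB per function      */
        const unsigned long per_bus     = cfg_per_fn * 8 * 32; /* 8 fns x 32 devs = 1 MiB */
        const unsigned long per_segment = per_bus * 256;       /* 256 buses = 256 MiB     */

        printf("ECAM per bus:     %lu KiB\n", per_bus / 1024);
        printf("ECAM per segment: %lu MiB\n", per_segment >> 20);
        /* With one hotpluggable endpoint per non-root bus, one segment tops
         * out at roughly 255 usable devices, hence the limit discussed here. */
        return 0;
    }
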
Because they are express in real life?

> > Implying that there are use cases for which ~256 PCIe devices aren't
> > enough. I remain unconvinced until proved wrong :)
> >
> > Anyway, a significant source of waste comes from the restriction that we
> > can only put 1 device (with up to 8 functions) on each non-root-complex
> > PCI Express bus (such as root ports and downstream ports). This forfeits
> > a huge portion of the ECAM area (about 31/32th) that we already have.
> > Rather than spending more MMIO guest-phys address range on new
> > discontiguous ECAM ranges, I'd prefer if we could look into ARI. I seem
> > to recall from your earlier presentation that ARI could recover that
> > lost address space (meaning both ECAM ranges and PCIe B/D/F address space).
>
> How does ARI solve the hotplug problem? ARI is effectively
> multifunction on steroids, the ARI capability in each function points
> to the next function number so that we don't need to scan the entire
> devfn address space per bus (an inefficiency we don't care about when
> there are only 8 function). So yeah, we can fill an entire bus with
> devices with ARI, but they're all rooted at 00.0.
>
> > There are signs that the edk2 core supports ARI if the underlying
> > platform supports it. (Which is not the case with multiple PCIe domains
> > / multiple ECAM ranges.)
>
> It's pretty surprising to me that edk2 wouldn't already have support
> for multiple PCIe domains, they're really not *that* uncommon. Some
> architectures make almost gratuitous use of PCIe domains. I certainly
> know there were UEFI ia64 platforms with multiple domains.
>
> > ARI support could also help aarch64/virt. Eric (CC'd) has been working
> > on raising the max PCIe bus count from *16* to 256 for aarch64/virt, and
> > AFAIR one of the challenges there has been finding a good spot for the
> > larger ECAM in guest-phys address space. Fiddling with such address maps
> > is always a pain.
> >
> > Back to x86, the guest-phys address space is quite crowded too. The
> > 32-bit PCI MMIO aperture (with the neighboring ECAM and platform MMIO
> > areas such as LAPIC, IO-APIC, HPET, TPM, pflash, ...) is always a scarce
> > resource. Plus, reaching agreement between OVMF and QEMU on the exact
> > location of the 32-bit PCI MMIO aperture has always been a huge pain; so
> > you'd likely shoot for 64-bit.
>
> Why do we need to equally split 32-bit MMIO space between domains? Put
> it on domain 0 and require devices installed into non-zero domains have
> no 32-bit dependencies.

+1

> > But 64-bit is ill-partitioned and/or crowded too: first you have the
> > cold-plugged >4GB DRAM (whose size the firmware can interrogate), then
> > the hotplug DIMM area up to "etc/reserved-memory-end" (ditto), then the
> > 64-bit PCI MMIO aperture (whose size the firmware decides itself,
> > because QEMU cannot be asked about it presently). Placing the additional
> > MMCFGs somewhere would need extending the total order between these
> > things at the design level, more fw_cfg files, more sanity checks in
> > platform code in the firmware, and, quite importantly, adding support to
> > multiple discontiguous MMCFGs to core edk2 code.
>
> Hmm, we're doing something wrong if we can't figure out some standards
> within QEMU for placing per domain 64-bit MMIO and MMCFG ranges.

I thought in this patch it is done by firmware.

> > I don't know much about ARI but it already looks a whole lot more
> > attractive to me.
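
To illustrate the ARI enumeration Alex describes above, a rough sketch of
the walk -- read_config16() and find_ext_cap() are assumed helpers here,
not real QEMU or kernel APIs:

    #include <stdint.h>

    #define PCI_EXT_CAP_ID_ARI  0x0e
    #define ARI_CAP_REG         0x04   /* bits 15:8 = next function number */

    extern uint16_t read_config16(uint8_t bus, uint8_t fn, uint16_t off);
    extern uint16_t find_ext_cap(uint8_t bus, uint8_t fn, uint16_t cap_id);

    static void ari_scan_bus(uint8_t bus)
    {
        uint8_t fn = 0;                 /* everything is rooted at 00.0 */

        do {
            /* ... probe/configure function 'fn' here ... */

            uint16_t ari = find_ext_cap(bus, fn, PCI_EXT_CAP_ID_ARI);
            if (!ari) {
                break;                  /* no ARI: fall back to plain multifunction */
            }
            fn = read_config16(bus, fn, ari + ARI_CAP_REG) >> 8;
        } while (fn);                   /* next function number 0 ends the chain */
    }
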
> >
> > > We can use "PCI functions" to increase
> > > the number, but then we loose the hot-plugging granularity.
> > >
> > > Currently pxb-pcie host bridges share the same PCI domain (0)
> > > bus numbers, so adding more of them will not result in more usable
> > > devices (each PCIe device resides on its own bus),
> > > but putting each of them on a different domain removes
> > > that limitation.
> > >
> > > Another possible use (there is a good chance I am wrong, adding Alex to
> > > correct me),
> > > may be the "modeling" of a multi-function assigned device.
> > > Currently all the functions of an assigneddevice will be placed on
> > > different PCIe
> > > Root Ports "loosing" the connection between them.
> > >
> > > However, if we put all the functions of the same assigned device in the
> > > same
> > > PCI domain we may be able to "re-connect" them.
> >
> > OK -- why is that useful? What's the current practical problem with the
> > splitting you describe?
>
> There are very few cases where this is useful, things like associating
> USB companion devices or translating a guest bus reset to host bus
> reset when functions are split to separate virtual buses. That said, I
> have no idea how multiple domains plays a factor here. Regardless of
> how many PCIe domains a VM supports, we can easily use existing
> multifunction within a single domain to expose multifunction assigned
> devices in a way that better resembles the physical hardware.
> Multifunction PCIe endpoints cannot span PCIe domains.
>
> In fact, IOMMUs generally cannot span domains either, so we better also
> be thinking about at least a VT-d DHRD or vIOMMU per PCIe domain.
>
> > >> Second, supporting such would take very intrusive patches / large
> > >> feature development for OVMF (and that's not just for OvmfPkg and
> > >> ArmVirtPkg platform code, but also core MdeModulePkg code).
> > >>
> > >> Obviously I'm not saying this feature should not be implemented for QEMU
> > >> + SeaBIOS, just that don't expect edk2 support for it anytime soon.
> > >> Without a convincing use case, even the review impact seems hard to
> > >> justify.
> > >
> > > I understand, thank you for the feedback, the point was to be sure
> > > we don't decide something that *can't be done* in OVMF.
> >
> > Semantics :) Obviously, everything "can be done" in software; that's why
> > it's *soft*ware. Who is going to write it in practice? It had taken
> > years until the edk2 core gained a universal PciHostBridgeDxe driver
> > with a well-defined platform customization interface, and that interface
> > doesn't support multiple domains / segments. It had also taken years
> > until the same edk2 core driver gained support for nonzero MMIO address
> > translation (i.e. where the CPU view of system RAM addresses differs
> > from the PCI device view of the same, for DMA purposes) -- and that's
> > just a linear translation, not even about IOMMUs. The PCI core
> > maintainers in edk2 are always very busy, and upstreaming such changes
> > tends to take forever.
>
> Wow, that's surprising, ia64 was doing all of that on UEFI platforms a
> decade ago.
>
> > Again, I'm not opposing this feature per se. I just see any potential
> > support for it in edk2 very difficult. I remain unconvinced that ~256
> > PCIe devices aren't enough.
> > Even assuming I'm proved plain wrong there,
> > I'd still much prefer ARI (although I don't know much about it at this
> > point), because (a) the edk2 core already shows signs that it intends to
> > support ARI, (b) ARI would help aarch64/virt with nearly the same
> > effort, (c) it seems that ARI would address the problem (namely, wasting
> > MMCFG) head-on, rather than throwing yet more MMCFG at it.
>
> -ENOHOTPLUG From an assigned device perspective, this also implies
> that we're potentially adding a PCIe extended capability that may not
> exist on the physical device to the virtual view of the device.
> There's no guarantee that we can place that capability in the devices
> extended config space without stepping on registers that the hardware
> has hidden between the capabilities (for real, GPUs have registers in
> extended config space that aren't covered by capabilities) or the
> unlikely event that there's no more room in extended config space.
> Thanks,
>
> Alex
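
As a footnote to the patch at the top of the thread, this is roughly the MCFG
layout it aims for: one allocation entry per PCI segment (host bridge), each
covering buses 0-255 of its own ECAM window. The field layout follows the ACPI
MCFG allocation structure; the pxb base addresses and segment count below are
purely illustrative (placed above 4 GiB only for the sake of the example):

    #include <stdint.h>

    typedef struct {
        uint64_t address;           /* base of this segment's ECAM window */
        uint16_t pci_segment;       /* PCI segment group (domain) number  */
        uint8_t  start_bus_number;
        uint8_t  end_bus_number;
        uint32_t reserved;
    } McfgAllocation;

    static const McfgAllocation example_mcfg[] = {
        { 0xb0000000ULL,  0, 0, 255, 0 },   /* q35 host bridge, segment 0 */
        { 0x800000000ULL, 1, 0, 255, 0 },   /* first pxb-pcie, segment 1  */
        { 0x810000000ULL, 2, 0, 255, 0 },   /* second pxb-pcie, segment 2 */
    };
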