On Tue, May 22, 2018 at 10:28:56PM -0600, Alex Williamson wrote:
> On Wed, 23 May 2018 02:38:52 +0300
> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> 
> > On Tue, May 22, 2018 at 03:47:41PM -0600, Alex Williamson wrote:
> > > On Wed, 23 May 2018 00:44:22 +0300
> > > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> > > 
> > > > On Tue, May 22, 2018 at 03:36:59PM -0600, Alex Williamson wrote:
> > > > > On Tue, 22 May 2018 23:58:30 +0300
> > > > > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> > > > > 
> > > > > > It's not hard to think of a use-case where >256 devices
> > > > > > are helpful, for example a nested virt scenario where
> > > > > > each device is passed on to a different nested guest.
> > > > > > 
> > > > > > But I think the main feature this is needed for is numa modeling.
> > > > > > Guests seem to assume a numa node per PCI root, ergo we need more
> > > > > > PCI roots.
> > > > > 
> > > > > But even if we have NUMA affinity per PCI host bridge, a PCI host
> > > > > bridge does not necessarily imply a new PCIe domain.
> > > > 
> > > > What are you calling a PCIe domain?
> > > 
> > > Domain/segment
> > > 
> > > 0000:00:00.0
> > > ^^^^ This
> > 
> > Right. So we can thinkably have PCIe root complexes share an ACPI segment.
> > I don't see what this buys us by itself.
> 
> The ability to define NUMA locality for a PCI sub-hierarchy while
> maintaining compatibility with non-segment aware OSes (and firmware).
For sure, but NUMA is kind of an advanced topic, and MCFG has been around
for longer than the various NUMA tables.  Are there really non-segment
aware guests that also know how to make use of NUMA?

> > > Isn't that the only reason we'd need a new MCFG section and the reason
> > > we're limited to 256 buses?  Thanks,
> > > 
> > > Alex
> > 
> > I don't know whether a single MCFG section can describe multiple roots.
> > I think it would be certainly unusual.
> 
> I'm not sure here if you're referring to the actual MCFG ACPI table or
> the MMCONFIG range, aka the ECAM.  Neither of these describe PCI host
> bridges.  The MCFG table can describe one or more ECAM ranges, which
> provides the ECAM base address, the PCI segment associated with that
> ECAM and the start and end bus numbers to know the offset and extent of
> the ECAM range.  PCI host bridges would then theoretically be separate
> ACPI objects with _SEG and _BBN methods to associate them to the
> correct ECAM range by segment number and base bus number.  So it seems
> that tooling exists that an ECAM/MMCONFIG range could be provided per
> PCI host bridge, even if they exist within the same domain, but in
> practice what I see on systems I have access to is a single MMCONFIG
> range supporting all of the host bridges.  It also seems there are
> numerous ways to describe the MMCONFIG range and I haven't actually
> found an example that seems to use the MCFG table.  Two have MCFG
> tables (that don't seem terribly complete) and the kernel claims to
> find the MMCONFIG via e820, another doesn't even have an MCFG table and
> the kernel claims to find MMCONFIG via an ACPI motherboard resource.
> I'm not sure if I can enable PCI segments on anything to see how the
> firmware changes.  Thanks,
> 
> Alex

Let me clarify.  So MCFG has base address allocation structures; each one
maps a segment and a range of bus numbers into memory.  That structure is
what I meant.

IIUC you are saying that on your systems everything is within a single
segment, right?  Multiple PCI hosts map into a single segment?

If you do this you can do NUMA, but you do not gain >256 devices.
Are we on the same page then?
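For reference, each of those base address allocation structures is laid
out roughly as below.  This is a sketch following the field layout in the
ACPI/PCI Firmware specs; the struct and helper names are illustrative,
not copied from any particular header.

#include <stdint.h>

/*
 * One MCFG "Configuration Space Base Address Allocation Structure".
 * Each entry maps one (segment, bus range) pair to an ECAM window in
 * memory.
 */
struct mcfg_allocation {
        uint64_t base_address;   /* ECAM base for this segment/bus range */
        uint16_t pci_segment;    /* PCI segment group (domain) number    */
        uint8_t  start_bus;      /* first bus number covered             */
        uint8_t  end_bus;        /* last bus number covered              */
        uint32_t reserved;
} __attribute__((packed));

/*
 * Offset of a function's config space within that ECAM window.  The
 * 8-bit bus field is why a single segment tops out at 256 buses.
 */
static inline uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn)
{
        return ((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
               ((uint64_t)fn << 12);
}

-- 
MST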