On Wed, 23 May 2018 02:38:52 +0300 "Michael S. Tsirkin" <m...@redhat.com> wrote:
> On Tue, May 22, 2018 at 03:47:41PM -0600, Alex Williamson wrote:
> > On Wed, 23 May 2018 00:44:22 +0300
> > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >
> > > On Tue, May 22, 2018 at 03:36:59PM -0600, Alex Williamson wrote:
> > > > On Tue, 22 May 2018 23:58:30 +0300
> > > > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> > > > >
> > > > > It's not hard to think of a use-case where >256 devices
> > > > > are helpful, for example a nested virt scenario where
> > > > > each device is passed on to a different nested guest.
> > > > >
> > > > > But I think the main feature this is needed for is numa modeling.
> > > > > Guests seem to assume a numa node per PCI root, ergo we need more PCI
> > > > > roots.
> > > >
> > > > But even if we have NUMA affinity per PCI host bridge, a PCI host
> > > > bridge does not necessarily imply a new PCIe domain.
> > >
> > > What are you calling a PCIe domain?
> >
> > Domain/segment
> >
> > 0000:00:00.0
> > ^^^^ This
>
> Right. So we can thinkably have PCIe root complexes share an ACPI segment.
> I don't see what this buys us by itself.

The ability to define NUMA locality for a PCI sub-hierarchy while
maintaining compatibility with non-segment aware OSes (and firmware).

> > Isn't that the only reason we'd need a new MCFG section and the reason
> > we're limited to 256 buses? Thanks,
> >
> > Alex
>
> I don't know whether a single MCFG section can describe multiple roots.
> I think it would be certainly unusual.

I'm not sure here if you're referring to the actual MCFG ACPI table or
the MMCONFIG range, aka the ECAM. Neither of these describe PCI host
bridges. The MCFG table can describe one or more ECAM ranges, which
provides the ECAM base address, the PCI segment associated with that
ECAM and the start and end bus numbers to know the offset and extent of
the ECAM range.

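To make the MCFG/ECAM relationship concrete, here is a rough sketch of
how one entry maps a segment:bus:dev.fn address to an MMIO address. The
names (EcamRange, find_range) are made up for illustration and are not
taken from QEMU or the kernel, and it assumes the usual convention that
the base address in an entry corresponds to bus 0 of that segment, with
4KB of config space per function:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class EcamRange:
    """One MCFG entry (hypothetical model, fields as described above)."""
    base: int        # ECAM base address, assumed to correspond to bus 0
    segment: int     # PCI segment (domain) number
    start_bus: int   # first bus decoded by this range
    end_bus: int     # last bus decoded by this range

    def window_start(self) -> int:
        # Offset into the ECAM: 1MB per bus (32 devs * 8 fns * 4KB).
        return self.base + (self.start_bus << 20)

    def size(self) -> int:
        # Extent of the ECAM: 1MB for every bus in [start_bus, end_bus].
        return (self.end_bus - self.start_bus + 1) << 20

    def config_address(self, bus: int, dev: int, fn: int, reg: int = 0) -> int:
        if not self.start_bus <= bus <= self.end_bus:
            raise ValueError("bus not decoded by this ECAM range")
        if not (0 <= dev < 32 and 0 <= fn < 8 and 0 <= reg < 4096):
            raise ValueError("invalid device, function or register")
        return self.base + ((bus << 20) | (dev << 15) | (fn << 12) | reg)


def find_range(ranges: Iterable[EcamRange], segment: int,
               bus: int) -> Optional[EcamRange]:
    """Pick the range covering a given segment and bus number, if any."""
    for r in ranges:
        if r.segment == segment and r.start_bus <= bus <= r.end_bus:
            return r
    return None
```

With a single entry of base 0xe0000000, segment 0, buses 0-255, device
0000:00:1f.3 lands at 0xe00fb000, and the whole range spans 256MB. A
second segment needs its own entry, which is where the 256 bus limit
per segment comes from.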
PCI host bridges would then theoretically be separate ACPI objects with
_SEG and _BBN methods to associate them with the correct ECAM range by
segment number and base bus number. So it seems the tooling exists to
provide an ECAM/MMCONFIG range per PCI host bridge, even if the bridges
exist within the same domain, but in practice what I see on systems I
have access to is a single MMCONFIG range supporting all of the host
bridges.

It also seems there are numerous ways to describe the MMCONFIG range,
and I haven't actually found an example that seems to use the MCFG
table. Two systems have MCFG tables (that don't seem terribly complete)
and the kernel claims to find the MMCONFIG via e820; another doesn't
even have an MCFG table and the kernel claims to find MMCONFIG via an
ACPI motherboard resource. I'm not sure if I can enable PCI segments on
anything to see how the firmware changes. Thanks,

Alex