On 10/19/17 11:30, Marcel Apfelbaum wrote:
>
> Hi Laszlo,
>
> On 18/10/2017 14:40, Laszlo Ersek wrote:
>> Hi Marcel,
>>
>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>> Currently there is no MMIO range over 4G
>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>> depends on the number of cold-plugged PCI devices
>>> and other factors, it is very possible it is too small
>>> to hotplug PCI devices with large BARs.
>>>
>>> Fix it by reserving all the address space after
>>> RAM (and the reserved space for RAM hotplug),
>>> but no more than 40 bits. The latter seems to be
>>> QEMU's magic number for the guest physical CPU addressable
>>> bits and is safe with respect to migration.
>>>
>>> Note this is a regression since previous QEMU versions had
>>> some range reserved for 64bit PCI hotplug.
>>>
>>> Signed-off-by: Marcel Apfelbaum <mar...@redhat.com>
>>> ---
>>>  hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>  hw/pci-host/piix.c        | 10 ++++++++++
>>>  hw/pci-host/q35.c         |  9 +++++++++
>>>  include/hw/i386/pc.h      | 10 ++++++++++
>>>  include/hw/pci-host/q35.h |  1 +
>>>  5 files changed, 52 insertions(+)
>>
>> - What are the symptoms of the problem?
>>
>
> PCI devices with *relatively* large BARs can't be hot-plugged.
> The reason is QEMU does not reserve an MMIO range over 4G and
> uses whatever remains of the 32bit PCI window.
>
> If you coldplug enough PCI devices you will end up
> with a 64bit PCI window that covers only the cold-plugged PCI
> devices, still no room for hotplug.
>
>> - What QEMU commit regressed the previous functionality?
>>
>
> - commit 39848901818 pc: limit 64 bit hole to 2G by default
>   shows us QEMU had the 64bit PCI hole, so it is a regression.
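(To make the reservation described in the commit message above concrete,
here is a minimal, hypothetical C sketch of "reserve everything between
the end of RAM, plus the memory-hotplug reservation, and the 40-bit
boundary". The helper names and the 1 GiB alignment are assumptions for
illustration, not the actual patch code.)

/*
 * Hypothetical sketch only (not the actual patch): compute a 64-bit
 * PCI hotplug window covering everything between the end of RAM
 * (including the memory-hotplug reservation) and the 40-bit boundary
 * that QEMU uses as its default guest-physical address width.
 */
#include <stdint.h>

#define GUEST_PHYS_BITS  40
#define HOLE64_ALIGN     (1ULL << 30)   /* assume 1 GiB alignment */

/* Round addr up to the next multiple of align (align: power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * ram_end: highest guest-physical address reserved for (hotpluggable)
 * RAM.  Returns the window start; *size is 0 when nothing is left for
 * 64-bit PCI hotplug.
 */
static uint64_t pci_hole64_for_hotplug(uint64_t ram_end, uint64_t *size)
{
    uint64_t limit = 1ULL << GUEST_PHYS_BITS;   /* 1 TiB */
    uint64_t start = align_up(ram_end, HOLE64_ALIGN);

    *size = start < limit ? limit - start : 0;
    return start;
}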
(Side remark: this commit was part of release v1.6.0, and at that time
we also had the "etc/pci-info" fw_cfg file.)

> - commit 2028fdf3791 piix: use 64 bit window programmed by guest
>   leaves no room for PCI hotplug

(Side remark: part of v1.7.0, from 2013. So, we can call this a
regression, but then it's a very old one. Of course this is *not* a
counter-argument to your patch, or to the commit message wording; it's
just that I'm surprised that the issue could remain dormant this long --
usually users report regressions very quickly.)

>
>> - How exactly is the regression (and this fix) exposed to the guest?
>>
>
> The nuance is that both commits above computed the actual MMIO
> range used by cold-plugged devices, but as a side effect
> they influenced the remaining PCI window for hot-plugged devices.
>
> What the guest sees is the CRS MMIO range over 4G.

OK, understood.

>> I'm confused because semi-recently I've done multiple successful
>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>> 2016 guest.
>
> Cool! This is actually great news!
> Back to the issue at hand, a few things:
> 1. It works with Q35 only, but not with the I440FX machines.

OK.

> 2. The hints are used for the PCIe root ports, meaning for PCI
>    bridges, and address a slightly different issue: leaving hotplug
>    space for secondary buses, while this patch tackles hotplug
>    space for root buses.

OK.

>> By default, OVMF marks a 32GB area as 64-bit MMIO aperture, for BAR
>> allocation. This area is above the normal memory (plus reservation
>> for hotplug memory, if any). It is also GB-aligned. The aperture
>> size can be changed (currently) with an experimental -fw_cfg switch
>> (the fw_cfg file is called "opt/ovmf/X-PciMmio64Mb"; it states the
>> size of the 64-bit MMIO aperture as a decimal number of megabytes).
>
> I'll try to explain how I see the "picture":
> CRS is created by QEMU and computed *after* the firmware finishes
> the PCI BAR allocations.

Yes.

> Then *QEMU* decides how big the 64bit PCI hole will be,
> not the firmware -- see the past commits.

Yes, first the guest-side programming happens, then QEMU (re-)calculates
the ACPI payload once the right ACPI fw_cfg file is selected / read.
This re-calculation (including the CRS) observes the previously
programmed values.

> So IMHO you don't need opt/ovmf/X-PciMmio64Mb, since OVMF
> is not deciding this value. It actually should not care at all.

OVMF cannot *not* care. It must internally provide (to other,
platform-independent components in edk2) the 64-bit MMIO aperture that
the BARs can be allocated from.

But maybe I'm totally misunderstanding this patch! Are you saying that
the patch makes the following difference only: when the CRS is
recalculated (i.e., after the PCI enumeration has completed in the
guest, and the ACPI linker/loader fw_cfg files are opened), then you
only grow / round up the 64-bit range?

That would make sense, because a *superset* aperture exposed to the OS
via the _CRS does not invalidate any allocations done by the firmware
earlier. Furthermore, the firmware itself need not learn about the
details of said superset. In this sense I agree that OVMF should not
care at all.
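(For illustration, a minimal, hypothetical C sketch of the grow-only /
superset rule discussed above: the recomputed window may only lower or
preserve the base and only raise or preserve the limit. The type and
function names are made up, not QEMU's.)

/*
 * Hypothetical sketch only (not QEMU code): when the _CRS is
 * recomputed after the firmware has programmed the BARs, the exposed
 * 64-bit window may only be grown, never shrunk, so the firmware's
 * earlier allocations stay valid.
 */
#include <stdint.h>

typedef struct {
    uint64_t base;
    uint64_t limit;     /* inclusive */
} Range64;

/*
 * programmed: 64-bit range actually covered by firmware-programmed
 * BARs; reserved: hotplug window from the end of RAM up to the 40-bit
 * boundary.  The result only lowers/preserves the base and only
 * raises/preserves the limit.
 */
static Range64 crs_pci64_window(Range64 programmed, Range64 reserved)
{
    Range64 crs;

    crs.base  = programmed.base  < reserved.base  ? programmed.base
                                                  : reserved.base;
    crs.limit = programmed.limit > reserved.limit ? programmed.limit
                                                  : reserved.limit;
    return crs;
}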
[snip]

>> Also, how do you plan to integrate this feature with SeaBIOS?

(Ever since I sent my previous response, I've been glad that I finished
my email with this question -- I really think OVMF and SeaBIOS should
behave uniformly in this regard. So I'm happy about your answer:)

> Simple, no need :) . It just works as it should.
> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
> based on firmware input and other properties.
> I think it works also with OVMF out of the box.

OK, so this explains it all, thank you very much -- my question is
whether you could spell out those "other properties" in a bit more
detail (I apologize if these should be obvious to me already, from the
commit message and the patch):

(1) How exactly does the enlarged 64-bit hole remain a *superset* of
the firmware-programmed BARs? Can you point me to a location in the
code?

(2) What happens if we do have more than 1TB RAM?

> A last note: when pxb/pxb-pcie devices become hot-pluggable
> we need to leave some bits for them too. I wouldn't
> really want to add fw_cfg files for them too, or
> use a complicated fw_cfg, or rewrite
> the logic for all firmware if we don't have to.

OK.

> The 64bit PCI window starts from the memory (+reserved) end
> and finishes at 40bit.

OK, this seems to (partly) answer my question (1) -- can you please
confirm that the "finishes at 40bit" part does not imply forcefully
truncating the CRS at 1TB, in case the firmware already allocated BARs
higher than that?

> If you have 1T of RAM you
> get nothing for PCI hotplug, seems fair.

This is fine (as the answer to my question (2)) as long as you mean it
in the following sense: "QEMU will simply *not grow* the MMIO in the
_CRS past 1TB".

> When guests with more than 1T of RAM become
> frequent, I suppose QEMU's default CPU addressable
> bits will change and we will change it too.

Thanks Marcel, I feel a lot safer now. Please confirm that the changes
can only grow the area exposed in the _CRS -- i.e., only lower or
preserve the base, and only raise or preserve the absolute limit. I'll
shut up then :)

Thanks!
Laszlo
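(A tiny, self-contained example of the arithmetic discussed above --
the hypothetical 40-bit limit and the "1T of RAM leaves nothing for
hotplug" case. Purely illustrative, not QEMU code; the numbers are made
up for demonstration.)

/*
 * Illustrative only: how much 64-bit PCI hotplug space remains below
 * the assumed 40-bit boundary for a few example RAM sizes.
 */
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    const uint64_t limit = 1ULL << 40;   /* 40-bit boundary, 1 TiB */
    const uint64_t ram_end[] = {
        4ULL << 30,      /* 4 GiB of RAM   -> 1020 GiB hotplug window */
        512ULL << 30,    /* 512 GiB of RAM -> 512 GiB hotplug window  */
        1ULL << 40,      /* 1 TiB of RAM   -> nothing left            */
    };

    for (size_t i = 0; i < sizeof(ram_end) / sizeof(ram_end[0]); i++) {
        uint64_t size = ram_end[i] < limit ? limit - ram_end[i] : 0;
        printf("RAM ends at 0x%010" PRIx64 " -> %4" PRIu64
               " GiB left for 64-bit PCI hotplug\n",
               ram_end[i], size >> 30);
    }
    return 0;
}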