On 10/19/17 13:29, Marcel Apfelbaum wrote:
> Hi Laszlo,
>
> On 19/10/2017 13:41, Laszlo Ersek wrote:
>> On 10/19/17 11:30, Marcel Apfelbaum wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On 18/10/2017 14:40, Laszlo Ersek wrote:
>>>> Hi Marcel,
>>>>
>>>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>>>> Currently there is no MMIO range over 4G
>>>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>>>> depends on the number of cold-plugged PCI devices
>>>>> and other factors, it is very possible it is too small
>>>>> to hotplug PCI devices with large BARs.
>>>>>
>>>>> Fix it by reserving all the address space after
>>>>> RAM (and the reserved space for RAM hotplug),
>>>>> but no more than 40 bits. The latter seems to be
>>>>> QEMU's magic number for the guest physical CPU addressable
>>>>> bits and is safe with respect to migration.
>>>>>
>>>>> Note this is a regression since previous QEMU versions had
>>>>> some range reserved for 64bit PCI hotplug.
>>>>>
>>>>> Signed-off-by: Marcel Apfelbaum <mar...@redhat.com>
>>>>> ---
>>>>>   hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>>>   hw/pci-host/piix.c        | 10 ++++++++++
>>>>>   hw/pci-host/q35.c         |  9 +++++++++
>>>>>   include/hw/i386/pc.h      | 10 ++++++++++
>>>>>   include/hw/pci-host/q35.h |  1 +
>>>>>   5 files changed, 52 insertions(+)
>>>>
>>>> - What are the symptoms of the problem?
>>>>
>>>
>>> PCI devices with *relatively* large BARs can't be hot-plugged.
>>> The reason is QEMU does not reserve an MMIO range over 4G and
>>> uses whatever remains in the 32bit PCI window.
>>>
>>> If you coldplug enough PCI devices you will end up
>>> with a 64bit PCI window that covers only the cold-plugged PCI
>>> devices, still no room for hotplug.
>>>
>>>> - What QEMU commit regressed the previous functionality?
>>>>
>>>
>>> - commit 39848901818 pc: limit 64 bit hole to 2G by default
>>>   shows us QEMU had the 64bit PCI hole, so it is a regression.
>>
>> (Side remark: this commit was part of release v1.6.0, and at that time
>> we also had the "etc/pci-info" fw_cfg file.)
>>
>>>
>>> - commit 2028fdf3791 piix: use 64 bit window programmed by guest
>>>   leaves no room for PCI hotplug
>>
>> (Side remark: part of v1.7.0, from 2013. So, we can call this a
>> regression, but then it's a very old one.
>>
>> Of course this is *not* a counter-argument to your patch, or commit
>> message wording; it's just that I'm surprised that the issue could
>> remain dormant this long -- usually users report regressions very
>> quickly.)
>>
>
> I also thought I was proposing a new feature, and I was pretty amazed
> that I'm actually fixing something that worked before.
>
>>>
>>>> - How exactly is the regression (and this fix) exposed to the guest?
>>>>
>>>
>>> The nuance is that both of the above commits computed the actual MMIO
>>> range used by cold-plugged devices, but as a side effect
>>> they influenced the remaining PCI window for hot-plugged devices.
>>>
>>> What the guest sees is the CRS MMIO range over 4G.
>>
>> OK, understood.
>>
>>>
>>>> I'm confused because semi-recently I've done multiple successful
>>>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>>>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>>>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>>>> 2016 guest.
>>>
>>> Cool! This is actually great news!
>>> Back to the issue at hand, a few things:
>>> 1. It works with Q35 only, but not with the I440FX machines.
>>
>> OK.
>>
>>> 2. The hints are used for the PCIe root ports, meaning for PCI bridges,
>>>    and address a slightly different issue: leave hotplug space
>>>    for secondary buses, while this patch tackles hotplug
>>>    space for root buses.
>>
>> OK.
>>
>>>
>>>>
>>>> By default, OVMF marks a 32GB area as 64-bit MMIO aperture, for BAR
>>>> allocation. This area is above the normal memory (plus reservation for
>>>> hotplug memory, if any). It is also GB-aligned. The aperture size can be
>>>> changed (currently) with an experimental -fw_cfg switch (the fw_cfg file
>>>> is called "opt/ovmf/X-PciMmio64Mb"; it states the size of the 64-bit
>>>> MMIO aperture as a decimal number of megabytes).
>>>
>>> I'll try to explain how I see the "picture":
>>> the CRS is created by QEMU and computed *after* the firmware finishes
>>> the PCI BAR allocations.
>>
>> Yes.
>>
>>> Then *QEMU* decides how big the 64bit PCI hole will be,
>>> not the firmware - see the past commits.
>>
>> Yes, first the guest-side programming happens, then QEMU (re-)calculates
>> the ACPI payload once the right ACPI fw_cfg file is selected / read.
>> This re-calculation (including the CRS) observes the previously
>> programmed values.
>>
>>>
>>> So IMHO you don't need the opt/ovmf/X-PciMmio64Mb, since OVMF
>>> is not deciding this value. It actually should not care at all.
>>
>> OVMF cannot *not* care. It must internally provide (to other,
>> platform-independent components in edk2) the 64-bit MMIO aperture that
>> the BARs can be allocated from.
>>
>>
>> But maybe I'm totally misunderstanding this patch! Are you saying that
>> the patch makes the following difference only: when the CRS is
>> recalculated (i.e., after the PCI enumeration has completed in the
>> guest, and the ACPI linker/loader fw_cfg files are opened), then you
>> only grow / round up the 64-bit range?
>>
>
> Exactly.
>
>> That would make sense, because a *superset* aperture exposed to the OS
>> via the _CRS does not invalidate any allocations done by the firmware
>> earlier. Furthermore, the firmware itself need not learn about the
>> details of said superset. In this sense I agree that OVMF should not
>> care at all.
>>
>
> Exactly, the range is a superset.
>
>> [snip]
>>
>>>> Also, how do you plan to integrate this feature with SeaBIOS?
>>
>> (Ever since I sent my previous response, I've been glad that I finished
>> my email with this question -- I really think OVMF and SeaBIOS should
>> behave uniformly in this regard. So I'm happy about your answer:)
>>
>>> Simple, no need :). It just works as it should.
>>> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
>>> based on firmware input and other properties.
>>> I think it works also with OVMF out of the box.
>>
>> OK, so this explains it all, thank you very much -- my question is
>> whether you could spell out those "other properties" in a bit more
>> detail (I apologize if these should be obvious to me already, from the
>> commit message and the patch):
>>
>
> "Other properties" was a vague link to max RAM, and things like that.
> Nothing specific.
>
>> (1) How exactly does the enlarged 64-bit hole remain a *superset* of the
>> firmware-programmed BARs? Can you point me to a location in the code?
>>
>
> Sure, in the patch itself:
> We first take the firmware-computed ranges:
>     pci_bus_get_w64_range(h->bus, &w64);
> Look at *get_pci_hole64_start -> if the firmware set it, we don't
> touch it.
> And at *get_pci_hole64_end -> we only update the value if it is smaller
> than what the firmware set.
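To make the superset logic above concrete, here is a minimal sketch in
the spirit of the quoted hunks. The names mirror PCI_HOST_PCI_HOLE64_END
and the *get_pci_hole64_start / *get_pci_hole64_end callbacks mentioned
in the thread, while "above_ram_start" is a made-up stand-in for "end of
RAM plus the hotplug reservation"; this is an illustration, not the
actual QEMU implementation.

#include <stdint.h>

#define PCI_HOST_PCI_HOLE64_END (1ULL << 40)  /* 40-bit guest-physical limit */

typedef struct {
    uint64_t begin; /* lowest 64bit BAR address programmed by the firmware */
    uint64_t end;   /* one past the highest programmed 64bit BAR */
} Range;

/* Start of the hole: if the firmware set it, don't touch it. */
static uint64_t get_pci_hole64_start_sketch(const Range *fw_w64,
                                            uint64_t above_ram_start)
{
    return fw_w64->begin ? fw_w64->begin : above_ram_start;
}

/* End of the hole: only grow what the firmware programmed, never shrink. */
static uint64_t get_pci_hole64_end_sketch(const Range *fw_w64)
{
    uint64_t value = fw_w64->end;

    if (value < PCI_HOST_PCI_HOLE64_END) {
        value = PCI_HOST_PCI_HOLE64_END;
    }
    return value;
}

The key point is that both helpers either preserve or widen the
firmware-programmed bounds, so the _CRS range stays a superset of the
allocated BARs.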
>
>
>> (2) What happens if we do have more than 1TB RAM?
>>
>
> You get no 64bit "hot-pluggable" reserved range.
> The code dealing with that is:
>
> +    if (hole64_start > PCI_HOST_PCI_HOLE64_END) {
> +        hole64_start = PCI_HOST_PCI_HOLE64_END;
> +    }
>
> And since the end is hard-coded to PCI_HOST_PCI_HOLE64_END,
> we have a 0 size region which gets discarded.
>
> An interesting case is if we have more than 1T of RAM and the
> firmware assigns BARs over that: we are increasing the
> 64bit window to include some of the RAM...
>
> A better approach is to check against the actual hole64
> end and not the magic value.
> I'll send a v3, thanks!
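Spelled out as code, the corner case Marcel describes looks roughly like
the sketch below. Only PCI_HOST_PCI_HOLE64_END comes from the quoted
hunks; "ram_end" and "fw_bar_end" are hypothetical stand-ins for "end of
RAM (+hotplug reservation)" and "highest firmware-programmed 64bit BAR
end", and the function is illustrative rather than actual QEMU code.

#include <stdint.h>

#define PCI_HOST_PCI_HOLE64_END (1ULL << 40)

/* v1-style computation: both bounds checked against the magic 40-bit value. */
static void compute_hole64_v1(uint64_t ram_end, uint64_t fw_bar_end,
                              uint64_t *start, uint64_t *end)
{
    *start = ram_end;
    if (*start > PCI_HOST_PCI_HOLE64_END) {
        *start = PCI_HOST_PCI_HOLE64_END;  /* clamp start to the magic value */
    }

    *end = fw_bar_end;
    if (*end < PCI_HOST_PCI_HOLE64_END) {
        *end = PCI_HOST_PCI_HOLE64_END;    /* grow end to the magic value */
    }

    /*
     * With ram_end > 1TB and fw_bar_end <= 1TB the result is an empty
     * range (start == end == 1TB), which simply gets discarded.  But if
     * the firmware placed BARs above 1TB (fw_bar_end > 1TB), the result
     * is [1TB, fw_bar_end), a non-empty window that overlaps RAM --
     * which is why checking against the actual hole64 end, instead of
     * the magic value, is the safer approach.
     */
}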
Ah, right; in v1 -- and possibly v2, I haven't looked yet -- hole64_start
will be set to PCI_HOST_PCI_HOLE64_END, in preparation for the end to be
set the exact same way (resulting in an empty range) -- however, the end
won't be modified at all if it's already above PCI_HOST_PCI_HOLE64_END,
so we'll end up with a non-empty range that intersects with RAM.

Please CC me on v3 too; I feel that I'm now better equipped to comment :)

Thanks!
Laszlo

>
>>> A last note: when pxb/pxb-pcie devices become hot-pluggable,
>>> we need to leave some bits for them too. I wouldn't
>>> really want to add fw_cfg files for them too, or
>>> use a complicated fw_cfg, or rewrite
>>> the logic for all firmware if we don't have to.
>>
>> OK.
>>
>>>
>>> The 64bit PCI window starts from the memory(+reserved) end
>>> and finishes at 40bit.
>>
>> OK, this seems to (partly) answer my question (1) -- can you please
>> confirm that the
>>
>>   finishes at 40bit
>>
>> part does not imply forcefully truncating the CRS at 1TB, in case the
>> firmware already allocated BARs higher than that?
>>
>
> I confirm:
>
> +    if (s->pci_hole64_fix && value < PCI_HOST_PCI_HOLE64_END) {
> +        value = PCI_HOST_PCI_HOLE64_END;
> +    }
>
> I touch the value only if it is smaller than what the firmware set.
>
>>> If you have 1T of RAM you
>>> get nothing for PCI hotplug, seems fair.
>>
>> This is fine (as the answer to my question (2)) as long as you mean it
>> in the following sense: "QEMU will simply *not grow* the MMIO in the
>> _CRS past 1TB".
>>
>>> When guests with more than 1T of RAM become
>>> frequent, I suppose QEMU's default CPU addressable
>>> bits will change and we will change it too.
>>
>> Thanks Marcel, I feel a lot safer now.
>>
>> Please confirm that the changes can only grow the area exposed in the
>> _CRS -- i.e., only lower or preserve the base, and only raise or
>> preserve the absolute limit. I'll shut up then :)
>>
>
> Confirmed and thanks for the review!
>
> Thanks,
> Marcel
>
>> Thanks!
>> Laszlo
>>
>
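As a closing illustration of the "grow only" invariant confirmed above
(only lower or preserve the base, only raise or preserve the limit), a
minimal sketch; the type and function names here are invented for the
example and do not come from the QEMU tree.

#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t base;
    uint64_t limit;
} MmioWindow;

/* fw: what the firmware programmed; reserved: QEMU's hotplug reservation. */
static MmioWindow crs_hole64_window(MmioWindow fw, MmioWindow reserved)
{
    MmioWindow crs = fw;

    if (reserved.base < crs.base) {
        crs.base = reserved.base;    /* only lower or preserve the base */
    }
    if (reserved.limit > crs.limit) {
        crs.limit = reserved.limit;  /* only raise or preserve the limit */
    }

    /* The exposed window is always a superset of the firmware's BARs. */
    assert(crs.base <= fw.base && crs.limit >= fw.limit);
    return crs;
}

Because the window can only widen, any BAR placement the firmware already
performed remains valid inside the _CRS range the guest OS sees.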