Hi Laszlo, On 4/2/19 12:33 PM, Laszlo Ersek wrote: > On 04/02/19 09:42, Igor Mammedov wrote: >> On Mon, 1 Apr 2019 15:07:05 +0200 >> Laszlo Ersek <ler...@redhat.com> wrote: >> >>> On 03/29/19 14:56, Auger Eric wrote: >>>> Hi Ard, >>>> >>>> On 3/29/19 2:14 PM, Ard Biesheuvel wrote: >>>>> On Fri, 29 Mar 2019 at 14:12, Auger Eric <eric.au...@redhat.com> wrote: >>>>>> >>>>>> Hi Shameer, >>>>>> >>>>>> On 3/29/19 10:59 AM, Shameerali Kolothum Thodi wrote: >>>>>>> >>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Auger Eric [mailto:eric.au...@redhat.com] >>>>>>>> Sent: 29 March 2019 09:32 >>>>>>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.th...@huawei.com>; >>>>>>>> qemu-devel@nongnu.org; qemu-...@nongnu.org; imamm...@redhat.com; >>>>>>>> peter.mayd...@linaro.org; shannon.zha...@gmail.com; >>>>>>>> sa...@linux.intel.com; sebastien.bo...@intel.com >>>>>>>> Cc: Linuxarm <linux...@huawei.com>; xuwei (O) <xuw...@huawei.com>; >>>>>>>> Laszlo Ersek <ler...@redhat.com>; Ard Biesheuvel >>>>>>>> <ard.biesheu...@linaro.org>; Leif Lindholm <leif.lindh...@arm.com> >>>>>>>> Subject: Re: [PATCH v3 07/10] hw/arm/virt: Introduce opt-in feature >>>>>>>> "fdt" >>>>>>>> >>>>>>>> Hi Shameer, >>>>>>>> >>>>>>>> [ + Laszlo, Ard, Leif ] >>>>>>>> >>>>>>>> On 3/21/19 11:47 AM, Shameer Kolothum wrote: >>>>>>>>> This is to disable/enable populating DT nodes in case >>>>>>>>> any conflict with acpi tables. The default is "off". >>>>>>>> The name of the option sounds misleading to me. Also we don't really >>>>>>>> know the scope of the disablement. At the moment this just aims to >>>>>>>> prevent the hotpluggable dt nodes from being added if we boot in ACPI >>>>>>>> mode. >>>>>>>> >>>>>>>>> >>>>>>>>> This will be used in subsequent patch where cold plug >>>>>>>>> device-memory support is added for DT boot. >>>>>>>> I am concerned about the fact that in dt mode, by default, you won't >>>>>>>> see >>>>>>>> any PCDIMM nodes. >>>>>>>>> >>>>>>>>> If DT memory node support is added for cold-plugged device >>>>>>>>> memory, those memory will be visible to Guest kernel via >>>>>>>>> UEFI GetMemoryMap() and gets treated as early boot memory. >>>>>>>> Don't we have an issue in UEFI then. Normally the SRAT indicates >>>>>>>> whether >>>>>>>> the slots are hotpluggable or not. Shouldn't the UEFI code look at this >>>>>>>> info. >>>>>>> >>>>>>> Sorry I missed this part. Yes, that will be a more cleaner solution. >>>>>>> >>>>>>> Also, to be more clear on what happens, >>>>>>> >>>>>>> Guest ACPI boot with "fdt=on" , >>>>>>> >>>>>>> From kernel log, >>>>>>> >>>>>>> [ 0.000000] Early memory node ranges >>>>>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x00000000bbf5ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bbf60000-0x00000000bbffffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc000000-0x00000000bc02ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc030000-0x00000000bc36ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc370000-0x00000000bf64ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf650000-0x00000000bf6dffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf6e0000-0x00000000bf6effff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf6f0000-0x00000000bf80ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf810000-0x00000000bfffffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000c0000000-0x00000000ffffffff] >>>>>>> --> This is the hotpluggable memory node from DT. >>>>>>> [ 0.000000] Zeroed struct page in unavailable ranges: 1040 pages >>>>>>> [ 0.000000] Initmem setup node 0 [mem >>>>>>> 0x0000000040000000-0x00000000ffffffff] >>>>>>> >>>>>>> >>>>>>> Guest ACPI boot with "fdt=off" , >>>>>>> >>>>>>> [ 0.000000] Movable zone start for each node >>>>>>> [ 0.000000] Early memory node ranges >>>>>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x00000000bbf5ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bbf60000-0x00000000bbffffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc000000-0x00000000bc02ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc030000-0x00000000bc36ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bc370000-0x00000000bf64ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf650000-0x00000000bf6dffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf6e0000-0x00000000bf6effff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf6f0000-0x00000000bf80ffff] >>>>>>> [ 0.000000] node 0: [mem 0x00000000bf810000-0x00000000bfffffff] >>>>>>> [ 0.000000] Zeroed struct page in unavailable ranges: 1040 pages >>>>>>> [ 0.000000] Initmem setup node 0 [mem >>>>>>> 0x0000000040000000-0x00000000bfffffff >>>>>>> >>>>>>> The hotpluggable memory node is absent from early memory nodes here. >>>>>> >>>>>> OK thank you for the example illustrating the concern. >>>>>>> >>>>>>> As you said, it could be possible to detect this node using SRAT in >>>>>>> UEFI. >>>>>> >>>>>> Let's wait for EDK2 experts on this. >>>>>> >>>>> >>>>> Happy to chime in, but I need a bit more context here. >>>>> >>>>> What is the problem, how does this path try to solve it, and why is >>>>> that a bad idea? >>>>> >>>> Sure, sorry. >>>> >>>> This series: >>>> - [PATCH v3 00/10] ARM virt: ACPI memory hotplug support, >>>> https://patchwork.kernel.org/cover/10863301/ >>>> >>>> aims to introduce PCDIMM support in qemu. In ACPI mode, it builds the >>>> SRAT and DSDT parts and relies on GED to trigger the hotplug. >>>> >>>> We noticed that if we build the hotpluggable memory dt nodes on top of >>>> the above ACPI tables, the DIMM slots are interpreted as not >>>> hotpluggable memory slots (at least we think so). >>>> >>>> We think the EDK2 GetMemoryMap() uses the dt node info and ignores the >>>> fact that those slots are exposed as hotpluggable in the SRAT for example. >>>> >>>> So in this series, we are forced to not generate the hotpluggable memory >>>> dt nodes if we want the DIMM slots to be effectively recognized as >>>> hotpluggable. >>>> >>>> Could you confirm we have a correct understanding of the EDK2 behaviour >>>> and if so, would there be any solution for EDK2 to absorb both the DT >>>> nodes and the relevant SRAT/DSDT tables and make the slots hotpluggable. >>>> >>>> At qemu level, detecting we are booting in ACPI mode and purposely >>>> removing the above mentioned DT nodes does not look straightforward. >>> >>> The firmware is not enlightened about the ACPI content that comes from >>> QEMU / fw_cfg. That ACPI content is *blindly* processed by the firmware, >>> as instructed through the ACPI linker/loader script, in order to install >>> the ACPI content for the OS. No actual information is consumed by the >>> firmware from the ACPI payload -- and that's a feature. >>> >>> The firmware does consume DT: >>> >>> - If you start QEMU *with* "-no-acpi", then the DT is both consumed by >>> the firmware (for its own information needs), and passed on to the OS. >>> >>> - If you start QEMU *without* "-no-acpi" (the default), then the DT is >>> consumed only by the firmware (for its own information needs), and the >>> DT is hidden from the OS. The OS gets only the ACPI content >>> (processed/prepared as described above). I am confused by the above statement actually. In the above case what does happen if you pass the acpi=off in the kernel boot parameters?
Thanks Eric >>> >>> >>> In the firmware, the "ArmVirtPkg/HighMemDxe" driver iterates over the >>> base/size pairs in all the memory nodes in the DT. For each such base >>> address that is currently tracked as "nonexistent" in the GCD memory >>> space map, the driver currently adds the base/size range as "system >>> memory". This in turn is reflected by the UEFI memmap that the OS gets >>> to see as "conventional memory". >>> >>> If you need some memory ranges to show up as "special" in the UEFI >>> memmap, then you need to distinguish them somehow from the "regular" >>> memory areas, in the DT. And then extend "ArmVirtPkg/HighMemDxe" in the >>> firmware, so that it act upon the discriminator that you set in the DT. >>> >>> >>> Now... from a brief look at the Platform Init and UEFI specs, my >>> impression is that the hotpluggable (but presently not plugged) DIMM >>> ranges should simply be *absent* from the UEFI memmap; is that correct? >>> (I didn't check the ACPI spec, maybe it specifies the expected behavior >>> in full.) If my impression is correct, then two options (alternatives) >>> exist: >>> >>> (1) Hide the affected memory nodes -- or at least the affected base/size >>> pairs -- from the DT, in case you boot without "-no-acpi" but with an >>> external firmware loaded. Then the firmware will not expose those ranges >>> as "conventional memory" in the UEFI memmap. This approach requires no >>> changes to edk2. >>> >>> This option is precisely what Eric described up-thread, at >>> <3f0a5793-dd35-a497-2248-8eb0cd3c3a16@redhat.com">http://mid.mail-archive.com/3f0a5793-dd35-a497-2248-8eb0cd3c3a16@redhat.com>: >>> >>>> in machvirt_init, there is firmware_loaded that tells you whether you >>>> have a FW image. If this one is not set, you can induce dt. But if >>>> there is a FW it can be either DT or ACPI booted. You also have the >>>> acpi_enabled knob. >>> >>> (The "-no-acpi" cmdline option clears the "acpi_enabled" variable in >>> "vl.c"). >>> >>> So, the condition for hiding the hotpluggable memory nodes in question >>> from the DT is: >>> >>> (aarch64 && firmware_loaded && acpi_enabled) >> I'd go with this one, though I have a question for firmware side. >> Let's assume we would want in future to expose hotpluggable & present >> memory via GetMemoryMap() (like bare-metal does) (guest OS theoretically >> can avoid using it for Normal zone based on hint from SRAT table early >> at boot), but what about firmware can it inspect SRAT table and not use >> hotpluggable ranges for its own use (or at least do not canibalize >> them permanently)? > > This is actually two questions: > > (a) Can the firmware inspect SRAT? > > If the SRAT table structure isn't very complex, this is technically > doable, but the wrong thing to do, IMO. > > First, we've tried hard to avoid enlightening the firmware about the > semantics of QEMU's ACPI tables. > > Second, this would introduce an ordering constraint (or callbacks) in > the firmware, between the driver that processes & installs the ACPI > tables, and the driver that translates the memory nodes of the DT to the > memory ranges known to UEFI and the OS. > > If we need such hinting, then option (2) below (from earlier context) > would be better: > - If it's OK to use an arm/aarch64 specific solution, then new DT > properties should work. > - If it should be arch-independent, then a dedicated fw_cfg file would > be better. > > (b) Assuming we have the information from some source, can the firmware > expose some memory ranges as "usable RAM" to the OS, while staying away > from them for its own (firmware) purposes? > > After consulting > > Table 25. Memory Type Usage before ExitBootServices() > Table 26. Memory Type Usage after ExitBootServices() > > in UEFI-2.7, I would say that the firmware driver that installs these > ranges to the memory (space) map should also allocate the ranges right > after, as EfiBootServicesData. This will prevent other drivers / > applications in the firmware from allocating chunks out of those areas, > and the OS will be at liberty to release and repurpose the ranges after > ExitBootServices(). > > Thanks, > Laszlo > >>> (2) Invent and set an "ignore me, firmware" property for the >>> hotpluggable memory nodes in the DT, and update the firmware to honor >>> that property. >>> >>> Thanks >>> Laszlo >> >