On Thu, 4 Oct 2018 15:16:13 +0100 "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> * Igor Mammedov (imamm...@redhat.com) wrote:
> > On Thu, 4 Oct 2018 13:32:26 +0200
> > Auger Eric <eric.au...@redhat.com> wrote:
> > 
> > > Hi Igor,
> > > 
> > > On 10/4/18 1:11 PM, Igor Mammedov wrote:
> > > > On Wed, 3 Oct 2018 15:49:03 +0200
> > > > Auger Eric <eric.au...@redhat.com> wrote:
> > > > 
> > > >> Hi,
> > > >> 
> > > >> On 7/3/18 9:19 AM, Eric Auger wrote:
> > > >>> This series aims at supporting PCDIMM/NVDIMM instantiation in
> > > >>> machvirt at 2TB guest physical address.
> > > >>> 
> > > >>> This is achieved in 3 steps:
> > > >>> 1) support more than 40b IPA/GPA
> > > >>> 2) support PCDIMM instantiation
> > > >>> 3) support NVDIMM instantiation
> > > >> 
> > > >> While respinning this series, some general questions came up when
> > > >> thinking about extending the RAM on mach-virt:
> > > >> 
> > > >> At the moment mach-virt offers at most 255GB of initial RAM,
> > > >> starting at 1GB (the "-m" option).
> > > >> 
> > > >> This series does not touch this initial RAM and only aims to add
> > > >> device memory (usable for PCDIMM, NVDIMM, virtio-mem, virtio-pmem)
> > > >> in the 3.1 machine, located at 2TB. The 3.0 address map top
> > > >> currently is at 1TB (legacy aarch32 LPAE limit), so this would
> > > >> leave 1TB for IO or PCI. Is that OK?
> > > >> 
> > > >> - Putting device memory at 2TB means only ARMv8/aarch64 would
> > > >> benefit from it, i.e. no device memory for ARMv7 or ARMv8/aarch32.
> > > >> Is that an issue? Do we need to put effort into supporting more
> > > >> memory and memory devices for those configs? There is less than
> > > >> 256GB free in the existing 1TB mach-virt memory map anyway.
> > > >> 
> > > >> - Is it OK to rely only on device memory to extend the existing
> > > >> 255GB RAM, or would we need additional initial memory? Device
> > > >> memory usage induces a more complex command line, so this puts a
> > > >> constraint on upper layers. Is that acceptable though?
> > > >> 
> > > >> - I revisited the series so that the max IPA size shift gets
> > > >> automatically computed from the top address reached by the device
> > > >> memory, i.e. 2TB + (maxram_size - ram_size); a small worked
> > > >> example follows this list. So we would not need any additional
> > > >> kvm-type or explicit vm-phys-shift option to select the correct
> > > >> max IPA shift (or any CPU phys-bits as suggested by Dave). This
> > > >> also assumes we don't put anything beyond the device memory. Is
> > > >> that OK?
> > > >> 
> > > >> - Igor told me he was concerned about the split-memory RAM model,
> > > >> as it caused a lot of trouble regarding compat/migration on the PC
> > > >> machine. After having studied the PC machine code, I now wonder
> > > >> whether we can compare the PC compat issues with the ones we could
> > > >> encounter on ARM with the proposed split memory model.
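> > > >> 
> > > >> For instance (illustrative numbers, assuming that scheme): with
> > > >> "-m 4G,maxmem=508G" the device memory would span
> > > >> [2TB, 2TB + 504GB], so the top address would be ~2.5TB and the
> > > >> computed max IPA shift would be 42 (a 4TB space), without any
> > > >> extra option on the command line.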
> > > > that's not the only issue.
> > > > 
> > > > For example, since initial memory isn't modeled as a device
> > > > (i.e. it's just a plain memory region), there is a bunch of numa
> > > > code to deal with it. If initial memory were replaced by pc-dimm,
> > > > we would drop some of it, and if we deprecated the old '-numa mem'
> > > > we should be able to drop most of it (the newer '-numa memdev'
> > > > maps directly onto the pc-dimm model).
> > > see my comment below.
> > > > 
> > > >> On PC there are many knobs to tune the RAM layout:
> > > >> - max_ram_below_4g tunes how much RAM we want below 4G
> > > >> - gigabyte_align forces a 3GB versus 3.5GB lowmem limit if
> > > >>   ram_size > max_ram_below_4g
> > > >> - plus the usual ram_size, which affects the rest of the initial ram
> > > >> - plus maxram_size and slots, which affect the size of the device
> > > >>   memory
> > > >> - the device memory is just behind the initial RAM, aligned to 1GB
> > > >> 
> > > >> Note the initial RAM and the device memory may be disjoint due to
> > > >> misalignment of the initial ram size against 1GB.
> > > >> 
> > > >> On ARM, we would have the 3.0 virt machine supporting only initial
> > > >> RAM from 1GB to 256GB. The 3.1 (or beyond ;-)) virt machine would
> > > >> support the same initial RAM plus device memory from 2TB to 4TB.
> > > >> 
> > > >> With that memory split and the different machine types, I don't
> > > >> see any major hurdle with respect to migration. Am I missing
> > > >> something?
> > > > Later on, someone with a need to punch holes in the fixed initial
> > > > RAM/device memory will start making it complex.
> > > Support of host reserved regions is not acked yet, but that's a
> > > valid argument.
> > > > 
> > > >> The alternative to the split model is having a floating RAM base
> > > >> for contiguous initial + device memory (contiguity actually
> > > >> depends on initial RAM size alignment too). This requires
> > > >> significant changes in FW and also potentially impacts the legacy
> > > >> virt address map, as we need to pass the floating RAM base address
> > > >> in some way (using an SRAM at 1GB, or using fw_cfg). Is it worth
> > > >> the effort? Also, Peter/Laszlo mentioned their reluctance to move
> > > >> the RAM earlier
> > > > Drew is working on it, let's see the outcome first.
> > > > 
> > > > We may actually try to implement a single region that uses pc-dimm
> > > > for all memory (including initial) and still be compatible with
> > > > the legacy layout, as long as legacy mode sticks to the current
> > > > RAM limit and the device memory region is put at the current RAM
> > > > base. When a flexible RAM base is available, we will move that
> > > > region to the non-legacy layout at 2TB (or wherever).
> > > 
> > > Oh, I did not understand that you wanted to also replace the initial
> > > memory by device memory. So we would switch from a purely static
> > > initial RAM setup to a purely dynamic device memory setup. That
> > > looks like quite a drastic change to me. As mentioned, I am
> > > concerned about complicating the qemu cmd line, and I asked the
> > > libvirt guys about the induced pain.
> > Converting initial ram to the memory device model beyond the current
> > limits, within a single RAM zone, is the reason why the flexible RAM
> > idea was brought in. That way we'd end up with a single way to
> > instantiate RAM (modeled after bare-metal machines) and the
> > possibility to use hotplug/nvdimm/... with initial RAM, without any
> > huge refactoring (+compat knobs) on top later.
> > 
> > The 2-regions solution is easier to hack together right now, if there
> > are more regions and we leave initial RAM as is (there is no point in
> > bothering with a flexible RAM base), but it won't lead us to uniform
> > RAM handling and won't simplify anything.
> > 
> > Considering the virt board doesn't have the compat RAM layout baggage
> > of x86, it only looks drastic; in reality it might turn out to be a
> > simple refactoring.
> > 
> > As for the complicated CLI: for compat reasons we will be forced to
> > keep supporting '-m size=!0', and we should be able to translate that
> > implicitly into a dimm. In addition, with dimms as initial memory,
> > users would have the choice to ditch "-numa (mem|memdev)" altogether
> > and do
> >   -m 0,slots=X,maxmem=Y -device pc-dimm,node=x...
> > and the related '-numa' options would become a compat shim translating
> > into a similar set of dimm devices under the hood.
> > (looks like too much fantasy :))
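> > 
> > Spelled out (purely hypothetical syntax; the '-m 0' form doesn't
> > exist today), a 4G guest with one extra dimm might then look like:
> > 
> >   -m 0,slots=2,maxmem=512G \
> >   -object memory-backend-ram,id=m0,size=4G \
> >   -device pc-dimm,id=dimm0,memdev=m0,node=0 \
> >   -object memory-backend-ram,id=m1,size=16G \
> >   -device pc-dimm,id=dimm1,memdev=m1,node=0
> > 
> > with the first dimm playing the role of today's initial RAM.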
> > 
> > Possible complications on the QEMU side I see in the handling of the
> > legacy '-numa mem'. Easiest would be to deprecate it and then do the
> > conversion, or to work around it by replacing it with a pc-dimm-like
> > device that's treated like the memory region we have now.
> And any migration compatibility issues with the naming of the
> RAMBlocks; if virt is at the point where it cares about that
> compatibility.
That's what I meant, let's remove migration altogether and make life
simpler :)
Jokes aside, the '-numa memdev' based variant isn't an issue: we would
map those memdevs to dimms, i.e. the RAMBlocks stay the same. But for
'-numa mem', or a numa-less '-m X', we would need to make up a way to
create RAMBlocks with the same ids.
If the whole ARM conversion turns out to be successful, it would be
less scary to do the same for x86/ppc/... and drop a bunch of ad hoc
numa code.
> 
> Dave
> 
> > > Thank you for your feedback
> > > 
> > > Eric
> > > > 
> > > >> (https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg03172.html).
> > > >> 
> > > >> Your feedback on those points is really welcome!
> > > >> 
> > > >> Thanks
> > > >> 
> > > >> Eric
> > > >> 
> > > >>> 
> > > >>> This series reuses/rebases patches initially submitted by
> > > >>> Shameer in [1] and Kwangwoo in [2].
> > > >>> 
> > > >>> I put all the parts together for consistency and due to
> > > >>> dependencies; however, as soon as the kernel dependency is
> > > >>> resolved, we can consider upstreaming them separately.
> > > >>> 
> > > >>> Support more than 40b IPA/GPA [ patches 1 - 5 ]
> > > >>> -----------------------------------------------
> > > >>> was "[RFC 0/6] KVM/ARM: Dynamic and larger GPA size"
> > > >>> 
> > > >>> At the moment the guest physical address space is limited to 40b
> > > >>> due to KVM limitations. [0] lifts this limitation and allows
> > > >>> creating a VM with up to a 52b GPA address space.
> > > >>> 
> > > >>> With this series, QEMU creates a virt VM with the max IPA range
> > > >>> reported by the host kernel, or 40b by default.
> > > >>> 
> > > >>> This choice can be overridden by using the -machine
> > > >>> kvm-type=<bits> option with bits within [40, 52]. If <bits> is
> > > >>> not supported by the host, the legacy 40b value is used.
> > > >>> 
> > > >>> Currently the EDK2 FW also hardcodes the max number of GPA bits
> > > >>> to 40. This will need to be fixed.
> > > >>> 
> > > >>> PCDIMM Support [ patches 6 - 11 ]
> > > >>> ---------------------------------
> > > >>> was "[RFC 0/5] ARM virt: Support PC-DIMM at 2TB"
> > > >>> 
> > > >>> We instantiate the device_memory at 2TB. Using it obviously
> > > >>> requires at least 42b of IPA/GPA. While its max capacity is
> > > >>> currently limited to 2TB, the actual size depends on the initial
> > > >>> guest RAM size and the maxmem parameter.
> > > >>> 
> > > >>> Actual hot-plug and hot-unplug of PC-DIMM are not supported due
> > > >>> to the lack of support for those features in baremetal.
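> > > >>> 
> > > >>> For instance, a cold-plugged DIMM could be requested with
> > > >>> something like the following (illustrative values, using the
> > > >>> kvm-type option proposed above; the exact syntax may change as
> > > >>> the series evolves):
> > > >>> 
> > > >>>   qemu-system-aarch64 -M virt,kvm-type=48 \
> > > >>>     -m 4G,slots=2,maxmem=508G \
> > > >>>     -object memory-backend-ram,id=mem1,size=16G \
> > > >>>     -device pc-dimm,id=dimm1,memdev=mem1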
> > > >>> 
> > > >>> NVDIMM support [ patches 12 - 15 ]
> > > >>> ----------------------------------
> > > >>> 
> > > >>> Once the memory hotplug framework is in place, it is fairly
> > > >>> straightforward to add support for NVDIMM. The machine "nvdimm"
> > > >>> option turns the capability on.
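> > > >>> 
> > > >>> For example (again illustrative, reusing the existing nvdimm
> > > >>> device syntax; the /tmp/nv0 backing file is hypothetical):
> > > >>> 
> > > >>>   qemu-system-aarch64 -M virt,nvdimm=on \
> > > >>>     -m 4G,slots=2,maxmem=508G \
> > > >>>     -object memory-backend-file,id=nv0,share=on,mem-path=/tmp/nv0,size=16G \
> > > >>>     -device nvdimm,id=nvdimm0,memdev=nv0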
> > > >>> 
> > > >>> Best Regards
> > > >>> 
> > > >>> Eric
> > > >>> 
> > > >>> References:
> > > >>> 
> > > >>> [0] [PATCH v3 00/20] arm64: Dynamic & 52bit IPA support
> > > >>>     https://www.spinics.net/lists/kernel/msg2841735.html
> > > >>> 
> > > >>> [1] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
> > > >>>     http://patchwork.ozlabs.org/cover/914694/
> > > >>> 
> > > >>> [2] [RFC PATCH 0/3] add nvdimm support on AArch64 virt platform
> > > >>>     https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04599.html
> > > >>> 
> > > >>> Tests:
> > > >>> - On a Cavium Gigabyte, a 48b VM was created.
> > > >>> - Migration tests were performed between a kernel supporting the
> > > >>>   feature and a destination kernel not supporting it.
> > > >>> - Test with ACPI: to overcome the limitation of the EDK2 FW, the
> > > >>>   virt memory map was hacked to move the device memory below 1TB.
> > > >>> 
> > > >>> This series can be found at:
> > > >>> https://github.com/eauger/qemu/tree/v2.12.0-dimm-2tb-v3
> > > >>> 
> > > >>> History:
> > > >>> 
> > > >>> v2 -> v3:
> > > >>> - fix pc_q35 and pc_piix compilation errors
> > > >>> - Kwangwoo's email address no longer being valid, remove it
> > > >>> 
> > > >>> v1 -> v2:
> > > >>> - kvm_get_max_vm_phys_shift moved into an arch-specific file
> > > >>> - addition of the NVDIMM part
> > > >>> - single series
> > > >>> - rebase on David's refactoring
> > > >>> 
> > > >>> v1:
> > > >>> - was "[RFC 0/6] KVM/ARM: Dynamic and larger GPA size"
> > > >>> - was "[RFC 0/5] ARM virt: Support PC-DIMM at 2TB"
> > > >>> 
> > > >>> Best Regards
> > > >>> 
> > > >>> Eric
> > > >>> 
> > > >>> 
> > > >>> Eric Auger (9):
> > > >>>   linux-headers: header update for KVM/ARM
> > > >>>     KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > > >>>   hw/boards: Add a MachineState parameter to kvm_type callback
> > > >>>   kvm: add kvm_arm_get_max_vm_phys_shift
> > > >>>   hw/arm/virt: support kvm_type property
> > > >>>   hw/arm/virt: handle max_vm_phys_shift conflicts on migration
> > > >>>   hw/arm/virt: Allocate device_memory
> > > >>>   acpi: move build_srat_hotpluggable_memory to generic ACPI source
> > > >>>   hw/arm/boot: Expose the pmem nodes in the DT
> > > >>>   hw/arm/virt: Add nvdimm and nvdimm-persistence options
> > > >>> 
> > > >>> Kwangwoo Lee (2):
> > > >>>   nvdimm: use configurable ACPI IO base and size
> > > >>>   hw/arm/virt: Add nvdimm hot-plug infrastructure
> > > >>> 
> > > >>> Shameer Kolothum (4):
> > > >>>   hw/arm/virt: Add memory hotplug framework
> > > >>>   hw/arm/boot: introduce fdt_add_memory_node helper
> > > >>>   hw/arm/boot: Expose the PC-DIMM nodes in the DT
> > > >>>   hw/arm/virt-acpi-build: Add PC-DIMM in SRAT
> > > >>> 
> > > >>>  accel/kvm/kvm-all.c                            |   2 +-
> > > >>>  default-configs/arm-softmmu.mak                |   4 +
> > > >>>  hw/acpi/aml-build.c                            |  51 ++++
> > > >>>  hw/acpi/nvdimm.c                               |  28 ++-
> > > >>>  hw/arm/boot.c                                  | 123 +++++++--
> > > >>>  hw/arm/virt-acpi-build.c                       |  10 +
> > > >>>  hw/arm/virt.c                                  | 330 ++++++++++++++++++++++---
> > > >>>  hw/i386/acpi-build.c                           |  49 ----
> > > >>>  hw/i386/pc_piix.c                              |   8 +-
> > > >>>  hw/i386/pc_q35.c                               |   8 +-
> > > >>>  hw/ppc/mac_newworld.c                          |   2 +-
> > > >>>  hw/ppc/mac_oldworld.c                          |   2 +-
> > > >>>  hw/ppc/spapr.c                                 |   2 +-
> > > >>>  include/hw/acpi/aml-build.h                    |   3 +
> > > >>>  include/hw/arm/arm.h                           |   2 +
> > > >>>  include/hw/arm/virt.h                          |   7 +
> > > >>>  include/hw/boards.h                            |   2 +-
> > > >>>  include/hw/mem/nvdimm.h                        |  12 +
> > > >>>  include/standard-headers/linux/virtio_config.h |  16 +-
> > > >>>  linux-headers/asm-mips/unistd.h                |  18 +-
> > > >>>  linux-headers/asm-powerpc/kvm.h                |   1 +
> > > >>>  linux-headers/linux/kvm.h                      |  16 ++
> > > >>>  target/arm/kvm.c                               |   9 +
> > > >>>  target/arm/kvm_arm.h                           |  16 ++
> > > >>>  24 files changed, 597 insertions(+), 124 deletions(-)
> > > >>> 
> > > >> 
> > > > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK