On Thu, 4 Oct 2018 15:16:13 +0100 "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> * Igor Mammedov (imamm...@redhat.com) wrote:
> > On Thu, 4 Oct 2018 13:32:26 +0200
> > Auger Eric <eric.au...@redhat.com> wrote:
> > 
> > > Hi Igor,
> > > 
> > > On 10/4/18 1:11 PM, Igor Mammedov wrote:
> > > > On Wed, 3 Oct 2018 15:49:03 +0200
> > > > Auger Eric <eric.au...@redhat.com> wrote:
> > > > 
> > > >> Hi,
> > > >> 
> > > >> On 7/3/18 9:19 AM, Eric Auger wrote:
> > > >>> This series aims at supporting PCDIMM/NVDIMM instantiation in
> > > >>> machvirt at 2TB guest physical address.
> > > >>> 
> > > >>> This is achieved in 3 steps:
> > > >>> 1) support more than 40b IPA/GPA
> > > >>> 2) support PCDIMM instantiation
> > > >>> 3) support NVDIMM instantiation
> > > >> 
> > > >> While respinning this series, some general questions came up when
> > > >> thinking about extending the RAM on mach-virt:
> > > >> 
> > > >> At the moment mach-virt offers at most 255GB of initial RAM,
> > > >> starting at 1GB (the "-m" option).
> > > >> 
> > > >> This series does not touch this initial RAM and only aims to add
> > > >> device memory (usable for PCDIMM, NVDIMM, virtio-mem, virtio-pmem)
> > > >> in the 3.1 machine, located at 2TB. The 3.0 address map top
> > > >> currently is at 1TB (legacy aarch32 LPAE limit), so this would
> > > >> leave 1TB for IO or PCI. Is that OK?
> > > >> 
> > > >> - Putting device memory at 2TB means only ARMv8/aarch64 would
> > > >> benefit from it, i.e. no device memory for ARMv7 or ARMv8/aarch32.
> > > >> Is that an issue? Do we need to put effort into supporting more
> > > >> memory and memory devices for those configs? There is less than
> > > >> 256GB free in the existing 1TB mach-virt memory map anyway.
> > > >> 
> > > >> - Is it OK to rely only on device memory to extend the existing
> > > >> 255GB RAM, or would we need additional initial memory? Device
> > > >> memory usage induces a more complex command line, so this puts a
> > > >> constraint on upper layers. Is that acceptable though?
> > > >> 
> > > >> - I revisited the series so that the max IPA size shift gets
> > > >> automatically computed from the top address reached by the device
> > > >> memory, i.e. 2TB + (maxram_size - ram_size); a small worked
> > > >> example follows this list. So we would not need any additional
> > > >> kvm-type or explicit vm-phys-shift option to select the correct
> > > >> max IPA shift (or any CPU phys-bits as suggested by Dave). This
> > > >> also assumes we don't put anything beyond the device memory. Is
> > > >> that OK?
> > > >> 
> > > >> - Igor told me he was concerned about the split-memory RAM model,
> > > >> as it caused a lot of trouble regarding compat/migration on the PC
> > > >> machine. After having studied the PC machine code, I now wonder
> > > >> whether we can compare the PC compat issues with the ones we could
> > > >> encounter on ARM with the proposed split memory model.
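> > > >> 
> > > >> For instance (illustrative numbers, assuming that scheme): with
> > > >> "-m 4G,maxmem=508G" the device memory would span
> > > >> [2TB, 2TB + 504GB], so the top address would be ~2.5TB and the
> > > >> computed max IPA shift would be 42 (a 4TB space), without any
> > > >> extra option on the command line.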
> > > > that's not the only issue.
> > > > 
> > > > For example, since initial memory isn't modeled as a device
> > > > (i.e. it's just a plain memory region), there is a bunch of numa
> > > > code to deal with it. If initial memory were replaced by pc-dimm,
> > > > we would drop some of it, and if we deprecated the old '-numa mem'
> > > > we should be able to drop most of it (the newer '-numa memdev'
> > > > maps directly onto the pc-dimm model).
> > > see my comment below.
> > > > 
> > > >> On PC there are many knobs to tune the RAM layout:
> > > >> - max_ram_below_4g tunes how much RAM we want below 4G
> > > >> - gigabyte_align forces a 3GB versus 3.5GB lowmem limit if
> > > >>   ram_size > max_ram_below_4g
> > > >> - plus the usual ram_size, which affects the rest of the initial ram
> > > >> - plus maxram_size and slots, which affect the size of the device
> > > >>   memory
> > > >> - the device memory is just behind the initial RAM, aligned to 1GB
> > > >> 
> > > >> Note the initial RAM and the device memory may be disjoint due to
> > > >> misalignment of the initial ram size against 1GB.
> > > >> 
> > > >> On ARM, we would have the 3.0 virt machine supporting only initial
> > > >> RAM from 1GB to 256GB. The 3.1 (or beyond ;-)) virt machine would
> > > >> support the same initial RAM plus device memory from 2TB to 4TB.
> > > >> 
> > > >> With that memory split and the different machine types, I don't
> > > >> see any major hurdle with respect to migration. Am I missing
> > > >> something?
> > > > Later on, someone with a need to punch holes in the fixed initial
> > > > RAM/device memory will start making it complex.
> > > Support of host reserved regions is not acked yet, but that's a
> > > valid argument.
> > > > 
> > > >> The alternative to the split model is having a floating RAM base
> > > >> for contiguous initial + device memory (contiguity actually
> > > >> depends on initial RAM size alignment too). This requires
> > > >> significant changes in FW and also potentially impacts the legacy
> > > >> virt address map, as we need to pass the floating RAM base address
> > > >> in some way (using an SRAM at 1GB, or using fw_cfg). Is it worth
> > > >> the effort? Also, Peter/Laszlo mentioned their reluctance to move
> > > >> the RAM earlier
> > > > Drew is working on it, let's see the outcome first.
> > > > 
> > > > We may actually try to implement a single region that uses pc-dimm
> > > > for all memory (including initial) and still be compatible with
> > > > the legacy layout, as long as legacy mode sticks to the current
> > > > RAM limit and the device memory region is put at the current RAM
> > > > base. When a flexible RAM base is available, we will move that
> > > > region to the non-legacy layout at 2TB (or wherever).
> > > 
> > > Oh, I did not understand that you wanted to also replace the initial
> > > memory by device memory. So we would switch from a purely static
> > > initial RAM setup to a purely dynamic device memory setup. That
> > > looks like quite a drastic change to me. As mentioned, I am
> > > concerned about complicating the qemu cmd line, and I asked the
> > > libvirt guys about the induced pain.
> > Converting initial ram to the memory device model beyond the current
> > limits, within a single RAM zone, is the reason why the flexible RAM
> > idea was brought in. That way we'd end up with a single way to
> > instantiate RAM (modeled after bare-metal machines) and the
> > possibility to use hotplug/nvdimm/... with initial RAM, without any
> > huge refactoring (+compat knobs) on top later.
> > 
> > The 2-regions solution is easier to hack together right now, if there
> > are more regions and we leave initial RAM as is (there is no point in
> > bothering with a flexible RAM base), but it won't lead us to uniform
> > RAM handling and won't simplify anything.
> > 
> > Considering the virt board doesn't have the compat RAM layout baggage
> > of x86, it only looks drastic; in reality it might turn out to be a
> > simple refactoring.
> > 
> > As for the complicated CLI: for compat reasons we will be forced to
> > keep supporting '-m size=!0', and we should be able to translate that
> > implicitly into a dimm. In addition, with dimms as initial memory,
> > users would have the choice to ditch "-numa (mem|memdev)" altogether
> > and do
> >   -m 0,slots=X,maxmem=Y -device pc-dimm,node=x...
> > and the related '-numa' options would become a compat shim translating
> > into a similar set of dimm devices under the hood.
> > (looks like too much fantasy :))
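> > 
> > Spelled out (purely hypothetical syntax; the '-m 0' form doesn't
> > exist today), a 4G guest with one extra dimm might then look like:
> > 
> >   -m 0,slots=2,maxmem=512G \
> >   -object memory-backend-ram,id=m0,size=4G \
> >   -device pc-dimm,id=dimm0,memdev=m0,node=0 \
> >   -object memory-backend-ram,id=m1,size=16G \
> >   -device pc-dimm,id=dimm1,memdev=m1,node=0
> > 
> > with the first dimm playing the role of today's initial RAM.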
> > 
> > Possible complications on the QEMU side I see in the handling of the
> > legacy '-numa mem'. Easiest would be to deprecate it and then do the
> > conversion, or to work around it by replacing it with a pc-dimm-like
> > device that's treated like the memory region we have now.
> And any migration compatibility issues with the naming of the
> RAMBlocks; if virt is at the point where it cares about that
> compatibility.
That's what I meant, let's remove migration altogether and make life
simpler :)
Jokes aside, the '-numa memdev' based variant isn't an issue: we would
map those memdevs to dimms, i.e. the RAMBlocks stay the same. But for
'-numa mem', or a numa-less '-m X', we would need to make up a way to
create RAMBlocks with the same ids.
If the whole ARM conversion turns out to be successful, it would be
less scary to do the same for x86/ppc/... and drop a bunch of ad hoc
numa code.
> 
> Dave
> 
> > > Thank you for your feedback
> > > 
> > > Eric
> > > > 
> > > >> (https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg03172.html).
> > > >> 
> > > >> Your feedback on those points is really welcome!
> > > >> 
> > > >> Thanks
> > > >> 
> > > >> Eric
> > > >> 
> > > >>> 
> > > >>> This series reuses/rebases patches initially submitted by
> > > >>> Shameer in [1] and Kwangwoo in [2].
> > > >>> 
> > > >>> I put all the parts together for consistency and due to
> > > >>> dependencies; however, as soon as the kernel dependency is
> > > >>> resolved, we can consider upstreaming them separately.
> > > >>> 
> > > >>> Support more than 40b IPA/GPA [ patches 1 - 5 ]
> > > >>> -----------------------------------------------
> > > >>> was "[RFC 0/6] KVM/ARM: Dynamic and larger GPA size"
> > > >>> 
> > > >>> At the moment the guest physical address space is limited to 40b
> > > >>> due to KVM limitations. [0] lifts this limitation and allows
> > > >>> creating a VM with up to a 52b GPA address space.
> > > >>> 
> > > >>> With this series, QEMU creates a virt VM with the max IPA range
> > > >>> reported by the host kernel, or 40b by default.
> > > >>> 
> > > >>> This choice can be overridden by using the -machine
> > > >>> kvm-type=<bits> option with bits within [40, 52]. If <bits> is
> > > >>> not supported by the host, the legacy 40b value is used.
> > > >>> 
> > > >>> Currently the EDK2 FW also hardcodes the max number of GPA bits
> > > >>> to 40. This will need to be fixed.
> > > >>> 
> > > >>> PCDIMM Support [ patches 6 - 11 ]
> > > >>> ---------------------------------
> > > >>> was "[RFC 0/5] ARM virt: Support PC-DIMM at 2TB"
> > > >>> 
> > > >>> We instantiate the device_memory at 2TB. Using it obviously
> > > >>> requires at least 42b of IPA/GPA. While its max capacity is
> > > >>> currently limited to 2TB, the actual size depends on the initial
> > > >>> guest RAM size and the maxmem parameter.
> > > >>> 
> > > >>> Actual hot-plug and hot-unplug of PC-DIMM are not supported due
> > > >>> to the lack of support for those features in baremetal.
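> > > >>> 
> > > >>> For instance, a cold-plugged DIMM could be requested with
> > > >>> something like the following (illustrative values, using the
> > > >>> kvm-type option proposed above; the exact syntax may change as
> > > >>> the series evolves):
> > > >>> 
> > > >>>   qemu-system-aarch64 -M virt,kvm-type=48 \
> > > >>>     -m 4G,slots=2,maxmem=508G \
> > > >>>     -object memory-backend-ram,id=mem1,size=16G \
> > > >>>     -device pc-dimm,id=dimm1,memdev=mem1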
> > > >>> 
> > > >>> NVDIMM support [ patches 12 - 15 ]
> > > >>> ----------------------------------
> > > >>> 
> > > >>> Once the memory hotplug framework is in place, it is fairly
> > > >>> straightforward to add support for NVDIMM. The machine "nvdimm"
> > > >>> option turns the capability on.
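> > > >>> 
> > > >>> For example (again illustrative, reusing the existing nvdimm
> > > >>> device syntax; the /tmp/nv0 backing file is hypothetical):
> > > >>> 
> > > >>>   qemu-system-aarch64 -M virt,nvdimm=on \
> > > >>>     -m 4G,slots=2,maxmem=508G \
> > > >>>     -object memory-backend-file,id=nv0,share=on,mem-path=/tmp/nv0,size=16G \
> > > >>>     -device nvdimm,id=nvdimm0,memdev=nv0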
> > > >>> 
> > > >>> Best Regards
> > > >>> 
> > > >>> Eric
> > > >>> 
> > > >>> References:
> > > >>> 
> > > >>> [0] [PATCH v3 00/20] arm64: Dynamic & 52bit IPA support
> > > >>>     https://www.spinics.net/lists/kernel/msg2841735.html
> > > >>> 
> > > >>> [1] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
> > > >>>     http://patchwork.ozlabs.org/cover/914694/
> > > >>> 
> > > >>> [2] [RFC PATCH 0/3] add nvdimm support on AArch64 virt platform
> > > >>>     https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04599.html
> > > >>> 
> > > >>> Tests:
> > > >>> - On a Cavium Gigabyte, a 48b VM was created.
> > > >>> - Migration tests were performed between a kernel supporting the
> > > >>>   feature and a destination kernel not supporting it.
> > > >>> - Test with ACPI: to overcome the limitation of the EDK2 FW, the
> > > >>>   virt memory map was hacked to move the device memory below 1TB.
> > > >>> 
> > > >>> This series can be found at:
> > > >>> https://github.com/eauger/qemu/tree/v2.12.0-dimm-2tb-v3
> > > >>> 
> > > >>> History:
> > > >>> 
> > > >>> v2 -> v3:
> > > >>> - fix pc_q35 and pc_piix compilation errors
> > > >>> - Kwangwoo's email address no longer being valid, remove it
> > > >>> 
> > > >>> v1 -> v2:
> > > >>> - kvm_get_max_vm_phys_shift moved into an arch-specific file
> > > >>> - addition of the NVDIMM part
> > > >>> - single series
> > > >>> - rebase on David's refactoring
> > > >>> 
> > > >>> v1:
> > > >>> - was "[RFC 0/6] KVM/ARM: Dynamic and larger GPA size"
> > > >>> - was "[RFC 0/5] ARM virt: Support PC-DIMM at 2TB"
> > > >>> 
> > > >>> Best Regards
> > > >>> 
> > > >>> Eric
> > > >>> 
> > > >>> 
> > > >>> Eric Auger (9):
> > > >>>   linux-headers: header update for KVM/ARM
> > > >>>     KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > > >>>   hw/boards: Add a MachineState parameter to kvm_type callback
> > > >>>   kvm: add kvm_arm_get_max_vm_phys_shift
> > > >>>   hw/arm/virt: support kvm_type property
> > > >>>   hw/arm/virt: handle max_vm_phys_shift conflicts on migration
> > > >>>   hw/arm/virt: Allocate device_memory
> > > >>>   acpi: move build_srat_hotpluggable_memory to generic ACPI source
> > > >>>   hw/arm/boot: Expose the pmem nodes in the DT
> > > >>>   hw/arm/virt: Add nvdimm and nvdimm-persistence options
> > > >>> 
> > > >>> Kwangwoo Lee (2):
> > > >>>   nvdimm: use configurable ACPI IO base and size
> > > >>>   hw/arm/virt: Add nvdimm hot-plug infrastructure
> > > >>> 
> > > >>> Shameer Kolothum (4):
> > > >>>   hw/arm/virt: Add memory hotplug framework
> > > >>>   hw/arm/boot: introduce fdt_add_memory_node helper
> > > >>>   hw/arm/boot: Expose the PC-DIMM nodes in the DT
> > > >>>   hw/arm/virt-acpi-build: Add PC-DIMM in SRAT
> > > >>> 
> > > >>>  accel/kvm/kvm-all.c                            |   2 +-
> > > >>>  default-configs/arm-softmmu.mak                |   4 +
> > > >>>  hw/acpi/aml-build.c                            |  51 ++++
> > > >>>  hw/acpi/nvdimm.c                               |  28 ++-
> > > >>>  hw/arm/boot.c                                  | 123 +++++++--
> > > >>>  hw/arm/virt-acpi-build.c                       |  10 +
> > > >>>  hw/arm/virt.c                                  | 330 ++++++++++++++++++++++---
> > > >>>  hw/i386/acpi-build.c                           |  49 ----
> > > >>>  hw/i386/pc_piix.c                              |   8 +-
> > > >>>  hw/i386/pc_q35.c                               |   8 +-
> > > >>>  hw/ppc/mac_newworld.c                          |   2 +-
> > > >>>  hw/ppc/mac_oldworld.c                          |   2 +-
> > > >>>  hw/ppc/spapr.c                                 |   2 +-
> > > >>>  include/hw/acpi/aml-build.h                    |   3 +
> > > >>>  include/hw/arm/arm.h                           |   2 +
> > > >>>  include/hw/arm/virt.h                          |   7 +
> > > >>>  include/hw/boards.h                            |   2 +-
> > > >>>  include/hw/mem/nvdimm.h                        |  12 +
> > > >>>  include/standard-headers/linux/virtio_config.h |  16 +-
> > > >>>  linux-headers/asm-mips/unistd.h                |  18 +-
> > > >>>  linux-headers/asm-powerpc/kvm.h                |   1 +
> > > >>>  linux-headers/linux/kvm.h                      |  16 ++
> > > >>>  target/arm/kvm.c                               |   9 +
> > > >>>  target/arm/kvm_arm.h                           |  16 ++
> > > >>>  24 files changed, 597 insertions(+), 124 deletions(-)
> > > >>> 
> > > >> 
> > > > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK