On Tue, 14 Feb 2017 11:53:11 +0530 Jitendra Kolhe <jitendra.ko...@hpe.com> wrote:
> On 2/13/2017 9:22 PM, Jitendra Kolhe wrote:
> > On 2/13/2017 5:34 PM, Igor Mammedov wrote:
> >> On Mon, 13 Feb 2017 11:23:17 +0000
> >> "Daniel P. Berrange" <berra...@redhat.com> wrote:
> >>
> >>> On Mon, Feb 13, 2017 at 11:45:46AM +0100, Igor Mammedov wrote:
> >>>> On Mon, 13 Feb 2017 14:30:56 +0530
> >>>> Jitendra Kolhe <jitendra.ko...@hpe.com> wrote:
> >>>>
> >>>>> Using the "-mem-prealloc" option for a large guest leads to higher guest
> >>>>> start-up and migration time. This is because with "-mem-prealloc" QEMU
> >>>>> tries to map every guest page (create address translations) and make
> >>>>> sure the pages are available during runtime. virsh/libvirt, by default,
> >>>>> seems to use "-mem-prealloc" when the guest is configured to use huge
> >>>>> pages. This patch maps all guest pages simultaneously by spawning
> >>>>> multiple threads. The change is currently limited to the QEMU library
> >>>>> functions on POSIX-compliant hosts only, as we are not sure whether the
> >>>>> problem exists on win32. Below are some stats with "-mem-prealloc" for
> >>>>> a guest configured to use huge pages.
> >>>>>
> >>>>> ------------------------------------------------------------------------
> >>>>>  Idle Guest      | Start-up time | Migration time
> >>>>> ------------------------------------------------------------------------
> >>>>>  Guest stats with 2M HugePage usage - single threaded (existing code)
> >>>>> ------------------------------------------------------------------------
> >>>>>  64 Core - 4TB   | 54m11.796s    | 75m43.843s
> >>>>>  64 Core - 1TB   |  8m56.576s    | 14m29.049s
> >>>>>  64 Core - 256GB |  2m11.245s    |  3m26.598s
> >>>>> ------------------------------------------------------------------------
> >>>>>  Guest stats with 2M HugePage usage - map guest pages using 8 threads
> >>>>> ------------------------------------------------------------------------
> >>>>>  64 Core - 4TB   |  5m1.027s     | 34m10.565s
> >>>>>  64 Core - 1TB   |  1m10.366s    |  8m28.188s
> >>>>>  64 Core - 256GB |  0m19.040s    |  2m10.148s
> >>>>> ------------------------------------------------------------------------
> >>>>>  Guest stats with 2M HugePage usage - map guest pages using 16 threads
> >>>>> ------------------------------------------------------------------------
> >>>>>  64 Core - 4TB   |  1m58.970s    | 31m43.400s
> >>>>>  64 Core - 1TB   |  0m39.885s    |  7m55.289s
> >>>>>  64 Core - 256GB |  0m11.960s    |  2m0.135s
> >>>>> ------------------------------------------------------------------------
> >>>>>
> >>>>> Changed in v2:
> >>>>>  - modified the number of memset threads spawned to min(smp_cpus, 16).
> >>>>>  - removed the 64GB memory restriction for spawning memset threads.
> >>>>>
> >>>>> Signed-off-by: Jitendra Kolhe <jitendra.ko...@hpe.com>
> >>>>> ---
> >>>>>  backends/hostmem.c   |  4 ++--
> >>>>>  exec.c               |  2 +-
> >>>>>  include/qemu/osdep.h |  3 ++-
> >>>>>  util/oslib-posix.c   | 68 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>>>>  util/oslib-win32.c   |  3 ++-
> >>>>>  5 files changed, 69 insertions(+), 11 deletions(-)
> >>>>>
> >>>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
> >>>>> index 7f5de70..162c218 100644
> >>>>> --- a/backends/hostmem.c
> >>>>> +++ b/backends/hostmem.c
> >>>>> @@ -224,7 +224,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
> >>>>>          void *ptr = memory_region_get_ram_ptr(&backend->mr);
> >>>>>          uint64_t sz = memory_region_size(&backend->mr);
> >>>>>
> >>>>> -        os_mem_prealloc(fd, ptr, sz, &local_err);
> >>>>> +        os_mem_prealloc(fd, ptr, sz, smp_cpus, &local_err);
> >>>>>          if (local_err) {
> >>>>>              error_propagate(errp, local_err);
> >>>>>              return;
> >>>>> @@ -328,7 +328,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
> >>>>>           */
> >>>>>          if (backend->prealloc) {
> >>>>>              os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
> >>>>> -                            &local_err);
> >>>>> +                            smp_cpus, &local_err);
> >>>>>              if (local_err) {
> >>>>>                  goto out;
> >>>>>              }
> >>>>> diff --git a/exec.c b/exec.c
> >>>>> index 8b9ed73..53afcd2 100644
> >>>>> --- a/exec.c
> >>>>> +++ b/exec.c
> >>>>> @@ -1379,7 +1379,7 @@ static void *file_ram_alloc(RAMBlock *block,
> >>>>>      }
> >>>>>
> >>>>>      if (mem_prealloc) {
> >>>>> -        os_mem_prealloc(fd, area, memory, errp);
> >>>>> +        os_mem_prealloc(fd, area, memory, smp_cpus, errp);
> >>>>>          if (errp && *errp) {
> >>>>>              goto error;
> >>>>>          }
> >>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>> index 56c9e22..fb1d22b 100644
> >>>>> --- a/include/qemu/osdep.h
> >>>>> +++ b/include/qemu/osdep.h
> >>>>> @@ -401,7 +401,8 @@ unsigned long qemu_getauxval(unsigned long type);
> >>>>>
> >>>>>  void qemu_set_tty_echo(int fd, bool echo);
> >>>>>
> >>>>> -void os_mem_prealloc(int fd, char *area, size_t sz, Error **errp);
> >>>>> +void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> >>>>> +                     Error **errp);
> >>>>>
> >>>>>  int qemu_read_password(char *buf, int buf_size);
> >>>>>
> >>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>> index f631464..17da029 100644
> >>>>> --- a/util/oslib-posix.c
> >>>>> +++ b/util/oslib-posix.c
> >>>>> @@ -55,6 +55,16 @@
> >>>>>  #include "qemu/error-report.h"
> >>>>>  #endif
> >>>>>
> >>>>> +#define MAX_MEM_PREALLOC_THREAD_COUNT 16
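
[The quoted oslib-posix.c hunk is truncated above at the new MAX_MEM_PREALLOC_THREAD_COUNT define. For reference, a minimal, self-contained sketch of the threaded preallocation approach the commit message describes could look roughly like the following. This is not the patch's actual code; touch_pages, do_prealloc and struct touch_args are illustrative names only.]

    /* Split the mapped area into per-thread chunks and have each thread
     * touch every page in its chunk so the kernel populates the mapping
     * up front, instead of one thread walking the whole area. */
    #include <pthread.h>
    #include <stddef.h>

    struct touch_args {
        char *addr;        /* start of this thread's chunk */
        size_t numpages;   /* number of pages in this chunk */
        size_t hpagesize;  /* page (or huge page) size in bytes */
    };

    static void *touch_pages(void *arg)
    {
        struct touch_args *ta = arg;

        for (size_t i = 0; i < ta->numpages; i++) {
            /* Read and write back one byte per page to force allocation. */
            char *p = ta->addr + i * ta->hpagesize;
            *(volatile char *)p = *p;
        }
        return NULL;
    }

    static void do_prealloc(char *area, size_t numpages, size_t hpagesize,
                            int nthreads)
    {
        pthread_t threads[nthreads];
        struct touch_args args[nthreads];
        size_t per_thread = numpages / nthreads;

        for (int i = 0; i < nthreads; i++) {
            args[i].addr = area + i * per_thread * hpagesize;
            /* Last thread also takes any remainder pages. */
            args[i].numpages = (i == nthreads - 1)
                               ? numpages - i * per_thread : per_thread;
            args[i].hpagesize = hpagesize;
            pthread_create(&threads[i], NULL, touch_pages, &args[i]);
        }
        for (int i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
        }
    }

[os_mem_prealloc() would then presumably compute numpages from the mapping size and the page size, pick the thread count as discussed below, and call such a helper.]
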
> >>>> running with -smp 16 or bigger on a host with less than 16 cpus
> >>>> it would not be quite optimal.
> >>>> Why not change the MAX_MEM_PREALLOC_THREAD_COUNT constant to
> >>>> something like sysconf(_SC_NPROCESSORS_ONLN)?
> >>>
> >>> The point is to not consume more host resources than would otherwise
> >>> be consumed by running the guest CPUs. i.e., if running a KVM guest
> >>> with -smp 4 on a 16 CPU host, QEMU should not consume more than
> >>> 4 pCPUs' worth of resources on the host. Using sysconf would cause
> >>> QEMU to consume all host resources, likely harming other guests'
> >>> workloads.
> >>>
> >>> If the person launching QEMU gives a -smp value that's larger than
> >>> the host CPU count, then they've already accepted that they're
> >>> asking QEMU to do more than the host is really capable of. IOW, I
> >>> don't think we need to special-case memsetting for that, since
> >>> VCPU execution itself is already going to overcommit the host.
> >> Doing overcommit at preallocation time doesn't make much sense.
> >> If MAX_MEM_PREALLOC_THREAD_COUNT is replaced with
> >> sysconf(_SC_NPROCESSORS_ONLN),
> >> then QEMU will end up with MIN(-smp, sysconf(_SC_NPROCESSORS_ONLN)),
> >> which will put a cap on the upper value and avoid useless overcommit
> >> at preallocation time.
> >>
> >
> > I agree, we should consider the case where we run with -smp >= 16,
> > which is overcommitted on a host with < 16 cpus. At the same time we
> > should also be sure that we don't end up spawning too many memset
> > threads. For example, I have been running fat guests with -smp > 64
> > on hosts with 384 cpus.
> >
>
> How about putting a cap on MAX_MEM_PREALLOC_THREAD_COUNT of
> MIN(sysconf(_SC_NPROCESSORS_ONLN), 16)?
> The number of memset threads can then be calculated as
> MIN(smp_cpus, MAX_MEM_PREALLOC_THREAD_COUNT).

It looks ok to me.

>
> Thanks,
> - Jitendra
>
> > Thanks,
> > - Jitendra
> >
> >>> Regards,
> >>> Daniel
> >>
> >
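
[To make the cap agreed on above concrete, the memset thread count would be computed along these lines. This is a sketch only; get_memset_num_threads is an assumed name, not necessarily what a later revision of the patch uses.]

    #include <unistd.h>

    #define MAX_MEM_PREALLOC_THREAD_COUNT 16

    /* Bound the thread count by the guest's vCPU count, the number of
     * online host CPUs, and a hard limit of 16, whichever is smallest. */
    static int get_memset_num_threads(int smp_cpus)
    {
        long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
        int num = MAX_MEM_PREALLOC_THREAD_COUNT;

        if (host_procs > 0 && host_procs < num) {
            num = host_procs;
        }
        if (smp_cpus < num) {
            num = smp_cpus;
        }
        return num;
    }

[With -smp 64 on a 384-CPU host this yields 16 threads; with -smp 16 on a host with 8 CPUs it yields 8, matching the scenarios raised in the thread.]
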