On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote: > Let's sense support and use it for preallocation. MADV_POPULATE_WRITE > does not require a SIGBUS handler, doesn't actually touch page content, > and avoids context switches; it is, therefore, faster and easier to handle > than our current approach. > > While MADV_POPULATE_WRITE is, in general, faster than manual > prefaulting, and especially faster with 4k pages, there is still value in > prefaulting using multiple threads to speed up preallocation. > > More details on MADV_POPULATE_WRITE can be found in the Linux commit > 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault > page tables") and in the man page proposal [1]. > > [1] https://lkml.kernel.org/r/20210712083917.16361-1-da...@redhat.com > > This resolves the TODO in do_touch_pages(). > > In the future, we might want to look into using fallocate(), eventually > combined with MADV_POPULATE_READ, when dealing with shared file > mappings. > > Reviewed-by: Pankaj Gupta <pankaj.gu...@ionos.com> > Signed-off-by: David Hildenbrand <da...@redhat.com> > --- > include/qemu/osdep.h | 7 ++++ > util/oslib-posix.c | 88 +++++++++++++++++++++++++++++++++----------- > 2 files changed, 74 insertions(+), 21 deletions(-)
> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg) > return NULL; > } > > +static void *do_madv_populate_write_pages(void *arg) > +{ > + MemsetThread *memset_args = (MemsetThread *)arg; > + const size_t size = memset_args->numpages * memset_args->hpagesize; > + char * const addr = memset_args->addr; > + int ret; > + > + if (!size) { > + return NULL; > + } > + > + /* See do_touch_pages(). */ > + qemu_mutex_lock(&page_mutex); > + while (!threads_created_flag) { > + qemu_cond_wait(&page_cond, &page_mutex); > + } > + qemu_mutex_unlock(&page_mutex); > + > + ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE); > + if (ret) { > + memset_thread_failed = true; I'm wondering if this use of memset_thread_failed is sufficient. This is pre-existing from the current impl, and ends up being used to set the bool result of 'touch_all_pages'. The caller of that then does if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) { error_setg(errp, "os_mem_prealloc: Insufficient free host memory " "pages available to allocate guest RAM"); } this was reasonable with the old impl, because the only reason we ever see 'memset_thread_failed==true' is if we got SIGBUS due to ENOMEM. My concern is that madvise() has a bunch of possible errno codes returned on failure, and we're not distinguishing them. In the past this kind of thing has burnt us making failures hard to debug. Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno' Then, we can make 'touch_all_pages' have an 'Error **errp' parameter, and it can directly call error_setg_errno(errp, memset_thead_errno, ....some message...) when memset_thread_errno is non-zero, and thus we can remove the generic message from the caller of touch_all_pages. If you agree, it'd be best to refactor the existing code to use this pattern in an initial patch. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|