On Wed, 13 Apr 2016 13:52:44 +0200 Thomas Huth <th...@redhat.com> wrote:
> The balloon code currently calls madvise() with TARGET_PAGE_SIZE > as length parameter, and an address which is directly based on > the page address supplied by the guest. Since the virtio-balloon > protocol is always based on 4k based addresses/sizes, no matter > what the host and guest are using as page size, this has a couple > of issues which could even lead to data loss in the worst case. > > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than > 4k, we also destroy the 4k areas after the current one - which > might be wrong since the guest did not want free that area yet (in > case the guest used as smaller MMU page size than the hard-coded > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size > parameter for the madvise() call instead. > > Then, there's yet another problem: If the host page size is bigger > than the 4k balloon page size, we can not simply call madvise() on > each of the 4k balloon addresses that we get from the guest - since > the madvise() always evicts the whole host page, not only a 4k area! > So in this case, we've got to track the 4k fragments of a host page > and only call madvise(DONTNEED) when all fragments have been collected. > This of course only works fine if the guest sends consecutive 4k > fragments - which is the case in the most important scenarios that > I try to address here (like a ppc64 guest with 64k page size that > is running on a ppc64 host with 64k page size). In case the guest > uses a page size that is smaller than the host page size, we might > need to add some more additional logic here later to increase the > probability of being able to release memory, but at least the guest > should now not crash anymore due to unintentionally evicted pages. Yeah, this is a serious bug. I think the patch is basically ok, except for a couple of minor points noted below. It's a bit more complex than I'd generally like to apply during hard freeze, but given the seriousness of the problem, I think we should put this in for qemu-2.6. > > Signed-off-by: Thomas Huth <th...@redhat.com> > --- > I've tested this patch with both, a 4k page size ppc64 guest > and a 64k page size ppc64 guest on a 64k page size ppc64 host. > With this patch applied, I was not able to crash to crash the > guests anymore (the 4k guest easily crashes without this patch). > And looking at the output of the "free" command on the host, > the ballooning also still works as expected. > > hw/virtio/virtio-balloon.c | 68 > ++++++++++++++++++++++++++++++++++---- > include/hw/virtio/virtio-balloon.h | 3 ++ > 2 files changed, 65 insertions(+), 6 deletions(-) > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index c74101e..886faa8 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -35,13 +35,56 @@ > #include "hw/virtio/virtio-bus.h" > #include "hw/virtio/virtio-access.h" > > -static void balloon_page(void *addr, int deflate) > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1) > + > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate) > { > #if defined(__linux__) > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() || > - kvm_has_sync_mmu())) { > - qemu_madvise(addr, TARGET_PAGE_SIZE, > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > + size_t host_page_size; > + void *aligned_addr; > + > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && > !kvm_has_sync_mmu())) { > + return; > + } > + > + host_page_size = getpagesize(); I think you actually want getrampagesize() here (or rather some wrapper around that to handle the non-kvm cases properly) - to cover the case where the VM has been built using hugepages in the host. For the normal case there's already the qemu_real_host_page_size global, so you shouldn't need to call getpagesize() yourself. > + if (host_page_size <= BALLOON_PAGE_SIZE) { > + qemu_madvise(addr, BALLOON_PAGE_SIZE, > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > + return; > + } > + > + /* Host page size > ballon page size ==> use aligned host address */ > + aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1)); > + if (deflate) { > + /* MADV_WILLNEED is just a hint for the host kernel, the guest should > + * also be able to use the memory again without this call, so let's > + * only do it for the first, aligned fragment of a host page and > + * ignore it for the others. > + */ > + if (addr == aligned_addr) { > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED); > + } > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > + } else { > + const int max_frags = host_page_size / BALLOON_PAGE_SIZE; > + /* If we end up here, that means we want to evict balloon pages, but > + * the host's page size is bigger than the 4k pages from the balloon. > + * Since madvise() only works on the granularity of the host page > size, > + * we've got to collect all the 4k fragments from the guest first > + * before we can issue the MADV_DONTNEED call. > + */ > + if (aligned_addr != s->current_addr) { I'd suggest a (non fatal) error_report() here if s->current_addr != BALLOON_NO_CURRENT_PAGE. > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr = aligned_addr; > + } > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits); > + if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) { If you invert the sense of the bitmap you can just check for an all-zero bitmap here which might be a little simpler. > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED); > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > + } > } > #endif > } > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice > *vdev, VirtQueue *vq) > /* Using memory_region_get_ram_ptr is bending the rules a bit, > but > should be OK because we only want a single page. */ > addr = section.offset_within_region; > - balloon_page(memory_region_get_ram_ptr(section.mr) + addr, > + balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr, > !!(vq == s->dvq)); > memory_region_unref(section.mr); > } > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState > *dev, Error **errp) > s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output); > s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats); > > + if (getpagesize() > BALLOON_PAGE_SIZE) { > + s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE > + + sizeof(long) * 8 - 1) / 8; > + s->fragment_bits = g_malloc0(s->fragment_bits_size); > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > + } > + > reset_stats(s); > > register_savevm(dev, "virtio-balloon", -1, 1, > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState > *dev, Error **errp) > VirtIODevice *vdev = VIRTIO_DEVICE(dev); > VirtIOBalloon *s = VIRTIO_BALLOON(dev); > > + g_free(s->fragment_bits); > balloon_stats_destroy_timer(s); > qemu_remove_balloon_handler(s); > unregister_savevm(dev, "virtio-balloon", s); > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice > *vdev) > g_free(s->stats_vq_elem); > s->stats_vq_elem = NULL; > } > + > + if (s->fragment_bits) { > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > + } > } > > static void virtio_balloon_instance_init(Object *obj) > diff --git a/include/hw/virtio/virtio-balloon.h > b/include/hw/virtio/virtio-balloon.h > index 35f62ac..04b7c0c 100644 > --- a/include/hw/virtio/virtio-balloon.h > +++ b/include/hw/virtio/virtio-balloon.h > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon { > int64_t stats_last_update; > int64_t stats_poll_interval; > uint32_t host_features; > + void *current_addr; > + unsigned long *fragment_bits; > + int fragment_bits_size; > } VirtIOBalloon; > > #endif > -- > 1.8.3.1 > -- David Gibson <dgib...@redhat.com> Senior Software Engineer, Virtualization, Red Hat
pgpNOefxe4BMU.pgp
Description: OpenPGP digital signature