On Wed, 13 Apr 2016 16:15:25 +0300 "Michael S. Tsirkin" <m...@redhat.com> wrote:
> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote: > > The balloon code currently calls madvise() with TARGET_PAGE_SIZE > > as length parameter, and an address which is directly based on > > the page address supplied by the guest. Since the virtio-balloon > > protocol is always based on 4k based addresses/sizes, no matter > > what the host and guest are using as page size, this has a couple > > of issues which could even lead to data loss in the worst case. > > > > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that > > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than > > 4k, we also destroy the 4k areas after the current one - which > > might be wrong since the guest did not want free that area yet (in > > case the guest used as smaller MMU page size than the hard-coded > > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define > > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size > > parameter for the madvise() call instead. > > this makes absolute sense. > > > > Then, there's yet another problem: If the host page size is bigger > > than the 4k balloon page size, we can not simply call madvise() on > > each of the 4k balloon addresses that we get from the guest - since > > the madvise() always evicts the whole host page, not only a 4k area! > > Does it really round length up? > Wow, it does: > len = (len_in + ~PAGE_MASK) & PAGE_MASK; > > which seems to be undocumented, but has been there forever. > > > > So in this case, we've got to track the 4k fragments of a host page > > and only call madvise(DONTNEED) when all fragments have been collected. > > This of course only works fine if the guest sends consecutive 4k > > fragments - which is the case in the most important scenarios that > > I try to address here (like a ppc64 guest with 64k page size that > > is running on a ppc64 host with 64k page size). In case the guest > > uses a page size that is smaller than the host page size, we might > > need to add some more additional logic here later to increase the > > probability of being able to release memory, but at least the guest > > should now not crash anymore due to unintentionally evicted pages. > > > > Signed-off-by: Thomas Huth <th...@redhat.com> > > --- > > I've tested this patch with both, a 4k page size ppc64 guest > > and a 64k page size ppc64 guest on a 64k page size ppc64 host. > > With this patch applied, I was not able to crash to crash the > > guests anymore (the 4k guest easily crashes without this patch). > > And looking at the output of the "free" command on the host, > > the ballooning also still works as expected. > > > > hw/virtio/virtio-balloon.c | 68 > > ++++++++++++++++++++++++++++++++++---- > > include/hw/virtio/virtio-balloon.h | 3 ++ > > 2 files changed, 65 insertions(+), 6 deletions(-) > > > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > > index c74101e..886faa8 100644 > > --- a/hw/virtio/virtio-balloon.c > > +++ b/hw/virtio/virtio-balloon.c > > @@ -35,13 +35,56 @@ > > #include "hw/virtio/virtio-bus.h" > > #include "hw/virtio/virtio-access.h" > > > > -static void balloon_page(void *addr, int deflate) > > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1) > > + > > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate) > > { > > #if defined(__linux__) > > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() || > > - kvm_has_sync_mmu())) { > > - qemu_madvise(addr, TARGET_PAGE_SIZE, > > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > > + size_t host_page_size; > > + void *aligned_addr; > > + > > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && > > !kvm_has_sync_mmu())) { > > + return; > > + } > > + > > + host_page_size = getpagesize(); > > + if (host_page_size <= BALLOON_PAGE_SIZE) { > > + qemu_madvise(addr, BALLOON_PAGE_SIZE, > > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > > + return; > > + } > > + > > + /* Host page size > ballon page size ==> use aligned host address */ > > + aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1)); > > + if (deflate) { > > + /* MADV_WILLNEED is just a hint for the host kernel, the guest > > should > > + * also be able to use the memory again without this call, so let's > > + * only do it for the first, aligned fragment of a host page and > > + * ignore it for the others. > > + */ > > + if (addr == aligned_addr) { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED); > > + } > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } else { > > + const int max_frags = host_page_size / BALLOON_PAGE_SIZE; > > + /* If we end up here, that means we want to evict balloon pages, > > but > > + * the host's page size is bigger than the 4k pages from the > > balloon. > > + * Since madvise() only works on the granularity of the host page > > size, > > + * we've got to collect all the 4k fragments from the guest first > > + * before we can issue the MADV_DONTNEED call. > > + */ > > + if (aligned_addr != s->current_addr) { > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr = aligned_addr; > > + } > > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, > > s->fragment_bits); > > + if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) > > { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED); > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } > > } > > #endif > > } > > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice > > *vdev, VirtQueue *vq) > > /* Using memory_region_get_ram_ptr is bending the rules a bit, > > but > > should be OK because we only want a single page. */ > > addr = section.offset_within_region; > > - balloon_page(memory_region_get_ram_ptr(section.mr) + addr, > > + balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr, > > !!(vq == s->dvq)); > > memory_region_unref(section.mr); > > } > > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState > > *dev, Error **errp) > > s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output); > > s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats); > > > > + if (getpagesize() > BALLOON_PAGE_SIZE) { > > + s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE > > + + sizeof(long) * 8 - 1) / 8; > > + s->fragment_bits = g_malloc0(s->fragment_bits_size); > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } > > + > > reset_stats(s); > > > > register_savevm(dev, "virtio-balloon", -1, 1, > > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState > > *dev, Error **errp) > > VirtIODevice *vdev = VIRTIO_DEVICE(dev); > > VirtIOBalloon *s = VIRTIO_BALLOON(dev); > > > > + g_free(s->fragment_bits); > > balloon_stats_destroy_timer(s); > > qemu_remove_balloon_handler(s); > > unregister_savevm(dev, "virtio-balloon", s); > > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice > > *vdev) > > g_free(s->stats_vq_elem); > > s->stats_vq_elem = NULL; > > } > > + > > + if (s->fragment_bits) { > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr = BALLOON_NO_CURRENT_PAGE; > > + } > > } > > > > static void virtio_balloon_instance_init(Object *obj) > > diff --git a/include/hw/virtio/virtio-balloon.h > > b/include/hw/virtio/virtio-balloon.h > > index 35f62ac..04b7c0c 100644 > > --- a/include/hw/virtio/virtio-balloon.h > > +++ b/include/hw/virtio/virtio-balloon.h > > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon { > > int64_t stats_last_update; > > int64_t stats_poll_interval; > > uint32_t host_features; > > + void *current_addr; > > + unsigned long *fragment_bits; > > + int fragment_bits_size; > > } VirtIOBalloon; > > > > #endif > > -- > > 1.8.3.1 > > > It looks like fragment_bits would have to be migrated. Hmm.. do they? If we just ignore migration, isn't the worse that happens that we just keep one extra page allocated. > Which is a lot of complexity. > And work arounds for specific guest behaviour are really ugly. > There are patches on-list to maintain a balloon bitmap - > that will enable fixing it cleanly. > How about we just skip madvise if host page size is > balloon > page size, for 2.6? > > -- > MST -- David Gibson <dgib...@redhat.com> Senior Software Engineer, Virtualization, Red Hat
pgp1YYWlnbzad.pgp
Description: OpenPGP digital signature