On 20.07.20 21:13, Dr. David Alan Gilbert wrote: > (Copies in Dave Hildenbrand) > > * Peter Maydell (peter.mayd...@linaro.org) wrote: >> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devne...@gmail.com> wrote: >>> >>> From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001 >>> From: David Carlier <devne...@gmail.com> >>> Date: Sat, 18 Jul 2020 13:29:44 +0100 >>> Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS. >>> >>> with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling >>> is not accessible thus using posix_madvise here. >>> >>> Signed-off-by: David Carlier <devne...@gmail.com> >>> --- >>> exec.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/exec.c b/exec.c >>> index 6f381f98e2..0466a75b89 100644 >>> --- a/exec.c >>> +++ b/exec.c >>> @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, >>> uint64_t start, size_t length) >>> * fallocate'd away). >>> */ >>> #if defined(CONFIG_MADVISE) >>> +#if !defined(CONFIG_SOLARIS) >>> ret = madvise(host_startaddr, length, MADV_DONTNEED); >>> +#else >>> + /* >>> + * mmap and its caddr_t based api is not accessible >>> + * with _XOPEN_SOURCE set on illumos >>> + */ >>> + ret = posix_madvise(host_startaddr, length, >>> POSIX_MADV_DONTNEED); >>> +#endif >> >> Hi. I'm not sure this patch will do the right thing, because >> I don't think that Solaris's POSIX_MADV_DONTNEED provides >> the semantics that this QEMU function says it needs. The >> comment at the top of the function says: >> >> * Unmap pages of memory from start to start+length such that >> * they a) read as 0, b) Trigger whatever fault mechanism >> * the OS provides for postcopy. >> * The pages must be unmapped by the end of the function. > > This code has moved around a bit over it's life; joining the case > needed by balloon and the case needed by postcopy. > >> (Aside: the use of 'unmap' in this comment is a bit confusing, >> because it clearly doesn't mean 'unmap' if it wants read-as-0. >> And the reference to faults on postcopy is incomprehensible >> to me: if memory is read-as-0 it isn't going to fault.) > > I think because internally to Linux the behaviour is the same; > this causes the mapping to disappear from the TLB so it faults; > normally when reading the kernel resolves the fault and puts > a read-as-zero page there, except if userfault was enabled > for postcopy, in which case it gives us a kick and we service it. > >> Linux's madvise(MADV_DONTNEED) does guarantee us this >> read-as-zero behaviour. (It's a silly API choice that Linux >> put this behaviour behind madvise, which is supposed to be >> merely advisory, but that's how it is.) > > Yes, I don't think there's any equivalent to madvise > that guarantees anything. > >> The Solaris >> posix_madvise() manpage says it is merely advisory and >> doesn't affect the behaviour of accesses to the memory. >> >> If posix_madvise() behaviour was OK in this function, the >> right way to fix this would be to use qemu_madvise() >> instead, which already provides this "if host has >> madvise(), use it, otherwise use posix_madvise()" logic. >> But I suspect that the direct madvise() here is deliberate. > > Yes, but I can't remember the semantics fully - I think it was because > we needed the guarantee at this point (and even Linux's > posix madvise did something different??) > > I've got a note saying we didn't want to use > qemu_madvise because we wanted to be sure we didn't get > posix_madvise. > >> Side note: not sure the current code is correct for the >> BSDs either -- they have madvise() but don't provide >> Linux's really-read-as-zero guarantee for MADV_DONTNEED. >> So we should probably be doing something else there, and >> whatever that something-else is is probably also what >> Solaris wants. >> >> We use ram_block_discard_range() only in migration and >> in virtio-balloon and virtio-mem; I've cc'd some people >> who hopefully understand what the requirements on this >> function are and might have a view on what the not-Linux >> implementation should look like. (David Gilbert: git >> blame says you wrote this code :-))
virtio-mem depends on Linux (hw/virtio/Kconfig). I guess userfaultfd/postcopy is also not relevant in the context of SunOS. So what remains is virtio-balloon. virito-balloon ideally wants to discard the actual mapped pages to free up memory. When memory is re-accessed, a fresh page is faulted in (-> zero-page under Linux). Now, we already have other cases where it looks like "the balloon works" but it really doesn't. One example is using vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP, no memory is actually discarded. I agree that POSIX_MADV_DONTNEED is not a proper match - different guarantees. If SunOS cannot implement ram_block_discard_range() as documented, we should disable it. I would suggest using ram_block_discard_disable(true) when under SunOS early during QEMU startup. In addition, we might want to return directly in ram_block_dicard_range(). We might also want to make virito-balloon depend on !SubOS. -- Thanks, David / dhildenb