On Thu, Jun 22, 2017 at 01:14:58PM +0100, Dr. David Alan Gilbert wrote: > * Eduardo Habkost (ehabk...@redhat.com) wrote: > > The new flag will make qemu_ram_free() discard the contents of the > > block. It will be used to let QEMU be configured to avoid flushing file > > contents to disk when exiting. As MADV_REMOVE is not always supported, > > the new code will try MADV_NOTNEEDED in case MADV_REMOVE fails. > > I'd like to understand what semantics you're trying to achieve and thus > why you prefer REMOVE to DONTNEED. If you're trying to avoid changes > being written back then doesn't a DONTNEED get rid of any changes that > have yet to be written? Or are there changes that have already been > queued that REMOVE will kill off? >
Generally speaking, it look(ed) like REMOVE is a superset of DONTNEED: DONTNEED will free and zero pages only on anonymous private mappings; REMOVE will free resources and zero pages on additional cases. One case where I can think REMOVE would be useful is tmpfs when swapping is involved: with REMOVE, the host can drop swap contents or avoid writing memory contents to swap even if we are using a shared tmpfs mapping. Other filesystems might have similar cases where unnecessary I/O operations might be performed even after madvise(MADV_DONTNEED) is called. MADV_REMOVE lets us simply tell the kernel to drop the data. I'm CCing Zack Cornelius, who initially suggested MADV_REMOVE, in case he can describe more specific use cases. > If you're just trying to save-time in writeback, it's interesting to > note my requirement is that by the time I exit this function the > process of throwing away the memory contents must be complete; > I think your requirements are a lot lazier as to when it happens. This is a very good point. I was assuming that REMOVE is a superset of DONTNEED, but based on the manpage it doesn't seem to be guaranteed. Probably I shouldn't try to reuse ram_block_discard_range() and write a separate helper for madvise(MADV_REMOVE), as the requirements are different. > > The new flag will also indicate that ram_block_discard_range() can use > > MADV_REMOVE when discarding memory pages. I have considered calling > > MADV_REMOVE unconditionally (as destroying the RAM contents seems to be > > OK every time ram_block_discard_range() is called), but for safety I > > decided to restrict the new code to blocks having RAM_NONPERSISTENT set. > > The manpage on MADV_REMOVE is confusing; it says it doesn't work on Huge > TLB pages, but says it does work on anything that can do > FALLOC_FL_PUNCH_HOLE - which as far as I can tell hugetlbfs does. Yes, it's confusing. I need to do some testing to find out if HugeTLBFS supports MADV_REMOVE today. But my use case is just an optimization, so it won't be a big deal if it doesn't cover every case in the first version. > > I've got some code in my shared-postcopy world that has this function do > the following which is kind of similar: > > /* The logic here is messy; > * madvise DONTNEED fails for hugepages > * fallocate works on hugepages and shmem > */ > need_madvise = (rb->page_size == qemu_host_page_size) && > (rb->fd == -1 || !(rb->flags & RAM_SHARED)); > need_fallocate = rb->fd != -1; This looks safer to me. I was bothered by the missing check for (rb->fd != -1) in the current code. > if (ret == -1 && need_fallocate) { > #ifdef CONFIG_FALLOCATE_PUNCH_HOLE > ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | > FALLOC_FL_KEEP_SIZE, > start, length); > #endif > } > if (need_madvise && (!need_fallocate || (ret == 0))) { I'm confused by the (ret == 0) check here. Do you still want to call madvise() if fallocate() succeeded? > #if defined(CONFIG_MADVISE) > ret = madvise(host_startaddr, length, MADV_DONTNEED); > fprintf(stderr, "%s: Did madvise for %p got %d\n", __func__, > host_startaddr, ret); > #endif > } Anyway, now I'm considering simply not touching ram_block_discard_range() and adding a new helper, because the requirements are different. Maybe in the future we can make the two functions share code, if we decide FALLOC_FL_PUNCH_HOLE will be useful for RAM_NONPERSISTENT too. (BTW, I will probably rename "persistent=no"/RAM_NONPERSISTENT to something more explicit about data being dropped, like "free-on-exit=yes" or "disposable=yes"). > > Dave > > > Signed-off-by: Eduardo Habkost <ehabk...@redhat.com> > > --- > > exec.c | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/exec.c b/exec.c > > index 585d6ed6d7..a6e9ed4ece 100644 > > --- a/exec.c > > +++ b/exec.c > > @@ -102,6 +102,11 @@ static MemoryRegion io_mem_unassigned; > > */ > > #define RAM_RESIZEABLE (1 << 2) > > > > +/* RAMBlock contents are not persistent, and we can discard memory contents > > + * when freeing the memory block. > > + */ > > +#define RAM_NONPERSISTENT (1 << 3) > > + > > #endif > > > > #ifdef TARGET_PAGE_BITS_VARY > > @@ -2061,6 +2066,10 @@ void qemu_ram_free(RAMBlock *block) > > ram_block_notify_remove(block->host, block->max_length); > > } > > > > + if (block->flags & RAM_NONPERSISTENT) { > > + ram_block_discard_range(block, 0, block->max_length); > > + } > > + > > qemu_mutex_lock_ramlist(); > > QLIST_REMOVE_RCU(block, next); > > ram_list.mru_block = NULL; > > @@ -3537,7 +3546,13 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t > > start, size_t length) > > /* Note: We need the madvise MADV_DONTNEED behaviour of > > definitely > > * freeing the page. > > */ > > - ret = madvise(host_startaddr, length, MADV_DONTNEED); > > + if (rb->flags & RAM_NONPERSISTENT) { > > + ret = madvise(host_startaddr, length, MADV_REMOVE); > > + } > > + /* Fallback to MADV_DONTNEED if MADV_REMOVE fails */ > > + if (ret || !(rb->flags & RAM_NONPERSISTENT)) { > > + ret = madvise(host_startaddr, length, MADV_DONTNEED); > > + } > > #endif > > } else { > > /* Huge page case - unfortunately it can't do DONTNEED, but > > -- > > 2.11.0.259.g40922b1 > > > -- > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK -- Eduardo