On Tue, May 16, 2017 at 12:34:16PM +0100, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.pereva...@samsung.com) wrote:
> > This patch provides blocktime calculation per vCPU,
> > as a summary and as an overlapped value for all vCPUs.
> >
> > This approach was suggested by Peter Xu, as an improvement of the
> > previous approach where QEMU kept a tree with the faulted page address
> > and a cpus bitmask in it. Now QEMU keeps an array with the faulted page
> > address as value and the vCPU as index. It helps to find the proper vCPU
> > at UFFD_COPY time. Also it keeps a list of blocktime per vCPU (could be
> > traced with page_fault_addr).
> >
> > Blocktime will not be calculated if the postcopy_blocktime field of
> > MigrationIncomingState wasn't initialized.
> >
> > Signed-off-by: Alexey Perevalov <a.pereva...@samsung.com>
>
> I have some multi-threading/ordering worries still.
>
> The fault thread receives faults over the ufd and calls
> mark_postcopy_blocktime_begin.  That's fine.
>
> The receiving thread receives pages, calls place page, and
> calls mark_postcopy_blocktime_end.  That's also fine.
>
> However, remember that we send pages from the source without
> them being requested, as background transfers; consider:
>
>     Source          receive-thread      fault-thread
>
> 1   Send A
> 2                   Receive A
> 3                                       Access A
> 4                                       Report on UFD
> 5                   Place
> 6                                       Read UFD entry
>
> Placing and reading the UFD race - and up till now that's been fine;
> so we can read off the ufd an address that's already on its way from
> the source, and which we might just be receiving, or that we might
> have already placed.
>
> In this code, at (6) won't you call mark_postcopy_blocktime_begin
> even though it's already been placed at (5)?  Then that blocktime
> will stay set until the end of the run?

Could you clarify what "Read UFD entry" means here?  "Place", I assume,
is postcopy_place_page.
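If "Read UFD entry" is the read() of the uffd_msg in postcopy_ram_fault_thread,
then, paraphrasing this patch rather than quoting the literal code, the two
marks sit roughly like this:

    /* fault thread, postcopy_ram_fault_thread(), paraphrased */
    read(mis->userfault_fd, &msg, sizeof(msg));          /* (6) Read UFD entry */
    mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
            get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));

    /* receive thread, postcopy_place_page(), paraphrased */
    ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct); /* (5) Place */
    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);

So if the UFFDIO_COPY at (5) completes before the message read at (6) is
processed for the same address, mark_postcopy_blocktime_begin still records
an address that has already been placed, and nothing clears it until the next
fault on that vCPU - more on that below.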
>
> Perhaps that's not a problem; if mark_postcopy_blocktime_end is called
> for a different address it won't count the blocktime; and when
> mark_postcopy_blocktime_begin is called for a different address it'll
> remove the address that was a problem above - so perhaps that's fine?

mark_postcopy_blocktime_begin doesn't clear state, it only sets it.  I think
I can see the nature of the race now: the kernel reports a pagefault for a
page which is in the middle of being copied, so we will never copy it again,
and there is a chance the address stays in vcpu_addr forever.

     Source       receive-thread                fault-thread
4                                               Report on UFD
5                 Place ioctl(UFFD_COPY)
5.1               mark_postcopy_blocktime_end
4.1                                             mark_postcopy_blocktime_begin

I think that is possible, but the probability is low; it increases for small
page sizes such as 4K, because ioctl(UFFD_COPY) copies the memory like a
memcpy does, so the time spent copying inside the ioctl depends on the page
size.  I am thinking of adding logic to mark_postcopy_blocktime_begin to
check whether mark_postcopy_blocktime_end has already run for that page,
just so the address does not stay in vcpu_addr forever (a rough sketch of
what I mean is at the end of this mail).

> >
> > ---
> >  migration/postcopy-ram.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  migration/trace-events   |  5 ++-
> >  2 files changed, 90 insertions(+), 2 deletions(-)
> >
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a1f1705..e2660ae 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -23,6 +23,7 @@
> >  #include "migration/postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/balloon.h"
> > +#include <sys/param.h>
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >
> > @@ -542,6 +543,86 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >      return 0;
> >  }
> >
> > +static int get_mem_fault_cpu_index(uint32_t pid)
> > +{
> > +    CPUState *cpu_iter;
> > +
> > +    CPU_FOREACH(cpu_iter) {
> > +        if (cpu_iter->thread_id == pid) {
> > +            return cpu_iter->cpu_index;
> > +        }
> > +    }
> > +    trace_get_mem_fault_cpu_index(pid);
> > +    return -1;
> > +}
> > +
> > +static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int64_t now_ms;
> > +    if (!mis->blocktime_ctx || cpu < 0) {
> > +        return;
> > +    }
>
> You might consider:
>
>     PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
>     int64_t now_ms;
>     if (!dc || cpu < 0) {
>         return;
>     }
>
> it gets rid of the two reads of mis->blocktime_ctx
> (You do something similar in a few places)
>
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +    dc = mis->blocktime_ctx;
> > +    if (dc->vcpu_addr[cpu] == 0) {
> > +        atomic_inc(&dc->smp_cpus_down);
> > +    }
> > +
> > +    atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
> > +    atomic_xchg__nocheck(&dc->last_begin, now_ms);
> > +    atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);
> > +
> > +    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
> > +                                        cpu);
> > +}
> > +
> > +static void mark_postcopy_blocktime_end(uint64_t addr)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyBlocktimeContext *dc;
> > +    int i, affected_cpu = 0;
> > +    int64_t now_ms;
> > +    bool vcpu_total_blocktime = false;
> > +
> > +    if (!mis->blocktime_ctx) {
> > +        return;
> > +    }
> > +    dc = mis->blocktime_ctx;
> > +    now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +
> > +    /* lookup cpu, to clear it,
> > +     * that algorithm looks straighforward, but it's not
> > +     * optimal, more optimal algorithm is keeping tree or hash
> > +     * where key is address value is a list of  */
> > +    for (i = 0; i < smp_cpus; i++) {
> > +        uint64_t vcpu_blocktime = 0;
> > +        if (atomic_fetch_add(&dc->vcpu_addr[i], 0) != addr) {
> > +            continue;
> > +        }
> > +        atomic_xchg__nocheck(&dc->vcpu_addr[i], 0);
> > +        vcpu_blocktime = now_ms -
> > +            atomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
> > +        affected_cpu += 1;
> > +        /* we need to know is that mark_postcopy_end was due to
> > +         * faulted page, another possible case it's prefetched
> > +         * page and in that case we shouldn't be here */
> > +        if (!vcpu_total_blocktime &&
> > +            atomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
> > +            vcpu_total_blocktime = true;
> > +        }
> > +        /* continue cycle, due to one page could affect several vCPUs */
> > +        dc->vcpu_blocktime[i] += vcpu_blocktime;
> > +    }
> > +
> > +    atomic_sub(&dc->smp_cpus_down, affected_cpu);
> > +    if (vcpu_total_blocktime) {
> > +        dc->total_blocktime += now_ms - atomic_fetch_add(&dc->last_begin, 0);
>
> This total_blocktime calculation is a little odd; the 'last_begin' is
> not necessarily related to the same CPU or same block.

last_begin doesn't need to be related to the same vCPU; the vCPU doesn't
matter here.  last_begin is the time when mark_postcopy_blocktime_begin was
last called (the most recent pagefault), so if at this point we are 100% sure
that all vCPUs are blocked, the interval during which all of them were
blocked starts at last_begin, even if last_begin was set by a fault on
another vCPU (there is a small worked example at the end of this mail).

>
> Dave
>
> > +    }
> > +    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime);
> > +}
> > +
> >  /*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> > @@ -619,8 +700,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
> >          rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
> >          trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
> >                                                  qemu_ram_get_idstr(rb),
> > -                                                rb_offset);
> > +                                                rb_offset,
> > +                                                msg.arg.pagefault.feat.ptid);
> >
> > +        mark_postcopy_blocktime_begin((uintptr_t)(msg.arg.pagefault.address),
> > +                get_mem_fault_cpu_index(msg.arg.pagefault.feat.ptid));
> >          /*
> >           * Send the request to the source - we want to request one
> >           * of our host page sizes (which is >= TPS)
> > @@ -715,6 +799,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> >
> >          return -e;
> >      }
> > +    mark_postcopy_blocktime_end((uint64_t)(uintptr_t)host);
> >
> >      trace_postcopy_place_page(host);
> >      return 0;
> > diff --git a/migration/trace-events b/migration/trace-events
> > index b8f01a2..9424e3e 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -110,6 +110,8 @@ process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> >  process_incoming_migration_co_postcopy_end_main(void) ""
> >  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> >  migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname) "ioc=%p ioctype=%s hostname=%s"
> > +mark_postcopy_blocktime_begin(uint64_t addr, void *dd, int64_t time, int cpu) "addr 0x%" PRIx64 " dd %p time %" PRId64 " cpu %d"
> > +mark_postcopy_blocktime_end(uint64_t addr, void *dd, int64_t time) "addr 0x%" PRIx64 " dd %p time %" PRId64
> >
> >  # migration/rdma.c
> >  qemu_rdma_accept_incoming_migration(void) ""
> > @@ -186,7 +188,7 @@ postcopy_ram_enable_notify(void) ""
> >  postcopy_ram_fault_thread_entry(void) ""
> >  postcopy_ram_fault_thread_exit(void) ""
> >  postcopy_ram_fault_thread_quit(void) ""
> > -postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " rb=%s offset=%zx"
> > +postcopy_ram_fault_thread_request(uint64_t hostaddr, const char *ramblock, size_t offset, uint32_t pid) "Request for HVA=%" PRIx64 " rb=%s offset=%zx %u"
> >  postcopy_ram_incoming_cleanup_closeuf(void) ""
> >  postcopy_ram_incoming_cleanup_entry(void) ""
> >  postcopy_ram_incoming_cleanup_exit(void) ""
> > @@ -195,6 +197,7 @@ save_xbzrle_page_skipping(void) ""
> >  save_xbzrle_page_overflow(void) ""
> >  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> >  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > +get_mem_fault_cpu_index(uint32_t pid) "pid %u is not vCPU"
> >
> >  # migration/exec.c
> >  migration_exec_outgoing(const char *cmd) "cmd=%s"
> > --
> > 1.9.1
> >
>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

--
BR
Alexey
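Here is roughly what I have in mind for the check mentioned above (a sketch
only, not tested).  page_already_received() is a made-up helper - it could be
backed by, for example, a per-RAMBlock bitmap of already-copied pages set on
the UFFD_COPY path - and I have folded in your suggestion to read
mis->blocktime_ctx only once:

    static void mark_postcopy_blocktime_begin(uint64_t addr, int cpu)
    {
        MigrationIncomingState *mis = migration_incoming_get_current();
        PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
        int64_t now_ms;

        if (!dc || cpu < 0) {
            return;
        }

        now_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
        if (dc->vcpu_addr[cpu] == 0) {
            atomic_inc(&dc->smp_cpus_down);
        }
        atomic_xchg__nocheck(&dc->vcpu_addr[cpu], addr);
        atomic_xchg__nocheck(&dc->last_begin, now_ms);
        atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], now_ms);

        /* Check only after the stores above; the idea is that a racing
         * mark_postcopy_blocktime_end() either sees our address and clears
         * it, or we see the page as already received and undo the stores
         * ourselves - either way the address cannot stay in vcpu_addr
         * forever. */
        if (page_already_received(addr)) {
            atomic_xchg__nocheck(&dc->vcpu_addr[cpu], 0);
            atomic_xchg__nocheck(&dc->page_fault_vcpu_time[cpu], 0);
            atomic_dec(&dc->smp_cpus_down);
            return;
        }

        trace_mark_postcopy_blocktime_begin(addr, dc,
                                            dc->page_fault_vcpu_time[cpu], cpu);
    }

There is still a window where mark_postcopy_blocktime_end clears the entry
first and we then undo it a second time; that would need the same kind of
ordering care on the end side, or doing the bookkeeping under a lock.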
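And a small worked example of the total_blocktime accumulation, assuming two
vCPUs (all the numbers are made up):

    t=100ms  vCPU0 faults on page A:  vcpu_addr[0]=A, page_fault_vcpu_time[0]=100,
                                      last_begin=100, smp_cpus_down=1
    t=120ms  vCPU1 faults on page B:  vcpu_addr[1]=B, page_fault_vcpu_time[1]=120,
                                      last_begin=120, smp_cpus_down=2
    t=150ms  page A is placed:        vcpu_blocktime[0] += 150 - 100 = 50ms;
                                      smp_cpus_down == smp_cpus, so
                                      total_blocktime += 150 - last_begin = 30ms
    t=170ms  page B is placed:        vcpu_blocktime[1] += 170 - 120 = 50ms;
                                      smp_cpus_down != smp_cpus now, so
                                      total_blocktime stays at 30ms

The 30ms added to total_blocktime is exactly the interval during which both
vCPUs were blocked (120ms to 150ms), and it starts at last_begin even though
last_begin was set by vCPU1's fault while the page placed at t=150 belongs to
vCPU0.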