On 8/21/19 10:37 AM, David Hildenbrand wrote:
> Hah, guess what, I implemented a similar variant of "fetch all
> of the host addresses" *but* it is not as easy as you might
> think (sorry for the bad news).
I think it is, because I didn't think it *that* easy. :-)

> There are certain cases where we can't get access to the raw host
> page. Namely, cpu watchpoints, LAP, NOTDIRTY. In summary: this won't
> work as you describe. (my first approach did exactly this)

NOTDIRTY and LAP are automatically handled via probe_write faulting
instead of returning the address. I think there may be a bug in
probe_write at present, not checking watchpoints.

Watchpoints could be handled the same way, if we were to export
check_watchpoint from exec.c. Indeed, I see no way to handle
watchpoints correctly if we don't. I think that's an outstanding bug
with probe_write.

Any other objections? I certainly think that restricting the size of
such operations to one page is a large simplification over the
S390Access array thing that you create in this patch.
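
For the record, here is roughly the shape I have in mind -- an
untested sketch, hand-waving over the LAP corner cases, and assuming
probe_write gets the watchpoint check discussed above. do_safe_mvc is
a made-up name; the helpers it calls all exist today:

    static void do_safe_mvc(CPUS390XState *env, uint32_t l, uint64_t dest,
                            uint64_t src, uintptr_t ra)
    {
        int mmu_idx = cpu_mmu_index(env, false);
        /* 1 <= l <= 256, so each operand spans at most two pages.  */
        uint32_t d1 = adj_len_to_page(l, dest);
        uint32_t i;

        /* Translate and fault up front, before the first store: probe
           both destination pages for writability, and touch the first
           and last source byte for readability.  */
        probe_write(env, dest, d1, mmu_idx, ra);
        if (d1 != l) {
            probe_write(env, dest + d1, l - d1, mmu_idx, ra);
        }
        cpu_ldub_data_ra(env, src, ra);
        cpu_ldub_data_ra(env, src + l - 1, ra);

        /* From here on the byte accesses hit valid TLB entries and
           should no longer fault; overlap behaves as architected
           because we copy strictly left to right.  */
        for (i = 0; i < l; i++) {
            uint8_t x = cpu_ldub_data_ra(env, src + i, ra);
            cpu_stb_data_ra(env, dest + i, x, ra);
        }
    }

The byte loop is of course the dumb version -- the obvious next step
is to grab host pointers via tlb_vaddr_to_host() and use memmove()
whenever both sides allow it. But note: no descriptor array, and
nothing to keep in sync.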
r~

>
> The following patch requires another re-factoring
> (tcg_s390_cpu_mmu_translate), but you should get the idea.
>
>
>
> From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <da...@redhat.com>
> Date: Tue, 20 Aug 2019 09:37:11 +0200
> Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation
>
> MVC can cross page boundaries. In case we fault on the second page, we
> already partially copied data. If we have overlaps, we would
> trigger a fault after having partially moved data, eventually having
> our original data already overwritten. When continuing after the fault,
> we would try to move already modified data, not the original data -
> very bad.
>
> glibc started to use MVC for forward memmove() and is able to trigger
> exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
> currently fails to install as we trigger rpm database corruptions due to
> this bug.
>
> We need a way to translate a virtual address range to individual pages that
> we can access later on without triggering faults. Probing all virtual
> addresses once before the read/write is not sufficient - the guest could
> have modified the page tables (e.g., write-protect, map out) in between,
> so we could fault on any new tlb_fill() - we have to skip any new MMU
> translations.
>
> Unfortunately, there are TLB entries for which we cannot get a host
> address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
> avoid new MMU translations via the ordinary ld/st helpers. Let's fall back
> to guest physical addresses in these cases, which we access via
> cpu_physical_memory_(read|write).
>
> This change reduced the boot time for s390x guests (to prompt) from ~1:29
> min to ~1:19 min in my tests. For example, LAP-protected pages are now only
> translated once when writing to them using MVC and we don't always fall
> back to byte-based copies.
>
> We will want to use the same mechanism for other accesses as well (e.g.,
> mvcl), so prepare for that right away.
>
> Signed-off-by: David Hildenbrand <da...@redhat.com>
> ---
>  target/s390x/mem_helper.c | 213 +++++++++++++++++++++++++++++++++++---
>  1 file changed, 200 insertions(+), 13 deletions(-)
>
> diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
> index 91ba2e03d9..1ca293e00d 100644
> --- a/target/s390x/mem_helper.c
> +++ b/target/s390x/mem_helper.c
> @@ -24,8 +24,10 @@
>  #include "exec/helper-proto.h"
>  #include "exec/exec-all.h"
>  #include "exec/cpu_ldst.h"
> +#include "exec/cpu-common.h"
>  #include "qemu/int128.h"
>  #include "qemu/atomic128.h"
> +#include "tcg_s390x.h"
>
>  #if !defined(CONFIG_USER_ONLY)
>  #include "hw/s390x/storage-keys.h"
> @@ -104,6 +106,181 @@ static inline void cpu_stsize_data_ra(CPUS390XState *env, uint64_t addr,
>      }
>  }
>
> +/*
> + * An access covers one page, except for the start/end of the translated
> + * virtual address range.
> + */
> +typedef struct S390Access {
> +    union {
> +        char *haddr;
> +        hwaddr paddr;
> +    };
> +    uint16_t size;
> +    bool isHaddr;
> +} S390Access;
> +
> +/*
> + * Prepare access to a virtual address range, guaranteeing we won't trigger
> + * faults during the actual access. Sometimes we can't get access to the
> + * host address (e.g., LAP, cpu watchpoints/PER, clean pages, ...). Then, we
> + * translate to guest physical addresses instead. We'll have to perform
> + * slower, indirect access to these physical addresses then.
> + */
> +static void access_prepare_idx(CPUS390XState *env, S390Access access[],
> +                               int nb_access, vaddr vaddr, target_ulong size,
> +                               MMUAccessType access_type, int mmu_idx,
> +                               uintptr_t ra)
> +{
> +    int i = 0;
> +    int cur_size;
> +
> +    /*
> +     * After we obtained the host address of a TLB entry, that entry might
> +     * be invalidated again - e.g., via tlb_set_dirty(), via another
> +     * tlb_fill(). We assume here that it is fine to temporarily store the
> +     * host address to access it later - we didn't agree to any TLB flush
> +     * and there seems to be no mechanism protecting the return value of
> +     * tlb_vaddr_to_host().
> +     */
> +    while (size) {
> +        g_assert(i < nb_access);
> +        cur_size = adj_len_to_page(size, vaddr);
> +
> +        access[i].isHaddr = true;
> +        access[i].haddr = tlb_vaddr_to_host(env, vaddr, access_type,
> +                                            mmu_idx);
> +        if (!access[i].haddr) {
> +            access[i].isHaddr = false;
> +            tcg_s390_cpu_mmu_translate(env, vaddr, cur_size, access_type,
> +                                       mmu_idx, false, &access[i].paddr,
> +                                       NULL, ra);
> +        }
> +        access[i].size = cur_size;
> +
> +        vaddr += cur_size;
> +        size -= cur_size;
> +        i++;
> +    }
> +
> +    /* let's zero out the remaining entries, so they have a size of 0 */
> +    if (i < nb_access) {
> +        memset(&access[i], 0, sizeof(S390Access) * (nb_access - i));
> +    }
> +}
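
As an aside, this is the array business I mean above: every caller has
to size access[] for the worst case. Two entries are obviously enough
for MVC, but for the mvcl reuse you mention the bound grows with the
operand, i.e. every user ends up with something like (illustrative
only, not from your patch):

    /* worst case: unaligned head and tail plus full pages in between */
    int nb_access = (size >> TARGET_PAGE_BITS) + 2;
    S390Access *access = g_new(S390Access, nb_access);

plus a matching g_free() on every exit path. Since mvcl is
interruptible anyway, capping the amount processed per helper
invocation at one page would make all of that go away.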
> +
> +static void access_prepare(CPUS390XState *env, S390Access access[],
> +                           int nb_access, target_ulong vaddr,
> +                           target_ulong size, MMUAccessType access_type,
> +                           uintptr_t ra)
> +{
> +    int mmu_idx = cpu_mmu_index(env, false);
> +
> +    access_prepare_idx(env, access, nb_access, vaddr, size, access_type,
> +                       mmu_idx, ra);
> +}
> +
> +static void access_set(CPUS390XState *env, S390Access write[], int nb_write,
> +                       uint8_t byte, target_ulong size)
> +{
> +    target_ulong cur_size;
> +    void *buf = NULL;
> +    int w = 0;
> +
> +    while (size) {
> +        g_assert(w < nb_write);
> +        if (!write[w].size) {
> +            w++;
> +            continue;
> +        }
> +
> +        cur_size = MIN(size, write[w].size);
> +        if (write[w].isHaddr) {
> +            memset(write[w].haddr, byte, cur_size);
> +            write[w].haddr += cur_size;
> +        } else {
> +#ifndef CONFIG_USER_ONLY
> +            if (!buf) {
> +                buf = g_malloc(TARGET_PAGE_SIZE);
> +                memset(buf, byte, cur_size);
> +            }
> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
> +            write[w].paddr += cur_size;
> +#else
> +            g_assert_not_reached();
> +#endif
> +        }
> +        write[w].size -= cur_size;
> +        size -= cur_size;
> +    }
> +    g_free(buf);
> +}
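
One thing I noticed while reading access_set(): buf is only filled for
the first chunk's cur_size, but a later chunk may be larger -- e.g., a
16 byte head on the first page followed by a 240 byte tail on the
second -- so cpu_physical_memory_write() would then copy uninitialized
heap bytes into guest memory. Filling the whole buffer once should do,
something like:

    if (!buf) {
        buf = g_malloc(TARGET_PAGE_SIZE);
        memset(buf, byte, TARGET_PAGE_SIZE);
    }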
> +
> +/*
> + * Copy memory in chunks up to chunk_size. If the ranges don't overlap or
> + * if it's a forward move, this function behaves like memmove().
> + *
> + * To achieve a backwards byte-by-byte copy (e.g., MVC), the chunk_size
> + * must not be bigger than the address difference (in the worst case,
> + * 1 byte).
> + */
> +static void access_copy(CPUS390XState *env, S390Access write[], int nb_write,
> +                        S390Access read[], int nb_read, target_ulong size,
> +                        target_ulong chunk_size)
> +{
> +    target_ulong cur_size;
> +    void *buf = NULL;
> +    int r = 0;
> +    int w = 0;
> +
> +    g_assert(chunk_size > 0);
> +    chunk_size = MIN(chunk_size, TARGET_PAGE_SIZE);
> +
> +    while (size) {
> +        g_assert(w < nb_write);
> +        g_assert(r < nb_read);
> +        if (!write[w].size) {
> +            w++;
> +            continue;
> +        }
> +        if (!read[r].size) {
> +            r++;
> +            continue;
> +        }
> +        cur_size = MIN(MIN(MIN(size, write[w].size), read[r].size),
> +                       chunk_size);
> +
> +        if (write[w].isHaddr && read[r].isHaddr) {
> +            memmove(write[w].haddr, read[r].haddr, cur_size);
> +            write[w].haddr += cur_size;
> +            read[r].haddr += cur_size;
> +#ifndef CONFIG_USER_ONLY
> +        } else if (!write[w].isHaddr && read[r].isHaddr) {
> +            cpu_physical_memory_write(write[w].paddr,
> +                                      read[r].haddr, cur_size);
> +            write[w].paddr += cur_size;
> +            read[r].haddr += cur_size;
> +        } else if (write[w].isHaddr && !read[r].isHaddr) {
> +            cpu_physical_memory_read(read[r].paddr,
> +                                     write[w].haddr, cur_size);
> +            write[w].haddr += cur_size;
> +            read[r].paddr += cur_size;
> +        } else {
> +            if (!buf) {
> +                buf = g_malloc(chunk_size);
> +            }
> +            cpu_physical_memory_read(read[r].paddr, buf, cur_size);
> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
> +            write[w].paddr += cur_size;
> +            read[r].paddr += cur_size;
> +#else
> +        } else {
> +            g_assert_not_reached();
> +#endif
> +        }
> +        write[w].size -= cur_size;
> +        read[r].size -= cur_size;
> +        size -= cur_size;
> +    }
> +    g_free(buf);
> +}
> +
>  static void fast_memset(CPUS390XState *env, uint64_t dest, uint8_t byte,
>                          uint32_t l, uintptr_t ra)
>  {
> @@ -302,24 +479,34 @@ uint32_t HELPER(oc)(CPUS390XState *env, uint32_t l, uint64_t dest,
>  static uint32_t do_helper_mvc(CPUS390XState *env, uint32_t l, uint64_t dest,
>                                uint64_t src, uintptr_t ra)
>  {
> -    uint32_t i;
> +    /* 256 bytes cannot cross more than two pages */
> +    S390Access read[2];
> +    S390Access write[2];
>
>      HELPER_LOG("%s l %d dest %" PRIx64 " src %" PRIx64 "\n",
>                 __func__, l, dest, src);
> +    l++;
>
> -    /* mvc and memmove do not behave the same when areas overlap! */
> -    /* mvc with source pointing to the byte after the destination is the
> -       same as memset with the first source byte */
> +    g_assert(l <= 256);
> +    access_prepare(env, write, ARRAY_SIZE(write), dest, l, MMU_DATA_STORE,
> +                   ra);
> +
> +    /*
> +     * The result of MVC is as if moving single bytes from left to right
> +     * (in contrast to memmove()). It can be used like memset().
> +     */
>      if (dest == src + 1) {
> -        fast_memset(env, dest, cpu_ldub_data_ra(env, src, ra), l + 1, ra);
> -    } else if (dest < src || src + l < dest) {
> -        fast_memmove(env, dest, src, l + 1, ra);
> -    } else {
> -        /* slow version with byte accesses which always work */
> -        for (i = 0; i <= l; i++) {
> -            uint8_t x = cpu_ldub_data_ra(env, src + i, ra);
> -            cpu_stb_data_ra(env, dest + i, x, ra);
> -        }
> +        access_set(env, write, ARRAY_SIZE(write),
> +                   cpu_ldub_data_ra(env, src, ra), l);
> +        return env->cc_op;
> +    }
> +
> +    access_prepare(env, read, ARRAY_SIZE(read), src, l, MMU_DATA_LOAD, ra);
> +    if (dest < src || src + l <= dest) {
> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
> +                    TARGET_PAGE_SIZE);
> +    } else if (src < dest) {
> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
> +                    dest - src);
>      }
>
>      return env->cc_op;
>
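
As for the chunk_size comment on access_copy(): the invariant does
hold, and since it is subtle, here is the standalone sanity check I'd
use to convince myself (plain host C, nothing qemu-specific; off, len
and chunk are arbitrary values with chunk <= off, i.e. chunk_size not
bigger than the address difference):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char a[32] = "abcdefghijklmnop", b[32];
        const size_t off = 3, len = 13, chunk = 3;  /* chunk <= off */

        /* Reference: the architected left-to-right byte copy of MVC. */
        memcpy(b, a, sizeof(a));
        for (size_t i = 0; i < len; i++) {
            b[off + i] = b[i];
        }

        /* Chunked variant, as access_copy() does it. */
        for (size_t done = 0; done < len; done += chunk) {
            size_t n = len - done < chunk ? len - done : chunk;
            memmove(a + off + done, a + done, n);
        }

        printf("%s\n%s\n", a, b);   /* both: "abcabcabcabcabca" */
        return !!memcmp(a, b, sizeof(a));
    }

Each chunk only reads bytes that already carry their final value
(chunk <= off guarantees it), so the propagation comes out the same as
the byte loop.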