On Wed, 17 Feb 2021 at 04:31, Richard Henderson
<richard.hender...@linaro.org> wrote:
> On 2/16/21 8:15 AM, Thomas Huth wrote:
>
> With
>
> diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
> index 1376cdc404..3c5f38be62 100644
> --- a/tcg/aarch64/tcg-target.c.inc
> +++ b/tcg/aarch64/tcg-target.c.inc
> @@ -1622,6 +1622,8 @@ static void tcg_out_tlb_read
>      TCGType mask_type;
>      uint64_t compare_mask;
>
> +    tcg_out_mb(s, TCG_MO_ALL);
> +
>      mask_type = (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32
>                   ? TCG_TYPE_I64 : TCG_TYPE_I32);
>
> which is a gigantic hammer, adding a host barrier before every qemu guest
> access, I can no longer provoke a failure (previously visible 1 in 4, now no
> failures in 100).
>
> With that as a data point for success, I'm going to try to use host
> load-acquire / store-release instructions, and then apply TCG_GUEST_DEFAULT_MO
> and see if I can find something that works reasonably.
This isn't aarch64-host-specific, though, is it? It's going to be the
situation for any host with a relaxed memory model. Do we really want to
make all loads and stores lower-performance by adding in the ldacq/strel
(or worse, barriers everywhere on host archs without ldacq/strel)?

I feel like there ought to be an alternate approach involving using some
kind of exclusion to ensure that we don't run the iothreads in parallel
with the vCPU thread if we're using the non-MTTCG setup where all the
vCPUs are on a single thread, and that that's probably less of a perf hit.

thanks
-- PMM
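[For readers following along: the trade-off being debated — a full barrier
before every guest access versus cheaper load-acquire/store-release pairs —
can be sketched with C11 atomics. This is an illustrative sketch only, not
QEMU code; the function names `publish`/`consume` are hypothetical. On
aarch64 the release store and acquire load below typically compile to
STLR/LDAR, while a full barrier corresponds to what `tcg_out_mb(s,
TCG_MO_ALL)` emits, roughly `atomic_thread_fence(memory_order_seq_cst)`.]

#include <assert.h>
#include <stdatomic.h>

static _Atomic int data;
static _Atomic int flag;

/* Producer: publish data, then set flag with release ordering, so the
 * data store cannot be reordered after the flag store. */
static void publish(int value)
{
    atomic_store_explicit(&data, value, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Consumer: an acquire load of the flag guarantees that once it
 * observes flag == 1, the earlier data store is visible too.
 * A full fence would give the same guarantee at higher cost. */
static int consume(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire)) {
        /* spin until the producer publishes */
    }
    return atomic_load_explicit(&data, memory_order_relaxed);
}

int main(void)
{
    publish(42);
    assert(consume() == 42);
    return 0;
}

(Run single-threaded here purely to keep the sketch self-contained; the
ordering guarantees matter when publish and consume run on different
threads, as with iothreads racing a vCPU thread.)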