[Qemu-devel] [PATCH 3/3] rcu: add liburcu knob to configure script

2015-02-03 Thread Emilio G. Cota
The default is --disable-liburcu, which means QEMU's RCU
implementation is used.

When building with --enable-liburcu, the memory barrier version
of liburcu (urcu-mb) is linked against. We might want to make this
configurable in the future for greater performance; for now we just
match the liburcu flavor to the one that QEMU's RCU is using.

Note that the rcutorture under tests/ doesn't work with --enable-liburcu,
since it relies on some internals of QEMU's RCU. liburcu is tested with
the urcutorture distributed with it.
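
For illustration only, the compile-time switch amounts to something like
the following in include/qemu/rcu.h (a sketch, not the exact layout of
the diff below; selecting the mb flavor via RCU_MB is an assumption
about how liburcu is conventionally included):

    #ifdef CONFIG_LIBURCU
    #define RCU_MB          /* match QEMU's memory-barrier flavor */
    #include <urcu.h>       /* rcu_read_lock(), synchronize_rcu(), call_rcu() */
    #else
    /* fall through to QEMU's own RCU declarations */
    #endif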

Signed-off-by: Emilio G. Cota 
---
 configure             | 35 +++++++++++++++++++++++++++++++++++
 include/qemu/atomic.h |  4 ++++
 include/qemu/rcu.h    | 38 ++++++++++++++++++++++++--------------
 util/Makefile.objs    |  2 +-
 4 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/configure b/configure
index f185dd0..71c9802 100755
--- a/configure
+++ b/configure
@@ -335,6 +335,7 @@ libssh2=""
 vhdx=""
 quorum=""
 numa=""
+liburcu=""
 
 # parse CC options first
 for opt do
@@ -1129,6 +1130,10 @@ for opt do
   ;;
   --enable-numa) numa="yes"
   ;;
+  --disable-liburcu) liburcu="no"
+  ;;
+  --enable-liburcu) liburcu="yes"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1400,6 +1405,8 @@ Advanced options (experts only):
   --enable-quorum  enable quorum block filter support
   --disable-numa   disable libnuma support
  --enable-numa    enable libnuma support
+  --disable-liburcu disable liburcu, use QEMU RCU instead
+  --enable-liburcu enable liburcu, the user-space RCU library
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -3277,6 +3284,29 @@ EOF
 fi
 
 ##
+# liburcu probe
+
+if test "$liburcu" != "no" ; then
+  cat > $TMPC << EOF
+char synchronize_rcu_mb();
+int main()
+{
+return synchronize_rcu_mb ();
+}
+EOF
+
+  if compile_prog "" "-lurcu-mb" ; then
+liburcu=yes
+libs_softmmu="-lurcu-mb $libs_softmmu"
+  else
+if test "$liburcu" = "yes" ; then
+  feature_not_found "liburcu" "install liburcu > v0.6, or --disable-liburcu"
+fi
+liburcu=no
+  fi
+fi
+
+##
 # signalfd probe
 signalfd="no"
 cat > $TMPC << EOF
@@ -4370,6 +4400,7 @@ echo "Quorum            $quorum"
 echo "lzo support       $lzo"
 echo "snappy support    $snappy"
 echo "NUMA host support $numa"
+echo "liburcu support   $liburcu"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -5368,6 +5399,10 @@ if test "$numa" = "yes"; then
   echo "CONFIG_NUMA=y" >> $config_host_mak
 fi
 
+if test "$liburcu" = "yes"; then
+  echo "CONFIG_LIBURCU=y" >> $config_host_mak
+fi
+
 # build tree in object directory in case the source is not in the current directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests"
 DIRS="$DIRS fsdev"
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 4795e28..d7ad0d5 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -129,6 +129,8 @@
 #define atomic_set(ptr, i) ((*(__typeof__(*ptr) volatile*) (ptr)) = (i))
 #endif
 
+#ifndef CONFIG_LIBURCU
+
 /**
  * rcu_dereference - reads a RCU-protected pointer to a local variable
  * into a RCU read-side critical section. The pointer can later be safely
@@ -191,6 +193,8 @@
 #endif
 #endif
 
+#endif /* !CONFIG_LIBURCU */
+
 /* These have the same semantics as Java volatile variables.
  * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
  * "1. Issue a StoreStore barrier (wmb) before each volatile store."
diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 492d943..0aeb8f0 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -40,6 +40,29 @@
 extern "C" {
 #endif
 
+struct rcu_head;
+typedef void RCUCBFunc(struct rcu_head *head);
+
+/* The operands of the minus operator must have the same type,
+ * which must be the one that we specify in the cast.
+ */
+#define call_rcu_first_elem(head, func, field)   \
+call_rcu(({  \
+ char __attribute__((unused))\
+offset_must_be_zero[-offsetof(typeof(*(head)), field)],  \
+func_type_invalid = (func) - (void (*)(typeof(head)))(func); \
+ &(head)->field; \
+  }), \
+  (RCUCBFunc *)(func))

[Qemu-devel] [PATCH 0/3] rcu: add option to use upstream liburcu

2015-02-03 Thread Emilio G. Cota
Hi Paolo + all,

First off, thanks for your ongoing RCU work, which I'm closely
following.

As you stated in the initial RCU patch (7911747b), the intent
is to keep the same API as liburcu's. I checked whether this
was the case and found a couple of issues that I'm addressing
in the appended series.

* The first two patches bring back the RCU API to exactly
  match that of liburcu.

* The third patch adds a configure flag to choose from either
  liburcu or QEMU's RCU.

Apart from this, I wonder what to do about other valuable bits in
liburcu, particularly in liburcu-cds, which I'm using currently
off-tree.  I see three ways of eventually doing this:

a) Add Windows support in liburcu, thereby eliminating the
   de facto fork in QEMU.

b) Bring (fork) liburcu-cds to QEMU, just like liburcu-mb was.

c) Add a compile-time flag (say CONFIG_LIBURCU_CDS), and then only
   use data structures from liburcu-cds where appropriate, falling
   back to traditional locked structures when !CONFIG_LIBURCU_CDS.
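
   For concreteness, a sketch of what c) could look like; the flag and
   the fallback structure here are hypothetical, not existing code:

       #ifdef CONFIG_LIBURCU_CDS
       #include <urcu/rculfhash.h>
       static struct cds_lfht *map;       /* RCU lock-free hash table */
       #else
       #include <glib.h>
       static GHashTable *map;            /* protected by a QemuMutex */
       static QemuMutex map_lock;
       #endif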

Currently I'm using c) because I cannot do either a) and b)--I
don't know anything about Windows and have no need for it.

Would c) be acceptable for upstream, provided the gains (say in
scalability/memory footprint) are significant?

Thanks,

Emilio



[Qemu-devel] [PATCH 1/3] rcu: use call_rcu semantics from liburcu

2015-02-03 Thread Emilio G. Cota
The recently added call_rcu does not match the semantics of liburcu's
call_rcu; call_rcu1 does. Fix this by doing the following:

- Rename QEMU's call_rcu  to call_rcu_first_elem
- Rename QEMU's call_rcu1 to call_rcu

The end goal is to be able to switch at compile-time between
liburcu and QEMU's RCU implementation.
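
To make the difference concrete, after this patch the two entry points
are used as follows (a sketch; struct foo and its helpers are made up
for illustration):

    struct foo {
        struct rcu_head rcu;    /* must be first for the macro form */
        int data;
    };

    static void foo_reclaim(struct rcu_head *rp)
    {
        g_free(rp);             /* rcu is first, so rp == the struct foo */
    }

    static void foo_free(struct foo *f)
    {
        g_free(f);
    }

    static void foo_delete(struct foo *f)
    {
        /* liburcu-compatible form: rcu_head pointer + rcu_head callback */
        call_rcu(&f->rcu, foo_reclaim);
        /* or: the QEMU convenience macro, whose callback takes the
         * enclosing type directly */
        call_rcu_first_elem(f, foo_free, rcu);
    }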

Signed-off-by: Emilio G. Cota 
---
 docs/rcu.txt       | 21 ++++++++++-----------
 include/qemu/rcu.h |  6 +++---
 memory.c           |  4 ++--
 util/rcu.c         |  2 +-
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/docs/rcu.txt b/docs/rcu.txt
index 61752b9..575f563 100644
--- a/docs/rcu.txt
+++ b/docs/rcu.txt
@@ -85,8 +85,8 @@ The core RCU API is small:
 synchronize_rcu.  If this is not possible (for example, because
 the updater is protected by the BQL), you can use call_rcu.
 
- void call_rcu1(struct rcu_head * head,
-void (*func)(struct rcu_head *head));
+ void call_rcu(struct rcu_head * head,
+   void (*func)(struct rcu_head *head));
 
 This function invokes func(head) after all pre-existing RCU
 read-side critical sections on all threads have completed.  This
@@ -106,7 +106,7 @@ The core RCU API is small:
 so that the reclaimer function can fetch the struct foo address
 and free it:
 
-call_rcu1(&foo.rcu, foo_reclaim);
+call_rcu(&foo.rcu, foo_reclaim);
 
 void foo_reclaim(struct rcu_head *rp)
 {
@@ -117,15 +117,15 @@ The core RCU API is small:
 For the common case where the rcu_head member is the first of the
 struct, you can use the following macro.
 
- void call_rcu(T *p,
-   void (*func)(T *p),
-   field-name);
+ void call_rcu_first_elem(T *p,
+  void (*func)(T *p),
+  field-name);
 
-call_rcu1 is typically used through this macro, in the common case
+call_rcu is typically used through this macro, in the common case
 where the "struct rcu_head" is the first field in the struct.  In
 the above case, one could have written simply:
 
-call_rcu(foo_reclaim, g_free, rcu);
+call_rcu_first_elem(foo_reclaim, g_free, rcu);
 
  typeof(*p) atomic_rcu_read(p);
 
@@ -196,10 +196,9 @@ DIFFERENCES WITH LINUX
 - atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
   rcu_assign_pointer.  They take a _pointer_ to the variable being accessed.
 
-- call_rcu is a macro that has an extra argument (the name of the first
-  field in the struct, which must be a struct rcu_head), and expects the
+- call_rcu_first_elem is a macro that has an extra argument (the name of the
+  first field in the struct, which must be a struct rcu_head), and expects the
   type of the callback's argument to be the type of the first argument.
-  call_rcu1 is the same as Linux's call_rcu.
 
 
 RCU PATTERNS
diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 068a279..492d943 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -126,13 +126,13 @@ struct rcu_head {
 RCUCBFunc *func;
 };
 
-extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func);
+extern void call_rcu(struct rcu_head *head, RCUCBFunc *func);
 
 /* The operands of the minus operator must have the same type,
  * which must be the one that we specify in the cast.
  */
-#define call_rcu(head, func, field)  \
-call_rcu1(({ \
+#define call_rcu_first_elem(head, func, field)   \
+call_rcu(({  \
  char __attribute__((unused))\
 offset_must_be_zero[-offsetof(typeof(*(head)), field)],  \
func_type_invalid = (func) - (void (*)(typeof(head)))(func); \
 &(head)->field; \
  }), \
  (RCUCBFunc *)(func))
diff --git a/memory.c b/memory.c
index 9b91243..dc5e4e9 100644
--- a/memory.c
+++ b/memory.c
@@ -755,7 +755,7 @@ static void address_space_update_topology(AddressSpace *as)
 
 /* Writes are protected by the BQL.  */
 atomic_rcu_set(&as->current_map, new_view);
-call_rcu(old_view, flatview_unref, rcu);
+call_rcu_first_elem(old_view, flatview_unref, rcu);
 
 /* Note that all the old MemoryRegions are still alive up to this
  * point.  This relieves most MemoryListeners from the need to
@@ -1983,7 +1983,7 @@ void address_space_destroy(AddressSpace *as)
  * entries that the guest should never use.  Wait for the old
  * values to expire before freeing the data.
  */
-call_rcu(as, do_address_space_destroy, rcu);
+call_rcu_first_elem(as, do_address_space_destroy, rcu);
 }
 
 bool io_mem_read(MemoryRegion *mr, hwaddr addr, uint64_t *pval, unsigned size)
diff --git a/util/rcu.c b/util/rcu.c
index c9c3e6e..309a6e4 100644
--- a/util/rcu.c

[Qemu-devel] [PATCH 2/3] rcu: use rcu_{dereference, assign_pointer} instead of atomic_rcu_{read, set}

2015-02-03 Thread Emilio G. Cota
This matches the semantics of liburcu.

Signed-off-by: Emilio G. Cota 
---
 docs/rcu.txt  | 33 +++--
 include/qemu/atomic.h | 35 ++-
 memory.c  |  6 +++---
 tests/rcutorture.c|  4 ++--
 4 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/docs/rcu.txt b/docs/rcu.txt
index 575f563..154b18a 100644
--- a/docs/rcu.txt
+++ b/docs/rcu.txt
@@ -127,13 +127,13 @@ The core RCU API is small:
 
 call_rcu_first_elem(foo_reclaim, g_free, rcu);
 
- typeof(*p) atomic_rcu_read(p);
+ typeof(p) rcu_dereference(p);
 
-atomic_rcu_read() is similar to atomic_mb_read(), but it makes
+rcu_dereference() is similar to atomic_mb_read(), but it makes
 some assumptions on the code that calls it.  This allows a more
 optimized implementation.
 
-atomic_rcu_read assumes that whenever a single RCU critical
+rcu_dereference assumes that whenever a single RCU critical
 section reads multiple shared data, these reads are either
 data-dependent or need no ordering.  This is almost always the
 case when using RCU, because read-side critical sections typically
@@ -141,7 +141,7 @@ The core RCU API is small:
 every update) until reaching a data structure of interest,
 and then read from there.
 
-RCU read-side critical sections must use atomic_rcu_read() to
+RCU read-side critical sections must use rcu_dereference() to
 read data, unless concurrent writes are prevented by another
 synchronization mechanism.
 
@@ -149,18 +149,18 @@ The core RCU API is small:
 data structure in a single direction, opposite to the direction
 in which the updater initializes it.
 
- void atomic_rcu_set(p, typeof(*p) v);
+ void rcu_assign_pointer(p, typeof(p) v);
 
-atomic_rcu_set() is also similar to atomic_mb_set(), and it also
+rcu_assign_pointer() is also similar to atomic_mb_set(), and it also
 makes assumptions on the code that calls it in order to allow a more
 optimized implementation.
 
-In particular, atomic_rcu_set() suffices for synchronization
+In particular, rcu_assign_pointer() suffices for synchronization
 with readers, if the updater never mutates a field within a
 data item that is already accessible to readers.  This is the
 case when initializing a new copy of the RCU-protected data
 structure; just ensure that initialization of *p is carried out
-before atomic_rcu_set() makes the data item visible to readers.
+before rcu_assign_pointer() makes the data item visible to readers.
 If this rule is observed, writes will happen in the opposite
 order as reads in the RCU read-side critical sections (or if
 there is just one update), and there will be no need for other
@@ -193,9 +193,6 @@ DIFFERENCES WITH LINUX
   programming; not allowing this would prevent upgrading an RCU read-side
   critical section to become an updater.
 
-- atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
-  rcu_assign_pointer.  They take a _pointer_ to the variable being accessed.
-
 - call_rcu_first_elem is a macro that has an extra argument (the name of the
   first field in the struct, which must be a struct rcu_head), and expects the
   type of the callback's argument to be the type of the first argument.
@@ -237,7 +234,7 @@ may be used as a restricted reference-counting mechanism.  For example,
 consider the following code fragment:
 
 rcu_read_lock();
-p = atomic_rcu_read(&foo);
+p = rcu_dereference(foo);
 /* do something with p. */
 rcu_read_unlock();
 
@@ -248,7 +245,7 @@ The write side looks simply like this (with appropriate locking):
 
 qemu_mutex_lock(&foo_mutex);
 old = foo;
-atomic_rcu_set(&foo, new);
+rcu_assign_pointer(foo, new);
 qemu_mutex_unlock(&foo_mutex);
 synchronize_rcu();
 free(old);
@@ -257,7 +254,7 @@ If the processing cannot be done purely within the critical section, it
 is possible to combine this idiom with a "real" reference count:
 
 rcu_read_lock();
-p = atomic_rcu_read(&foo);
+p = rcu_dereference(foo);
 foo_ref(p);
 rcu_read_unlock();
 /* do something with p. */
@@ -267,7 +264,7 @@ The write side can be like this:
 
 qemu_mutex_lock(&foo_mutex);
 old = foo;
-atomic_rcu_set(&foo, new);
+rcu_assign_pointer(foo, new);
 qemu_mutex_unlock(&foo_mutex);
 synchronize_rcu();
 foo_unref(old);
@@ -276,7 +273,7 @@ or with call_rcu:
 
 qemu_mutex_lock(&foo_mutex);
 old = foo;
-atomic_rcu_set(&foo, new);
+rcu_assign_pointer(foo, new);
 qemu_mutex_unlock(&foo_mutex);
 call_rcu(foo_unref, old, rcu);
 
@@ -355,7 +352,7 @@ Instead, we store the size of the array with the array itself:


Re: [Qemu-devel] [PATCH 2/3] rcu: use rcu_{dereference, assign_pointer} instead of atomic_rcu_{read, set}

2015-02-04 Thread Emilio G. Cota
On Wed, Feb 04, 2015 at 11:01:00 +0100, Paolo Bonzini wrote:
> On 03/02/2015 23:08, Emilio G. Cota wrote:
> > This matches the semantics of liburcu.
> 
> This is not necessary.  The two sets of macros are exactly the same, so
> it's okay to use atomic_rcu_read/write.

They're not exactly the same, otherwise the patch would be trivial.

The difference can be seen in the change to the macros' documentation:

On Tue, Feb 03, 2015 at 17:08:18 -0500, Emilio G. Cota wrote:
> This matches the semantics of liburcu.
> --- a/docs/rcu.txt
> +++ b/docs/rcu.txt
(snip)
> - typeof(*p) atomic_rcu_read(p);
> + typeof(p) rcu_dereference(p);
(snip)
> - void atomic_rcu_set(p, typeof(*p) v);
> + void rcu_assign_pointer(p, typeof(p) v);

The liburcu macros take a variable (usually a pointer, hence the
macros' names, but unsigned long is also common), not its address.

These changes require modifications to the calling code as well, e.g.
memory.c:

struct AddressSpace {
...
/* Accessed via RCU.  */
struct FlatView *current_map;
...
};
...
struct FlatView *view;
...
-view = atomic_rcu_read(&as->current_map);
+view = rcu_dereference(as->current_map);
...
-atomic_rcu_set(&as->current_map, new_view);
+rcu_assign_pointer(as->current_map, new_view);
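
A minimal side-by-side illustration of the difference (standalone
sketch, names hypothetical):

    struct FlatView *current_map;    /* RCU-protected pointer */

    /* QEMU's macros take the address of the protected variable... */
    view = atomic_rcu_read(&current_map);
    atomic_rcu_set(&current_map, new_view);

    /* ...whereas liburcu's take the variable itself. */
    view = rcu_dereference(current_map);
    rcu_assign_pointer(current_map, new_view);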

Thanks,

Emilio



Re: [Qemu-devel] [PATCH 0/3] rcu: add option to use upstream liburcu

2015-02-04 Thread Emilio G. Cota
On Wed, Feb 04, 2015 at 11:32:57 +0100, Paolo Bonzini wrote:
> On 03/02/2015 23:08, Emilio G. Cota wrote:
> > * The first two patches bring back the RCU API to exactly
> >   match that of liburcu.
> 
> Bringing over rcu_dereference/rcu_assign_pointer is unnecessary, I
> think.  The names do not conflict.

On Wed, Feb 04, 2015 at 18:31:55 +0100, Paolo Bonzini wrote:
> On 04/02/2015 18:25, Emilio G. Cota wrote:
> > They're not exactly the same, otherwise the patch would be trivial.
> 
> You're right, I was imprecise---I meant they are interoperable.  You can
> use atomic_rcu_read/write together with liburcu, you do not need to use
> the liburcu-provided rcu_dereference/rcu_assign_pointer.

It's true that they can coexist. I'm just not sold on having two ways
of doing the same thing. It would make more sense to me to only diverge
from the original API if there's a good reason to do so -- otherwise
an eventual merge (say after option a) discussed below) would be more
painful than necessary.

On Wed, Feb 04, 2015 at 11:32:57 +0100, Paolo Bonzini wrote:
> As to call_rcu/call_rcu1, I understand there is a problem.  Maybe call
> the QEMU versions rcu_call/rcu_call1?  Then you can add simple wrappers
> if you want to use liburcu-mb.

Again, adhering to the original API if possible makes more sense to me, to
prevent future confusion.

> > * The third patch adds a configure flag to choose from either
> >   liburcu or QEMU's RCU.
> > 
> > Apart from this, I wonder what to do about other valuable bits in
> > liburcu, particularly in liburcu-cds, which I'm using currently
> > off-tree.  I see three ways of eventually doing this:
> > 
> > a) Add Windows support in liburcu, thereby eliminating the
> >de facto fork in QEMU.
> 
> This would be certainly doable.
> 
> Note that liburcu is not widely packaged, so we would have to add it as
> a submodule.  What is the advantage of using liburcu?

Better testing, (eventually) less work, fewer bugs, no code duplication,
and the ability to just merge new features from upstream.

Essentially the usual reasons against forking a project.

> > b) Bring (fork) liburcu-cds to QEMU, just like liburcu-mb was.
> 
> To some extent this is what we're doing.  cds is very much inspired by
> the Linux kernel, but QEMU is already using both BSD (intrusive) and
> GLib (non-intrusive) lists, and I didn't like the idea of adding yet
> another API.  I like the simplicity of the Linux hlist/list APIs, but
> two kinds of lists are already one too many.

Agreed.

> So, part 2 of the RCU work has an API for RCU lists based on BSD lists
> (that QEMU is already using).  These are not the lock-free data
> structures available in CDS, just the usual RCU-based lists with
> blocking write side and wait-free read side.
> 
> QEMU has very limited support for (non-RCU-based) lock-free data
> structures in qemu/queue.h: see QSLIST_INSERT_HEAD_ATOMIC and
> QSLIST_MOVE_ATOMIC.  The call_rcu implementation in util/rcu.c is based
> on wfqueue from liburcu-cds, but it would not be hard to change it to
> use QSLIST_INSERT_HEAD_ATOMIC/QSLIST_MOVE_ATOMIC instead.  In both cases
> the data structure is multi-producer/single-consumer.
> 
> 
> QEMU hardly uses hash tables at all.

I understand there's currently not much demand for these. I think however
they might become valuable in the future, provided we end up having
a multi-threaded TCG.

> Another thing to note is that I didn't envision a huge use of RCU in
> QEMU; I'm employing it in decidedly central places where it can bring
> great benefit, but I wouldn't be surprised if RCU only found a handful
> of users in the end.

If RCU's history in the Linux kernel is any guide, chances are RCU
will end up being used in more places than one would initially guess:

  http://www2.rdrop.com/~paulmck/techreports/RCUUsage.2013.02.24a.pdf

I think the same reasoning is likely to apply to the concurrent data
structures in liburcu-cds.

> Coarse locks with very low contention (such as AioContext) have great
> performance, and most data structures in QEMU fit this model: data
> structures for separate devices are more or less independent, and the
> lock is only needed for the rare cases when the main thread (for example
> the monitor) interacts with the device.

Agreed, this has worked well so far.

> > c) Add a compile-time flag (say CONFIG_LIBURCU_CDS), and then only
> >use data structures from liburcu-cds where appropriate, falling
> >back to traditional locked structures when !CONFIG_LIBURCU_CDS.
> > 
> > Would c) be acceptable for upstream, provided the gains (say in
> > scalability/memory footprint) are significant?
> 
> I think if there were a kille

Re: [Qemu-devel] [PATCH 0/3] rcu: add option to use upstream liburcu

2015-02-04 Thread Emilio G. Cota
On Wed, Feb 04, 2015 at 22:17:44 +0100, Paolo Bonzini wrote:
> > What I'm investigating now is how to do this in a manner that is palatable
> > to upstream. For this as it's well known we need a multi-threaded TCG,
> > and I believe quite a few bits from liburcu(-cds) might help to
> > get there.
> 
> Are you using (or planning to use) a shared translated code cache, or a
> separate cache for each CPU?

Hoping to try both in the end--currently I'm far from trying either,
got quite a few things to fix before that.

Cheers,

Emilio




[Qemu-devel] [PATCH] translate-all: remove superfluous #ifdef FOO || 1

2015-03-20 Thread Emilio G. Cota
It always evaluates to true.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 9f47ce7..11763c6 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1334,8 +1334,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 p->first_tb = (TranslationBlock *)((uintptr_t)tb | n);
 invalidate_page_bitmap(p);
 
-#if defined(TARGET_HAS_SMC) || 1
-
 #if defined(CONFIG_USER_ONLY)
 if (p->flags & PAGE_WRITE) {
 target_ulong addr;
@@ -1371,8 +1369,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 tlb_protect_code(page_addr);
 }
 #endif
-
-#endif /* TARGET_HAS_SMC */
 }
 
 /* add a new TB and link it to the physical page tables. phys_page2 is
-- 
1.9.1




[Qemu-devel] [PATCH] translate-all: use bitmap helpers for PageDesc's bitmap

2015-03-20 Thread Emilio G. Cota
Note that this test

    if (b & ((1 << len) - 1))

can be simplified to

    if (b & 1)

since we know that iff the first bit of a tb is set,
all other bits from that tb are set too.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 39 +++++----------------------------------
 1 file changed, 5 insertions(+), 34 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 11763c6..749d0f6 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -59,6 +59,7 @@
 
 #include "exec/cputlb.h"
 #include "translate-all.h"
+#include "qemu/bitmap.h"
 #include "qemu/timer.h"
 
 //#define DEBUG_TB_INVALIDATE
@@ -79,7 +80,7 @@ typedef struct PageDesc {
 /* in order to optimize self modifying code, we count the number
of lookups we do to a given page to use a bitmap */
 unsigned int code_write_count;
-uint8_t *code_bitmap;
+unsigned long *code_bitmap;
 #if defined(CONFIG_USER_ONLY)
 unsigned long flags;
 #endif
@@ -978,39 +979,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
 }
 
-static inline void set_bits(uint8_t *tab, int start, int len)
-{
-int end, mask, end1;
-
-end = start + len;
-tab += start >> 3;
-mask = 0xff << (start & 7);
-if ((start & ~7) == (end & ~7)) {
-if (start < end) {
-mask &= ~(0xff << (end & 7));
-*tab |= mask;
-}
-} else {
-*tab++ |= mask;
-start = (start + 8) & ~7;
-end1 = end & ~7;
-while (start < end1) {
-*tab++ = 0xff;
-start += 8;
-}
-if (start < end) {
-mask = ~(0xff << (end & 7));
-*tab |= mask;
-}
-}
-}
-
 static void build_page_bitmap(PageDesc *p)
 {
 int n, tb_start, tb_end;
 TranslationBlock *tb;
 
-p->code_bitmap = g_malloc0(TARGET_PAGE_SIZE / 8);
+p->code_bitmap = bitmap_new(TARGET_PAGE_SIZE);
 
 tb = p->first_tb;
 while (tb != NULL) {
@@ -1029,7 +1003,7 @@ static void build_page_bitmap(PageDesc *p)
 tb_start = 0;
 tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
 }
-set_bits(p->code_bitmap, tb_start, tb_end - tb_start);
+bitmap_set(p->code_bitmap, tb_start, tb_end - tb_start);
 tb = tb->page_next[n];
 }
 }
@@ -1219,7 +1193,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 {
 PageDesc *p;
-int offset, b;
 
 #if 0
 if (1) {
@@ -1235,9 +1208,7 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 return;
 }
 if (p->code_bitmap) {
-offset = start & ~TARGET_PAGE_MASK;
-b = p->code_bitmap[offset >> 3] >> (offset & 7);
-if (b & ((1 << len) - 1)) {
+if (test_bit(start & ~TARGET_PAGE_MASK, p->code_bitmap)) {
 goto do_invalidate;
 }
 } else {
-- 
1.9.1




[Qemu-devel] [PATCH] tcg: pack TCGTemp to reduce size by 8 bytes

2015-03-20 Thread Emilio G. Cota
This brings down the size of the struct from 56 to 48 bytes.

Signed-off-by: Emilio G. Cota 
---
 tcg/tcg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..3276924 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -429,8 +429,8 @@ typedef struct TCGTemp {
 int val_type;
 int reg;
 tcg_target_long val;
-int mem_reg;
 intptr_t mem_offset;
+int mem_reg;
 unsigned int fixed_reg:1;
 unsigned int mem_coherent:1;
 unsigned int mem_allocated:1;
-- 
1.9.1




[Qemu-devel] [PATCH] target-i386: remove superfluous TARGET_HAS_SMC macro

2015-03-21 Thread Emilio G. Cota
Suggested-by: Paolo Bonzini 
Signed-off-by: Emilio G. Cota 
---
 target-i386/cpu.h | 2 --
 translate-all.c   | 4 ----
 2 files changed, 6 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 15db6d7..4ee12ca 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -31,8 +31,6 @@
 /* Maximum instruction code size */
 #define TARGET_MAX_INSN_SIZE 16
 
-/* target supports implicit self modifying code */
-#define TARGET_HAS_SMC
 /* support for self modifying code even if the modified instruction is
close to the modifying instruction */
 #define TARGET_HAS_PRECISE_SMC
diff --git a/translate-all.c b/translate-all.c
index 9f47ce7..11763c6 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1334,8 +1334,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 p->first_tb = (TranslationBlock *)((uintptr_t)tb | n);
 invalidate_page_bitmap(p);
 
-#if defined(TARGET_HAS_SMC) || 1
-
 #if defined(CONFIG_USER_ONLY)
 if (p->flags & PAGE_WRITE) {
 target_ulong addr;
@@ -1371,8 +1369,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 tlb_protect_code(page_addr);
 }
 #endif
-
-#endif /* TARGET_HAS_SMC */
 }
 
 /* add a new TB and link it to the physical page tables. phys_page2 is
-- 
1.9.1




[Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp

2015-03-25 Thread Emilio G. Cota
This brings down the size of the struct from 56 to 32 bytes on 64-bit,
and to 16 bytes on 32-bit.

The appended adds macros to prevent us from mistakenly overflowing
the bitfields when more elements are added to the corresponding
enums/macros.

Note that reg/mem_reg need only 6 bits (for ia64), but for performance
it is probably better to align them to a byte boundary.

Given that TCGTemp is used in large arrays this leads to a few KBs
of savings. However, unpacking the bits takes additional code, so
the net effect depends on the target (host is x86_64):

Before:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  41131   29800      88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
  37969   29416      96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
  39354   28816      96   68266   10aaa ./arm-linux-user/tcg/tcg.o
  40802   29096      88   69986   11162 ./arm-softmmu/tcg/tcg.o
  39417   29672      88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o

After:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  41187   29800      88   71075   115a3 ./aarch64-softmmu/tcg/tcg.o
  37777   29416      96   67289   106d9 ./x86_64-linux-user/tcg/tcg.o
  39162   28816      96   68074   109ea ./arm-linux-user/tcg/tcg.o
  40858   29096      88   70042   1119a ./arm-softmmu/tcg/tcg.o
  39473   29672      88   69233   10e71 ./x86_64-softmmu/tcg/tcg.o

Suggested-by: Stefan Weil 
Suggested-by: Richard Henderson 
Signed-off-by: Emilio G. Cota 
---
 tcg/tcg.h | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..71ae7b2 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -193,7 +193,7 @@ typedef struct TCGPool {
 typedef enum TCGType {
 TCG_TYPE_I32,
 TCG_TYPE_I64,
-TCG_TYPE_COUNT, /* number of different types */
+TCG_TYPE_COUNT, /* number of different types, see TCG_TYPE_NR_BITS */
 
 /* An alias for the size of the host register.  */
 #if TCG_TARGET_REG_BITS == 32
@@ -217,6 +217,9 @@ typedef enum TCGType {
 #endif
 } TCGType;
 
+/* used for bitfield packing to save space */
+#define TCG_TYPE_NR_BITS 1
+
 /* Constants for qemu_ld and qemu_st for the Memory Operation field.  */
 typedef enum TCGMemOp {
 MO_8 = 0,
@@ -421,16 +424,14 @@ static inline TCGCond tcg_high_cond(TCGCond c)
 #define TEMP_VAL_REG   1
 #define TEMP_VAL_MEM   2
 #define TEMP_VAL_CONST 3
+#define TEMP_VAL_NR_BITS 2
 
-/* XXX: optimize memory layout */
 typedef struct TCGTemp {
-TCGType base_type;
-TCGType type;
-int val_type;
-int reg;
-tcg_target_long val;
-int mem_reg;
-intptr_t mem_offset;
+unsigned int reg:8;
+unsigned int mem_reg:8;
+unsigned int val_type:TEMP_VAL_NR_BITS;
+unsigned int base_type:TCG_TYPE_NR_BITS;
+unsigned int type:TCG_TYPE_NR_BITS;
 unsigned int fixed_reg:1;
 unsigned int mem_coherent:1;
 unsigned int mem_allocated:1;
@@ -438,6 +439,9 @@ typedef struct TCGTemp {
   basic blocks. Otherwise, it is not
   preserved across basic blocks. */
 unsigned int temp_allocated:1; /* never used for code gen */
+
+tcg_target_long val;
+intptr_t mem_offset;
 const char *name;
 } TCGTemp;
 
-- 
1.9.1




Re: [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp

2015-03-27 Thread Emilio G. Cota
On Fri, Mar 27, 2015 at 09:55:03 +, Alex Bennée wrote:
> Have you been able to measure any performance improvement with these new
> structures? In theory, if aligned with cache lines, performance should
> improve but real numbers would be nice.

I haven't benchmarked anything, which makes me very uneasy. All
I've checked is that the system boots, and FWIW I notice no
difference in boot time.

Is there a benchmark suite to test TCG changes?

Until proper benchmarking is done I wouldn't want to see this merged.
For now I propose merging the initial change (removing the 8-byte hole
on 64-bit), which is uncontroversial.

> > The appended adds macros to prevent us from mistakenly overflowing
> > the bitfields when more elements are added to the corresponding
> > enums/macros.
> 
> I can see the defines but I can't see any checks. Should we be able to
> do a compile time check if TCG_TYPE_COUNT doesn't fit into
> TCG_TYPE_NR_BITS?

> > +#define TEMP_VAL_NR_BITS 2
> 
> A similar compile time check could be added here.

Ack, addressed below.

On Fri, Mar 27, 2015 at 07:58:06 -0700, Richard Henderson wrote:
> On 03/25/2015 12:50 PM, Emilio G. Cota wrote:
> > +#define TCG_TYPE_NR_BITS 1
> 
> I'd rather you moved TCG_TYPE_COUNT out of the enum and into a define.  
> Perhaps
> even as (1 << TCG_TYPE_NR_BITS).
(snip)
> > +#define TEMP_VAL_NR_BITS 2
> 
> And make this an enumeration.
> 
> >  typedef struct TCGTemp {
(snip)
> > +unsigned int base_type:TCG_TYPE_NR_BITS;
> > +unsigned int type:TCG_TYPE_NR_BITS;
> 
> And do *not* change these from the enumeration to an unsigned int.
> 
> I know why you did this -- to keep the compiler from warning that the TCGType
> enum didn't fit in the bitfield, because of TCG_TYPE_COUNT being an 
> enumerator,
> rather than an unrelated number.  Except that's exactly the warning we want to
> keep, on the off-chance that someone modifies the enums without modifying the
> _NR_BITS defines.

Agreed, please see below.

Thanks,

E.

[No signoff due to lack of provable perf improvement, see above.]

diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..afd3f94 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -193,7 +193,6 @@ typedef struct TCGPool {
 typedef enum TCGType {
 TCG_TYPE_I32,
 TCG_TYPE_I64,
-TCG_TYPE_COUNT, /* number of different types */
 
 /* An alias for the size of the host register.  */
 #if TCG_TARGET_REG_BITS == 32
@@ -217,6 +216,10 @@ typedef enum TCGType {
 #endif
 } TCGType;
 
+/* used for bitfield packing to save space */
+#define TCG_TYPE_NR_BITS   1
+#define TCG_TYPE_COUNT BIT(TCG_TYPE_NR_BITS)
+
 /* Constants for qemu_ld and qemu_st for the Memory Operation field.  */
 typedef enum TCGMemOp {
 MO_8 = 0,
@@ -417,20 +420,21 @@ static inline TCGCond tcg_high_cond(TCGCond c)
 }
 }
 
-#define TEMP_VAL_DEAD  0
-#define TEMP_VAL_REG   1
-#define TEMP_VAL_MEM   2
-#define TEMP_VAL_CONST 3
+typedef enum TCGTempVal {
+TEMP_VAL_DEAD,
+TEMP_VAL_REG,
+TEMP_VAL_MEM,
+TEMP_VAL_CONST,
+} TCGTempVal;
+
+#define TEMP_VAL_NR_BITS 2
 
-/* XXX: optimize memory layout */
 typedef struct TCGTemp {
-TCGType base_type;
-TCGType type;
-int val_type;
-int reg;
-tcg_target_long val;
-int mem_reg;
-intptr_t mem_offset;
+unsigned int reg:8;
+unsigned int mem_reg:8;
+TCGTempVal val_type:TEMP_VAL_NR_BITS;
+TCGType base_type:TCG_TYPE_NR_BITS;
+TCGType type:TCG_TYPE_NR_BITS;
 unsigned int fixed_reg:1;
 unsigned int mem_coherent:1;
 unsigned int mem_allocated:1;
@@ -438,6 +442,9 @@ typedef struct TCGTemp {
   basic blocks. Otherwise, it is not
   preserved across basic blocks. */
 unsigned int temp_allocated:1; /* never used for code gen */
+
+tcg_target_long val;
+intptr_t mem_offset;
 const char *name;
 } TCGTemp;
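
As for the compile-time check mentioned above, one way to do it (a
sketch, not part of the diff; QEMU_BUILD_BUG_ON already exists in
include/qemu/compiler.h) would be:

    /* Fail the build if an enumerator stops fitting its bitfield. */
    QEMU_BUILD_BUG_ON(TCG_TYPE_I64 >= (1 << TCG_TYPE_NR_BITS));
    QEMU_BUILD_BUG_ON(TEMP_VAL_CONST >= (1 << TEMP_VAL_NR_BITS));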
 



[Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp

2015-04-02 Thread Emilio G. Cota
This brings down the size of the struct from 56 to 32 bytes on 64-bit,
and to 20 bytes on 32-bit. This leads to memory savings:

Before:
$ find . -name 'tcg.o' | xargs size
   textdata bss dec hex filename
  41131   29800  88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
  37969   29416  96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
  39354   28816  96   68266   10aaa ./arm-linux-user/tcg/tcg.o
  40802   29096  88   69986   11162 ./arm-softmmu/tcg/tcg.o
  39417   29672  88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o

After:
$ find . -name 'tcg.o' | xargs size
   textdata bss dec hex filename
  40883   29800  88   70771   11473 ./aarch64-softmmu/tcg/tcg.o
  37473   29416  96   66985   105a9 ./x86_64-linux-user/tcg/tcg.o
  38858   28816  96   67770   108ba ./arm-linux-user/tcg/tcg.o
  40554   29096  88   69738   1106a ./arm-softmmu/tcg/tcg.o
  39169   29672  88   68929   10d41 ./x86_64-softmmu/tcg/tcg.o

Note that using an entire byte for some enums that need less than
that wastes a few bits (noticeable on 32-bit, where we use
20 bytes instead of 16) but avoids extraction code, which overall
is a win--I've tested several variations of the patch, and the appended
is the best performer for OpenSSL's bntest by a very small margin:

Before:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
[...]
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):

      10538.479833 task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.38% )
               772 context-switches         #    0.073 K/sec                    ( +-  2.03% )
                 0 cpu-migrations            #    0.000 K/sec                   ( +-100.00% )
             2,207 page-faults               #    0.209 K/sec                   ( +-  0.08% )

      10.552871687 seconds time elapsed                                         ( +-  0.39% )

After:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):

      10459.968847 task-clock (msec)        #    0.999 CPUs utilized            ( +-  0.30% )
               739 context-switches         #    0.071 K/sec                    ( +-  1.71% )
                 0 cpu-migrations            #    0.000 K/sec                   ( +- 68.14% )
             2,204 page-faults               #    0.211 K/sec                   ( +-  0.10% )

      10.473900411 seconds time elapsed                                         ( +-  0.30% )

Suggested-by: Stefan Weil 
Suggested-by: Richard Henderson 
Signed-off-by: Emilio G. Cota 
---
 tcg/tcg.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..7f95132 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
 }
 }
 
-#define TEMP_VAL_DEAD  0
-#define TEMP_VAL_REG   1
-#define TEMP_VAL_MEM   2
-#define TEMP_VAL_CONST 3
+typedef enum TCGTempVal {
+TEMP_VAL_DEAD,
+TEMP_VAL_REG,
+TEMP_VAL_MEM,
+TEMP_VAL_CONST,
+} TCGTempVal;
 
-/* XXX: optimize memory layout */
 typedef struct TCGTemp {
-TCGType base_type;
-TCGType type;
-int val_type;
-int reg;
-tcg_target_long val;
-int mem_reg;
-intptr_t mem_offset;
+unsigned int reg:8;
+unsigned int mem_reg:8;
+TCGTempVal val_type:8;
+TCGType base_type:8;
+TCGType type:8;
 unsigned int fixed_reg:1;
 unsigned int mem_coherent:1;
 unsigned int mem_allocated:1;
@@ -438,6 +437,9 @@ typedef struct TCGTemp {
   basic blocks. Otherwise, it is not
   preserved across basic blocks. */
 unsigned int temp_allocated:1; /* never used for code gen */
+
+tcg_target_long val;
+intptr_t mem_offset;
 const char *name;
 } TCGTemp;
 
-- 
1.9.1




Re: [Qemu-devel] [PATCH] translate-all: use bitmap helpers for PageDesc's bitmap

2015-04-02 Thread Emilio G. Cota
On Sat, Mar 21, 2015 at 02:25:42 -0400, Emilio G. Cota wrote:
> Note that this test
>   if (b & ((1 << len) - 1))
> can be simplified to
>   if (b & 1)
> since we know that iff the first bit of a tb is set,
> all other bits from that tb are set too.
> 
> Signed-off-by: Emilio G. Cota 
> ---
>  translate-all.c | 39 +--
>  1 file changed, 5 insertions(+), 34 deletions(-)

*ping*
  http://patchwork.ozlabs.org/patch/452907/

Any takers? get_maintainer.pl doesn't suggest anyone to send this to.

This gets rid of an unnecessary (and hairy) function: set_bits().
I even took the time to double-check with a test program that
this function is indeed equivalent to bitmap_set().
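
For reference, a standalone sketch of such an equivalence check (not
the original test program; it compares the open-coded set_bits()
against a naive per-bit reference over a page-sized bitmap):

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_BITS 4096

    static void set_bits(uint8_t *tab, int start, int len)
    {
        int end, mask, end1;

        end = start + len;
        tab += start >> 3;
        mask = 0xff << (start & 7);
        if ((start & ~7) == (end & ~7)) {
            if (start < end) {
                mask &= ~(0xff << (end & 7));
                *tab |= mask;
            }
        } else {
            *tab++ |= mask;
            start = (start + 8) & ~7;
            end1 = end & ~7;
            while (start < end1) {
                *tab++ = 0xff;
                start += 8;
            }
            if (start < end) {
                mask = ~(0xff << (end & 7));
                *tab |= mask;
            }
        }
    }

    /* naive reference: set bits [start, start + len) one at a time */
    static void set_bits_ref(uint8_t *tab, int start, int len)
    {
        int i;

        for (i = start; i < start + len; i++) {
            tab[i >> 3] |= 1u << (i & 7);
        }
    }

    int main(void)
    {
        uint8_t a[PAGE_BITS / 8], b[PAGE_BITS / 8];
        int iter;

        srand(42);
        for (iter = 0; iter < 100000; iter++) {
            int start = rand() % PAGE_BITS;
            int len = 1 + rand() % (PAGE_BITS - start);

            memset(a, 0, sizeof(a));
            memset(b, 0, sizeof(b));
            set_bits(a, start, len);
            set_bits_ref(b, start, len);
            assert(memcmp(a, b, sizeof(a)) == 0);
        }
        return 0;
    }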

Thanks,

Emilio



[Qemu-devel] [PATCH v2] translate-all: use glib for all page descriptor allocations

2015-04-09 Thread Emilio G. Cota
Since commit

  b7b5233a "bsd-user/mmap.c: Don't try to override g_malloc/g_free"

the exception we make here for usermode has been unnecessary.
Get rid of it.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 4d05898..c8f0e8c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -389,18 +389,6 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 void **lp;
 int i;
 
-#if defined(CONFIG_USER_ONLY)
-/* We can't use g_malloc because it may recurse into a locked mutex. */
-# define ALLOC(P, SIZE) \
-do {\
-P = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,\
- MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   \
-} while (0)
-#else
-# define ALLOC(P, SIZE) \
-do { P = g_malloc0(SIZE); } while (0)
-#endif
-
 /* Level 1.  Always allocated.  */
 lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
@@ -412,7 +400,7 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(p, sizeof(void *) * V_L2_SIZE);
+p = g_malloc0(sizeof(void *) * V_L2_SIZE);
 *lp = p;
 }
 
@@ -424,12 +412,10 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
+pd = g_malloc0(sizeof(PageDesc) * V_L2_SIZE);
 *lp = pd;
 }
 
-#undef ALLOC
-
 return pd + (index & (V_L2_SIZE - 1));
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH] translate-all: use g_malloc0 for all page descriptor allocations

2015-04-09 Thread Emilio G. Cota
Since commit

  b7b5233a "bsd-user/mmap.c: Don't try to override g_malloc/g_free"

the exception we make here for usermode has been unnecessary.
Get rid of it.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 11763c6..3a6d116 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -389,18 +389,6 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 void **lp;
 int i;
 
-#if defined(CONFIG_USER_ONLY)
-/* We can't use g_malloc because it may recurse into a locked mutex. */
-# define ALLOC(P, SIZE) \
-do {\
-P = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,\
- MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   \
-} while (0)
-#else
-# define ALLOC(P, SIZE) \
-do { P = g_malloc0(SIZE); } while (0)
-#endif
-
 /* Level 1.  Always allocated.  */
 lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
@@ -412,7 +400,7 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(p, sizeof(void *) * V_L2_SIZE);
+p = g_malloc0(sizeof(void *) * V_L2_SIZE);
 *lp = p;
 }
 
@@ -424,12 +412,10 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
+pd = g_malloc0(sizeof(PageDesc) * V_L2_SIZE);
 *lp = pd;
 }
 
-#undef ALLOC
-
 return pd + (index & (V_L2_SIZE - 1));
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH v3] translate-all: use glib for all page descriptor allocations

2015-04-09 Thread Emilio G. Cota
Since commit

  b7b5233a "bsd-user/mmap.c: Don't try to override g_malloc/g_free"

the exception we make here for usermode has been unnecessary.
Get rid of it.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 4d05898..acf792c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -389,18 +389,6 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 void **lp;
 int i;
 
-#if defined(CONFIG_USER_ONLY)
-/* We can't use g_malloc because it may recurse into a locked mutex. */
-# define ALLOC(P, SIZE) \
-do {\
-P = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,\
- MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   \
-} while (0)
-#else
-# define ALLOC(P, SIZE) \
-do { P = g_malloc0(SIZE); } while (0)
-#endif
-
 /* Level 1.  Always allocated.  */
 lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
@@ -412,7 +400,7 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(p, sizeof(void *) * V_L2_SIZE);
+p = g_new0(void *, V_L2_SIZE);
 *lp = p;
 }
 
@@ -424,12 +412,10 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 if (!alloc) {
 return NULL;
 }
-ALLOC(pd, sizeof(PageDesc) * V_L2_SIZE);
+pd = g_new0(PageDesc, V_L2_SIZE);
 *lp = pd;
 }
 
-#undef ALLOC
-
 return pd + (index & (V_L2_SIZE - 1));
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH] translate-all: remove redundant page_find from tb_invalidate_phys_page

2015-04-08 Thread Emilio G. Cota
The callers have just looked up the page descriptor, so there's no
point in searching again for it.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 11763c6..4d05898 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1246,14 +1246,13 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 }
 }
 
 #if !defined(CONFIG_SOFTMMU)
-static void tb_invalidate_phys_page(tb_page_addr_t addr,
+static void tb_invalidate_phys_page(PageDesc *p, tb_page_addr_t addr,
 uintptr_t pc, void *puc,
 bool locked)
 {
 TranslationBlock *tb;
-PageDesc *p;
 int n;
 #ifdef TARGET_HAS_PRECISE_SMC
 TranslationBlock *current_tb = NULL;
 CPUState *cpu = current_cpu;
@@ -1264,12 +1263,8 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
 int current_flags = 0;
 #endif
 
 addr &= TARGET_PAGE_MASK;
-p = page_find(addr >> TARGET_PAGE_BITS);
-if (!p) {
-return;
-}
 tb = p->first_tb;
 #ifdef TARGET_HAS_PRECISE_SMC
 if (tb && pc != 0) {
 current_tb = tb_find_pc(pc);
@@ -1817,9 +1812,9 @@ void page_set_flags(target_ulong start, target_ulong end, int flags)
the code inside.  */
 if (!(p->flags & PAGE_WRITE) &&
 (flags & PAGE_WRITE) &&
 p->first_tb) {
-tb_invalidate_phys_page(addr, 0, NULL, false);
+tb_invalidate_phys_page(p, addr, 0, NULL, false);
 }
 p->flags = flags;
 }
 }
@@ -1911,9 +1906,9 @@ int page_unprotect(target_ulong address, uintptr_t pc, void *puc)
 prot |= p->flags;
 
 /* and since the content will be modified, we must invalidate
the corresponding translated code. */
-tb_invalidate_phys_page(addr, pc, puc, true);
+tb_invalidate_phys_page(p, addr, pc, puc, true);
 #ifdef DEBUG_TB_CHECK
 tb_invalidate_check(addr);
 #endif
 }
-- 
1.9.1




Re: [Qemu-devel] [PATCH] translate-all: use bitmap helpers for PageDesc's bitmap

2015-04-22 Thread Emilio G. Cota
On Wed, Apr 22, 2015 at 15:13:05 +0200, Paolo Bonzini wrote:
> On 21/03/2015 07:25, Emilio G. Cota wrote:
> > Note that this test
> > if (b & ((1 << len) - 1))
> > can be simplified to
> > if (b & 1)
> > since we know that iff the first bit of a tb is set,
> > all other bits from that tb are set too.
> 
> I don't think this optimization is valid, unfortunately.  It is possible
> that say a tb starts at 0x12340001 and tb_invalidate_phys_page_fast is
> called for the range starting at 0x1234 and ending at 0x12340003.
> 
> In this case you need the full test that was used before.

Good catch, haven't seen this in my tests but it could possibly happen.

v2 coming up.

Thanks,

Emilio



[Qemu-devel] [PATCH v2] translate-all: use bitmap helpers for PageDesc's bitmap

2015-04-22 Thread Emilio G. Cota
Here we have an open-coded byte-based bitmap implementation.
Get rid of it since there's a ulong-based implementation to be
used by all code.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 40 +++++++---------------------------------
 1 file changed, 7 insertions(+), 33 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index acf792c..9592a3b 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -59,6 +59,7 @@
 
 #include "exec/cputlb.h"
 #include "translate-all.h"
+#include "qemu/bitmap.h"
 #include "qemu/timer.h"
 
 //#define DEBUG_TB_INVALIDATE
@@ -79,7 +80,7 @@ typedef struct PageDesc {
 /* in order to optimize self modifying code, we count the number
of lookups we do to a given page to use a bitmap */
 unsigned int code_write_count;
-uint8_t *code_bitmap;
+unsigned long *code_bitmap;
 #if defined(CONFIG_USER_ONLY)
 unsigned long flags;
 #endif
@@ -964,39 +965,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
 }
 
-static inline void set_bits(uint8_t *tab, int start, int len)
-{
-int end, mask, end1;
-
-end = start + len;
-tab += start >> 3;
-mask = 0xff << (start & 7);
-if ((start & ~7) == (end & ~7)) {
-if (start < end) {
-mask &= ~(0xff << (end & 7));
-*tab |= mask;
-}
-} else {
-*tab++ |= mask;
-start = (start + 8) & ~7;
-end1 = end & ~7;
-while (start < end1) {
-*tab++ = 0xff;
-start += 8;
-}
-if (start < end) {
-mask = ~(0xff << (end & 7));
-*tab |= mask;
-}
-}
-}
-
 static void build_page_bitmap(PageDesc *p)
 {
 int n, tb_start, tb_end;
 TranslationBlock *tb;
 
-p->code_bitmap = g_malloc0(TARGET_PAGE_SIZE / 8);
+p->code_bitmap = bitmap_new(TARGET_PAGE_SIZE);
 
 tb = p->first_tb;
 while (tb != NULL) {
@@ -1015,7 +989,7 @@ static void build_page_bitmap(PageDesc *p)
 tb_start = 0;
 tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
 }
-set_bits(p->code_bitmap, tb_start, tb_end - tb_start);
+bitmap_set(p->code_bitmap, tb_start, tb_end - tb_start);
 tb = tb->page_next[n];
 }
 }
@@ -1205,7 +1179,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 {
 PageDesc *p;
-int offset, b;
 
 #if 0
 if (1) {
@@ -1221,8 +1194,9 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 return;
 }
 if (p->code_bitmap) {
-offset = start & ~TARGET_PAGE_MASK;
-b = p->code_bitmap[offset >> 3] >> (offset & 7);
+int nr = start & ~TARGET_PAGE_MASK;
+int b = p->code_bitmap[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1));
+
 if (b & ((1 << len) - 1)) {
 goto do_invalidate;
 }
-- 
1.9.1




Re: [Qemu-devel] [PATCH v2] translate-all: use bitmap helpers for PageDesc's bitmap

2015-04-22 Thread Emilio G. Cota
On Wed, Apr 22, 2015 at 22:30:23 +0200, Paolo Bonzini wrote:
> On 22/04/2015 18:53, Emilio G. Cota wrote:
> > @@ -1221,8 +1194,9 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
> >  return;
> >  }
> >  if (p->code_bitmap) {
> > -offset = start & ~TARGET_PAGE_MASK;
> > -b = p->code_bitmap[offset >> 3] >> (offset & 7);
> > +int nr = start & ~TARGET_PAGE_MASK;
> 
> This should be unsigned, otherwise the compiler might not make
> BIT_WORD(nr) a simple right shift (maybe it can see that nr > 0 thanks
> to the AND, but in principle x / 32 is not the same as x >> 5 if x is
> signed).

OK. Note that all bit{map,ops} helpers take a long as the bit position,
and then do BIT_WORD() on it--I was following that convention wrt sign.

> > +int b = p->code_bitmap[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1));
> 
> And this should be unsigned long.

Used an int because we only care about the least significant bits (len <= 8):

> >  if (b & ((1 << len) - 1)) {
> >  goto do_invalidate;
> >  }

But yes, an unsigned long here is more appropriate.

Thanks,

Emilio



[Qemu-devel] [PATCH v3] translate-all: use bitmap helpers for PageDesc's bitmap

2015-04-22 Thread Emilio G. Cota
Here we have an open-coded byte-based bitmap implementation.
Get rid of it since there's a ulong-based implementation to be
used by all code.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 42 +++++++++++---------------------------------
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index acf792c..0004d70 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -59,6 +59,7 @@
 
 #include "exec/cputlb.h"
 #include "translate-all.h"
+#include "qemu/bitmap.h"
 #include "qemu/timer.h"
 
 //#define DEBUG_TB_INVALIDATE
@@ -79,7 +80,7 @@ typedef struct PageDesc {
 /* in order to optimize self modifying code, we count the number
of lookups we do to a given page to use a bitmap */
 unsigned int code_write_count;
-uint8_t *code_bitmap;
+unsigned long *code_bitmap;
 #if defined(CONFIG_USER_ONLY)
 unsigned long flags;
 #endif
@@ -964,39 +965,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
 }
 
-static inline void set_bits(uint8_t *tab, int start, int len)
-{
-int end, mask, end1;
-
-end = start + len;
-tab += start >> 3;
-mask = 0xff << (start & 7);
-if ((start & ~7) == (end & ~7)) {
-if (start < end) {
-mask &= ~(0xff << (end & 7));
-*tab |= mask;
-}
-} else {
-*tab++ |= mask;
-start = (start + 8) & ~7;
-end1 = end & ~7;
-while (start < end1) {
-*tab++ = 0xff;
-start += 8;
-}
-if (start < end) {
-mask = ~(0xff << (end & 7));
-*tab |= mask;
-}
-}
-}
-
 static void build_page_bitmap(PageDesc *p)
 {
 int n, tb_start, tb_end;
 TranslationBlock *tb;
 
-p->code_bitmap = g_malloc0(TARGET_PAGE_SIZE / 8);
+p->code_bitmap = bitmap_new(TARGET_PAGE_SIZE);
 
 tb = p->first_tb;
 while (tb != NULL) {
@@ -1015,7 +989,7 @@ static void build_page_bitmap(PageDesc *p)
 tb_start = 0;
 tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
 }
-set_bits(p->code_bitmap, tb_start, tb_end - tb_start);
+bitmap_set(p->code_bitmap, tb_start, tb_end - tb_start);
 tb = tb->page_next[n];
 }
 }
@@ -1205,7 +1179,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 {
 PageDesc *p;
-int offset, b;
 
 #if 0
 if (1) {
@@ -1221,8 +1194,11 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 return;
 }
 if (p->code_bitmap) {
-offset = start & ~TARGET_PAGE_MASK;
-b = p->code_bitmap[offset >> 3] >> (offset & 7);
+unsigned int nr;
+unsigned long b;
+
+nr = start & ~TARGET_PAGE_MASK;
+b = p->code_bitmap[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1));
 if (b & ((1 << len) - 1)) {
 goto do_invalidate;
 }
-- 
1.9.1




[Qemu-devel] [PATCH] i440fx-test: remove ARRAY_SIZE redefinition

2015-04-26 Thread Emilio G. Cota
It's defined in osdep.h and shouldn't be redefined here.

Signed-off-by: Emilio G. Cota 
---
 tests/i440fx-test.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tests/i440fx-test.c b/tests/i440fx-test.c
index d0bc8de..33a7ecb 100644
--- a/tests/i440fx-test.c
+++ b/tests/i440fx-test.c
@@ -27,8 +27,6 @@
 
 #define BROKEN 1
 
-#define ARRAY_SIZE(array) (sizeof(array) / sizeof((array)[0]))
-
 typedef struct TestData
 {
 int num_cpus;
-- 
1.9.1




Re: [Qemu-devel] [PATCH] i440fx-test: remove ARRAY_SIZE redefinition

2015-04-27 Thread Emilio G. Cota
On Sun, Apr 26, 2015 at 19:18:41 -0700, Peter Crosthwaite wrote:
> On Sun, Apr 26, 2015 at 3:04 PM, Emilio G. Cota  wrote:
> > It's defined in osdep.h and shouldn't be redefined here.
> >
> > Signed-off-by: Emilio G. Cota 
> 
> 
> Reviewed-by: Peter Crosthwaite 

Thanks for having a look!

> It looks like there may be another one too:
> 
> hw/audio/fmopl.c:#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

That one is protected by #ifndef ARRAY_SIZE and AFAICT the file does not
include any qemu headers, so we can safely leave it as is.

Emilio




qemu-devel@nongnu.org

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cpus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index e6dcae3..62d157a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1016,7 +1016,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 qemu_cond_signal(&qemu_cpu_cond);
 
 /* wait for initial kick-off after machine start */
-while (QTAILQ_FIRST(&cpus)->stopped) {
+while (first_cpu->stopped) {
 qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
 
 /* process any pending work */
-- 
1.9.1




[Qemu-devel] [PATCH 0/6] trivial fixes

2015-04-27 Thread Emilio G. Cota
Here is a hodge-podge of fixes for issues I found while reviewing
QTAILQ callers, with a .gitignore patch as a bonus.

Thanks,

Emilio





[Qemu-devel] [PATCH 6/6] linux-user/elfload: use QTAILQ_FOREACH instead of open-coding it

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 linux-user/elfload.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 399c021..0ba9706 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -2887,8 +2887,7 @@ static int write_note_info(struct elf_note_info *info, int fd)
 return (error);
 
 /* write prstatus for each thread */
-for (ets = info->thread_list.tqh_first; ets != NULL;
- ets = ets->ets_link.tqe_next) {
+QTAILQ_FOREACH(ets, &info->thread_list, ets_link) {
 if ((error = write_note(&ets->notes[0], fd)) != 0)
 return (error);
 }
-- 
1.9.1




[Qemu-devel] [PATCH 4/6] gitignore: ignore *.patch files

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index aed0e1f..025a841 100644
--- a/.gitignore
+++ b/.gitignore
@@ -62,6 +62,7 @@
 *.fn
 *.ky
 *.log
+*.patch
 *.pdf
 *.pod
 *.cps
-- 
1.9.1




[Qemu-devel] [PATCH 2/6] input: remove unused mouse_handlers list

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 ui/input-legacy.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/ui/input-legacy.c b/ui/input-legacy.c
index 2d4ca19..3e9bb38 100644
--- a/ui/input-legacy.c
+++ b/ui/input-legacy.c
@@ -57,8 +57,6 @@ struct QEMUPutLEDEntry {
 
 static QTAILQ_HEAD(, QEMUPutLEDEntry) led_handlers =
 QTAILQ_HEAD_INITIALIZER(led_handlers);
-static QTAILQ_HEAD(, QEMUPutMouseEntry) mouse_handlers =
-QTAILQ_HEAD_INITIALIZER(mouse_handlers);
 
 int index_from_key(const char *key)
 {
-- 
1.9.1




[Qemu-devel] [PATCH 3/6] qemu-char: remove unused list node from FDCharDriver

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 qemu-char.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/qemu-char.c b/qemu-char.c
index a405d76..d0c1564 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -973,7 +973,6 @@ typedef struct FDCharDriver {
 CharDriverState *chr;
 GIOChannel *fd_in, *fd_out;
 int max_size;
-QTAILQ_ENTRY(FDCharDriver) node;
 } FDCharDriver;
 
 /* Called with chr_write_lock held.  */
-- 
1.9.1




[Qemu-devel] [PATCH 5/6] coroutine: remove unnecessary parentheses in qemu_co_queue_empty

2015-04-27 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 qemu-coroutine-lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index e4860ae..6b49033 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -108,7 +108,7 @@ bool qemu_co_enter_next(CoQueue *queue)
 
 bool qemu_co_queue_empty(CoQueue *queue)
 {
-return (QTAILQ_FIRST(&queue->entries) == NULL);
+return QTAILQ_FIRST(&queue->entries) == NULL;
 }
 
 void qemu_co_mutex_init(CoMutex *mutex)
-- 
1.9.1




Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.

2015-04-27 Thread Emilio G. Cota
Hi Fred,

On Wed, Apr 22, 2015 at 14:26:14 +0200, Frederic Konrad wrote:
> git clone git@git.greensocs.com:fkonrad/mttcg.git -b multi_tcg_v4

I've tried to run buildroot's vexpress-a9 with this, but unfortunately
it gets stuck before showing much progress towards boot:
[ messages in brackets are mine ]

$ time arm-softmmu/qemu-system-arm -M vexpress-a9 -kernel img/arm/zImage \
-drive file=img/arm/rootfs.ext2,if=sd \
-append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net 
nic,model=lan9118 \
-net user -nographic -smp 1

WARNING: Image format was not specified for 'img/arm/rootfs.ext2' and probing guessed raw.
 Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
 Specify the 'raw' format explicitly to remove the restrictions.
tlb_flush: CPU 0 [ this 11 times more ]
audio: Could not init `oss' audio driver
tlb_flush: CPU 0 [ and this 31 times ]
[ then here it stops printing, and one thread takes 100% CPU ]

Note that I'm running with -smp 1. My guess is that the iothread
is starved, since patch 472f4003 "Drop global lock during TCG code execution"
removes from the iothread the ability to kick CPU threads.

Thanks,

Emilio



Re: [Qemu-devel] [PATCH 4/6] gitignore: ignore *.patch files

2015-04-27 Thread Emilio G. Cota
On Mon, Apr 27, 2015 at 12:28:00 -0600, Eric Blake wrote:
> On 04/27/2015 12:11 PM, Peter Crosthwaite wrote:
> > This Issue is discussed and I think it was concluded to not gitignore
> > patches. See:
> > 
> > commit f3a22014e94dfaacb57277dafce66b41cd994869
> > Author: Michael Tokarev 
> > Date:   Thu Jun 6 01:14:54 2013 +0400
> > 
> > gitignore: unignore *.patch
> 
> You can always ignore them locally:
> echo '*.patch' >> .git/info/exclude
> 
> without requiring everyone to ignore them (I actually like seeing them
> show up, to know when I have not yet hit send on a series).

Thanks for the tip!

All: please ignore this patch in the series.

Emilio



Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.

2015-04-28 Thread Emilio G. Cota
On Tue, Apr 28, 2015 at 11:06:37 +0200, Paolo Bonzini wrote:
> On 27/04/2015 19:06, Emilio G. Cota wrote:
> > Note that I'm running with -smp 1. My guess is that the iothread
> > is starved, since patch 472f4003 "Drop global lock during TCG code execution"
> > removes from the iothread the ability to kick CPU threads.
> 
> In theory that shouldn't be necessary anymore.  The CPU thread should
> only hold the global lock for very small periods of time, similar to KVM.

You're right.

I added printouts around qemu_global_mutex_lock/unlock
and also added printouts around the cond_wait's that take the
BQL. The vcpu goes quiet after a while:

[...]
softmmu_template.h:io_writel:387 UNLO tid 17633
qemu/cputlb.c:tlb_protect_code:196 LOCK tid 17633
cputlb.c:tlb_protect_code:199 UNLO tid 17633
cputlb.c:tlb_protect_code:196 LOCK tid 17633
cputlb.c:tlb_protect_code:199 UNLO tid 17633
cputlb.c:tlb_protect_code:196 LOCK tid 17633
cputlb.c:tlb_protect_code:199 UNLO tid 17633
softmmu_template.h:io_readl:160 LOCK tid 17633
softmmu_template.h:io_readl:165 UNLO tid 17633
main-loop.c:os_host_main_loop_wait:242 LOCK tid 17630
main-loop.c:os_host_main_loop_wait:234 UNLO tid 17630

.. And at this point the last pair of LOCK/UNLO goes indefinitely.

> Can you post a backtrace?

$ sudo gdb --pid=8919
(gdb) info threads
  Id   Target Id Frame 
  3    Thread 0x7596b700 (LWP 16204) "qemu-system-arm" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x70f69700 (LWP 16206) "qemu-system-arm" 0x733179fe in ?? ()
* 1    Thread 0x77fe4a80 (LWP 16203) "qemu-system-arm" 0x75e9b1ef in __GI_ppoll (fds=0x569b8f70, nfds=4, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
(gdb) bt
#0  0x75e9b1ef in __GI_ppoll (fds=0x569b8f70, nfds=4, timeout=<optimized out>, sigmask=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:56
#1  0x559a9e26 in qemu_poll_ns (fds=0x569b8f70, nfds=4, timeout=9689027) at qemu-timer.c:326
#2  0x559a8abb in os_host_main_loop_wait (timeout=9689027) at main-loop.c:239
#3  0x559a8bef in main_loop_wait (nonblocking=0) at main-loop.c:494
#4  0x5578c8c5 in main_loop () at vl.c:1803
#5  0x55794634 in main (argc=16, argv=0x7fffe828, envp=0x7fffe8b0) at vl.c:4371
(gdb) thread 3
[Switching to thread 3 (Thread 0x7596b700 (LWP 16204))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x55a3a061 in futex_wait (ev=0x56392724 <rcu_call_ready_event>, val=4294967295) at util/qemu-thread-posix.c:305
#2  0x55a3a20b in qemu_event_wait (ev=0x56392724 <rcu_call_ready_event>) at util/qemu-thread-posix.c:401
#3  0x55a5011d in call_rcu_thread (opaque=0x0) at util/rcu.c:231
#4  0x7617b182 in start_thread (arg=0x7596b700) at pthread_create.c:312
#5  0x75ea847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) thread 2
[Switching to thread 2 (Thread 0x70f69700 (LWP 16206))]
#0  0x733179fe in ?? ()
(gdb) bt
#0  0x733179fe in ?? ()
#1  0x55f0c200 in ?? ()
#2  0x7fffd40029a0 in ?? ()
#3  0x7fffd40029c0 in ?? ()
#4  0x13b33d1714a74c00 in ?? ()
#5  0x70f685c0 in ?? ()
#6  0x555f9da7 in tcg_out_reloc (s=<optimized out>, code_ptr=<optimized out>,
    type=<optimized out>, label_index=<optimized out>,
    addend=<optimized out>) at /local/home/cota/src/qemu/tcg/tcg.c:224
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) q

So it seems that the vcpu thread doesn't come out of the execution loop
from which that last io_readl was performed.

Emilio



Re: [Qemu-devel] [PATCH 1/8] tls: require compiler support for __thread

2015-04-28 Thread Emilio G. Cota
On Tue, Jan 13, 2015 at 22:07:19 +, Peter Maydell wrote:
> On 13 January 2015 at 20:27, Paolo Bonzini  wrote:
> > On 13/01/2015 21:00, Peter Maydell wrote:
> >> Hmm. That's a chunk of users who are now going to have to
> >> change the way they've been building QEMU. Does configure
> >> at least blow up on the old gcc, or will we just silently
> >> build a non-working QEMU?
> >
> > At least the build, if not configure, should blow up.
> 
> Yeah, it does:
> qemu-coroutine.c:29: error: thread-local storage not supported for this target
> (that's gcc 4.2.1 20070719 on OpenBSD 5.5.)
> 
> As a new requirement it might be polite to make configure
> insist on it rather than failing the build later, maybe.

Just noticed that this patch was eventually dropped.

Would the appended be the missing piece to have this patch finally merged?

Thanks,

    Emilio

commit ad45e590025c1197a7aef5164e1ae174894b0969
Author: Emilio G. Cota 
Date:   Tue Apr 28 16:54:44 2015 -0400

configure: require __thread support

The codebase doesn't build without __thread support.
Formalise this requirement by adding a check for it in the
configure script.

Signed-off-by: Emilio G. Cota 

diff --git a/configure b/configure
index 6969f6f..3d6591f 100755
--- a/configure
+++ b/configure
@@ -1549,6 +1549,17 @@ if test "$static" = "yes" ; then
   fi
 fi
 
+# Unconditional check for compiler __thread support
+  cat > $TMPC << EOF
+static __thread int tls_var;
+int main(void) { return tls_var; }
+EOF
+
+if ! compile_prog "-Werror" "" ; then
+error_exit "Your compiler does not support the __thread specifier for " \
+   "Thread-Local Storage (TLS). Please upgrade to a version that does."
+fi
+
 if test "$pie" = ""; then
   case "$cpu-$targetos" in
 i386-Linux|x86_64-Linux|x32-Linux|i386-OpenBSD|x86_64-OpenBSD)



Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation

2015-05-26 Thread Emilio G. Cota
On Mon, May 11, 2015 at 11:10:05 +0200, alvise rigo wrote:
> the last commit was b8df9208f357d2b36e1b19634aea973618dc7ba8.

Thanks.

Unfortunately a segfault still happens very early:

$ gdb arm-softmmu/qemu-system-arm
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm...done.
(gdb) set args -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
(gdb) r
Starting program: /local/home/cota/src/qemu/arm-softmmu/qemu-system-arm -M vexpress-a9 -kernel img/arm/zImage -drive file=img/arm/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -net nic,model=lan9118 -net user -nographic -smp 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe9447700 (LWP 4309)]
[New Thread 0x7fffe5246700 (LWP 4310)]
WARNING: Image format was not specified for 'img/arm/rootfs.ext2' and probing guessed raw.
 Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
 Specify the 'raw' format explicitly to remove the restrictions.
[New Thread 0x7fffe4a45700 (LWP 4311)]
audio: Could not init `oss' audio driver

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7fffe4a45700 (LWP 4311)]
pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x55655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
at /local/home/cota/src/qemu/include/qemu/bitops.h:119
119 return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
(gdb) bt
#0  0x55655c34 in test_bit (addr=<optimized out>, nr=<optimized out>)
    at /local/home/cota/src/qemu/include/qemu/bitops.h:119
#1  cpu_physical_memory_excl_is_dirty (addr=18446744073709551615)
    at /local/home/cota/src/qemu/include/exec/ram_addr.h:214
#2  tlb_set_page (cpu=<optimized out>, vaddr=<optimized out>, paddr=503316480,
    prot=<optimized out>, mmu_idx=3, size=<optimized out>)
    at /local/home/cota/src/qemu/cputlb.c:327
#3  0x55712091 in arm_cpu_handle_mmu_fault (cs=0x5632c4e0,
    address=<optimized out>, access_type=0, mmu_idx=3)
    at /local/home/cota/src/qemu/target-arm/helper.c:5726
#4  0x55704f70 in tlb_fill (cs=0x5632c4e0, addr=<optimized out>,
    is_write=<optimized out>, mmu_idx=<optimized out>, retaddr=140737065132893)
    at /local/home/cota/src/qemu/target-arm/op_helper.c:69
#5  0x5565733f in helper_le_ldul_mmu (env=0x56334730,
    addr=503316484, mmu_idx=3, retaddr=<optimized out>)
    at /local/home/cota/src/qemu/softmmu_template.h:190
#6  0x7fffe6c623db in code_gen_buffer ()
#7  0x556148ba in cpu_tb_exec (
tb_ptr=0x7fffe6c62320 "A\213n\374\205\355\017\205\207", cpu=0x5632c4e0)
at /local/home/cota/src/qemu/cpu-exec.c:199
#8  cpu_arm_exec (env=0x56334730)
at /local/home/cota/src/qemu/cpu-exec.c:519
#9  0x5563a880 in tcg_cpu_exec (env=0x56334730)
at /local/home/cota/src/qemu/cpus.c:1354
#10 tcg_exec_all () at /local/home/cota/src/qemu/cpus.c:1387
#11 qemu_tcg_cpu_thread_fn (arg=<optimized out>)
at /local/home/cota/src/qemu/cpus.c:1032
#12 0x740dfe9a in start_thread (arg=0x7fffe4a45700)
at pthread_create.c:308
#13 0x73e0d38d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x in ?? ()

Emilio



Re: [Qemu-devel] [qemu] How to reliably obtain physaddr from vaddr

2015-03-15 Thread Emilio G. Cota
On Sun, Mar 15, 2015 at 16:10:21 -0700, Richard Henderson wrote:
> On 03/15/2015 03:00 AM, Emilio G. Cota wrote:
> > On a TLB hit this is trivial (just do nothing), but on
> > a TLB miss I'm lost on what to do--I cannot even follow
> > where helper_ld/st go (grep doesn't help), although I
> > suspect it's TCG backend ops and I don't see an obvious
> > way of adding a new operation there.
> 
> It goes into softmmu_template.h.  Which then tests a victim tlb, and finally
> calls tlb_fill.  You'll probably need to do the same.

Thanks, figured this out this morning after getting some sleep.

I've defined this vaddr->paddr as a helper and I'm calling it
before every aa32 store. However, this isn't a smooth sailing:

1. futex_init in the kernel causes an oops--it passes vaddr=0
   but the call happens with pagefaults disabled:
 http://lxr.free-electrons.com/source/kernel/futex.c?v=3.18#L590
   in the code below I'm just returning to avoid the oops.

2. The kernel (vexpress-a9 from buildroot) doesn't boot. It dies with:

> [...]
> devtmpfs: mounted
> Freeing unused kernel memory: 256K (805ea000 - 8062a000)
> Cannot continue, found non relative relocs during the bootstrap.
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0e00
> 
> CPU: 0 PID: 1 Comm: init Not tainted 3.18.8 #2
> [<800147cc>] (unwind_backtrace) from [<800115d4>] (show_stack+0x10/0x14)
> [<800115d4>] (show_stack) from [<80473e8c>] (dump_stack+0x84/0x94)
> [<80473e8c>] (dump_stack) from [<80471180>] (panic+0xa0/0x1f8)
> [<80471180>] (panic) from [<800240dc>] (complete_and_exit+0x0/0x1c)
> [<800240dc>] (complete_and_exit) from [<87827f70>] (0x87827f70)
> ---[ end Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x0e00

  Note that if I only call the helper before the "str[hb]*" stores,
  the kernel boots fine.

I'm appending the code (applies on top of current HEAD, 7ccfb495),
any help on what's going on greatly appreciated.

Thanks,

Emilio

diff --git a/cputlb.c b/cputlb.c
index 38f2151..5596bae 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -329,6 +329,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 } else {
 te->addr_write = -1;
 }
+te->addr_phys = paddr;
 }
 
 /* NOTE: this function can trigger an exception */
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 0ca6f0b..65f340c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -74,10 +74,10 @@ typedef uint64_t target_ulong;
 /* use a fully associative victim tlb of 8 entries */
 #define CPU_VTLB_SIZE 8
 
-#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
-#define CPU_TLB_ENTRY_BITS 4
-#else
+#if TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 5
+#else
+#define CPU_TLB_ENTRY_BITS 6
 #endif
 
 typedef struct CPUTLBEntry {
@@ -90,13 +90,14 @@ typedef struct CPUTLBEntry {
 target_ulong addr_read;
 target_ulong addr_write;
 target_ulong addr_code;
+target_ulong addr_phys;
 /* Addend to virtual address to get host address.  IO accesses
use the corresponding iotlb value.  */
 uintptr_t addend;
 /* padding to get a power of two size */
 uint8_t dummy[(1 << CPU_TLB_ENTRY_BITS) -
-  (sizeof(target_ulong) * 3 +
-   ((-sizeof(target_ulong) * 3) & (sizeof(uintptr_t) - 1)) +
+  (sizeof(target_ulong) * 4 +
+   ((-sizeof(target_ulong) * 4) & (sizeof(uintptr_t) - 1)) +
sizeof(uintptr_t))];
 } CPUTLBEntry;
 
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index 1673287..168cde9 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -127,6 +127,9 @@ void helper_stl_mmu(CPUArchState *env, target_ulong addr,
 void helper_stq_mmu(CPUArchState *env, target_ulong addr,
 uint64_t val, int mmu_idx);
 
+hwaddr helper_st_paddr_mmu(CPUArchState *env, target_ulong addr,
+   int mmu_idx);
+
 uint8_t helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index 95ab750..39cde9d 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -129,6 +129,39 @@ glue(glue(cpu_st, SUFFIX), MEMSUFFIX)(CPUArchState *env, target_ulong ptr,
 }
 }
 
+#if DATA_SIZE == 1
+
+static inline hwaddr
+glue(cpu_st_paddr, MEMSUFFIX)(CPUArchState *env, target_ulong ptr)
+{
+int page_index;
+target_ulong addr;
+int mmu_idx;
+hwaddr ret;
+
+addr = ptr;
+/*
+ * XXX Understand why this is necessary.
+ * futex_init on linux bootup calls cmpxchg on a NULL pointer. It expects
+ * -EFAULT to be read back, but when we do the below we get a kernel oops.
+ * However, when doing the load from TCG -EFAULT is read just fine--no oops.
+ */
+if (unlikely(addr == 0))
+   return 0;

Re: [Qemu-devel] [qemu] How to reliably obtain physaddr from vaddr

2015-03-16 Thread Emilio G. Cota
On Sun, Mar 15, 2015 at 20:42:31 -0400, Emilio G. Cota wrote:
> On Sun, Mar 15, 2015 at 16:10:21 -0700, Richard Henderson wrote:
> > It goes into softmmu_template.h.  Which then tests a victim tlb, and finally
> > calls tlb_fill.  You'll probably need to do the same.
> 
> I've defined this vaddr->paddr lookup as a helper and I'm calling it
> before every aa32 store. However, this isn't smooth sailing:
> 
> 1. futex_init in the kernel causes an oops--it passes vaddr=0
>but the call happens with pagefaults disabled:
>  http://lxr.free-electrons.com/source/kernel/futex.c?v=3.18#L590
>in the code below I'm just returning to avoid the oops.

Please disregard this point--the oops doesn't happen with the code
I appended (it was triggered by previous iterations of it).

> 2. The kernel (vexpress-a9 from buildroot) doesn't boot.

Removing the call to tlb_fill() on a TLB miss solves the problem.
But of course this also means the helper doesn't work as intended.

I fail to see why calling tlb_fill() from the helper causes
trouble.  What I thought would happen is that the exception
(if any) is started from the helper, gets serviced, and then
both the helper and the subsequent store hit in the TLB.  I was
seeing this as a "TLB prefetch", but I cannot make it work.

What am I missing?

FWIW I'm appending the delta wrt my previous email.

Thanks,

Emilio

diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index 39cde9d..48c54f9 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -140,14 +140,6 @@ glue(cpu_st_paddr, MEMSUFFIX)(CPUArchState *env, target_ulong ptr)
 hwaddr ret;
 
 addr = ptr;
-/*
- * XXX Understand why this is necessary.
- * futex_init on linux bootup calls cmpxchg on a NULL pointer. It expects
- * -EFAULT to be read back, but when we do the below we get a kernel oops.
- * However, when doing the load from TCG -EFAULT is read just fine--no oops.
- */
-if (unlikely(addr == 0))
-   return 0;
 page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 mmu_idx = CPU_MMU_INDEX;
 if (unlikely(env->tlb_table[mmu_idx][page_index].addr_write !=
diff --git a/softmmu_template.h b/softmmu_template.h
index 172b718..1b6655e 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -476,7 +476,7 @@ hwaddr helper_ret_st_paddr(CPUArchState *env, target_ulong addr,
 if ((addr & TARGET_PAGE_MASK)
 != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
 if (!VICTIM_TLB_HIT(addr_write)) {
-tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+return 0;
 }
 }
 return env->tlb_table[mmu_idx][index].addr_phys;



Re: [Qemu-devel] [qemu] How to reliably obtain physaddr from vaddr

2015-03-16 Thread Emilio G. Cota
On Mon, Mar 16, 2015 at 22:23:24 +, Peter Maydell wrote:
> On 16 March 2015 at 20:08, Emilio G. Cota  wrote:
> > I fail to see why calling tlb_fill() from the helper causes
> > trouble.  What I thought would happen is that the exception
> > (if any) is started from the helper, gets serviced, and then
> > both the helper and the subsequent store hit in the TLB.  I was
> > seeing this as a "TLB prefetch", but I cannot make it work.
> 
> This isn't how tlb_fill handles page faults. What happens is:
> 
> 1. tlb_fill calls arm_cpu_handle_mmu_fault to do the page table walk
(snip)
> 6. the guest OS may or may not end up fixing up the page tables
>and reattempting execution of whatever failed, but that's
>entirely up to it and might never happen

Great description--the last point wasn't all that clear to me.

> I suspect your problem is that the host retaddr in step 3
> is wrong, which will result in our generating the guest
> exception with a wrong value for "guest PC at point of fault".
> Linux makes extensive use of "if guest PC for this fault
> is in this magic bit of code then fix up the result so it
> looks like this kernel load/store accessor function returned
> -EFAULT". If you're reporting the wrong guest PC this won't
> work and the kernel will end up in the default case path
> of it being an unexpected kernel mode data abort and Oopsing.

That was indeed the problem: the TB for that retaddr couldn't
be found.

[ BTW why don't we check the return value of cpu_restore_state?
  I see this has been discussed before:
 http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02589.html
  Certainly an assert there would have helped me debug this. ]

> I suggest you check whether the exception PC reported to
> the guest is correct (it's probably reported by the kernel
> somewhere in the oops output) compared to the addresses in
> the kernel of the load/store/whatever that's faulted.

Yep. The fix is trivial:

+++ b/target-arm/helper.c
@@ -5797,7 +5797,7 @@ void HELPER(v7m_msr)(CPUARMState *env, uint32_t reg, 
uint32_t val)
 
 void HELPER(st_pre)(CPUARMState *env, uint32_t vaddr)
 {
-cpu_st_paddr_data(env, vaddr);
+helper_ret_st_paddr(env, vaddr, cpu_mmu_index(env), GETRA());
 }

.. and with this I learned that cpu_ld/st are only to be called
from op helpers.

Thanks a lot for your help.

Emilio




Re: [Qemu-devel] [RFC v5 0/6] Slow-path for atomic instruction translation

2015-10-01 Thread Emilio G. Cota
On Wed, Sep 30, 2015 at 06:44:32 +0200, Paolo Bonzini wrote:
> I have a doubt about your patches for ll/sc emulation, that I hope you
> can clarify.
> 
> From 1ft, both approaches rely on checking a flag during stores.
> This is split between the TLB and the CPUState for Alvise's patches (in
> order to exploit the existing fast-path checks), and entirely in the
> radix tree for Emilio's.  However, the idea is the same.

Not quite the same idea, see below.

> Now, the patches are okay for serial emulation, but I am not sure if it's
> possible to do lock-free ll/sc emulation, because there is a race.
> 
> If we check the flag before the store, the race is as follows:
> 
>    CPU0                      CPU1
>    ------------------------------------------------------------
>    check flag
>                              load locked:
>                                 set flag
>                                 load value (normal load on CPU)
>    store
>                              store conditional (normal store on CPU)
> 
> where the sc doesn't fail.  For completeness, if we check it afterwards
> (which would be possible with Emilio's approach, though not for the
> TLB-based one):
> 
>    CPU0                      CPU1
>    ------------------------------------------------------------
>                              load locked
>                                 set bit
>                                 load value (normal load on CPU)
>    store
>                              store conditional (normal store on CPU)
>    check flag
> 
> and again the sc doesn't fail.

(snip)
> Tell me I'm wrong. :)

The difference between Alvise's implementation and what I submitted is
that in the radix tree, a cache line that has *ever* had an atomic op
performed on it is marked as "slow path" for the rest of the execution,
meaning that *all* subsequent stores to said cache line will take the
cache line entry's lock.

This does not fix the race completely (e.g. you could have a store
and an atomic op concurrently executing on a line that hasn't yet
had an atomic op on it), but it significantly closes it. My guess
is that it would be very hard to trigger by practical programs.
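
For illustration, here is a standalone sketch of that scheme -- a toy
direct-mapped table standing in for the radix tree, all names assumed,
not the actual patch -- including the residual window just described:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 6            /* 64-byte cache lines */
#define N_LINES    1024         /* toy table standing in for the radix tree */

typedef struct {
    bool ever_atomic;           /* set on first atomic op, never cleared */
    pthread_spinlock_t lock;
} LineEntry;

static LineEntry lines[N_LINES];

__attribute__((constructor)) static void lines_init(void)
{
    for (int i = 0; i < N_LINES; i++) {
        pthread_spin_init(&lines[i].lock, 0);
    }
}

static LineEntry *line_of(uintptr_t paddr)
{
    return &lines[(paddr >> LINE_SHIFT) % N_LINES];
}

/* atomic op (e.g. ll/sc): permanently flag the line, then run under its lock */
void atomic_access(uintptr_t paddr, void (*op)(uintptr_t))
{
    LineEntry *e = line_of(paddr);

    __atomic_store_n(&e->ever_atomic, true, __ATOMIC_SEQ_CST);
    pthread_spin_lock(&e->lock);
    op(paddr);
    pthread_spin_unlock(&e->lock);
}

/* regular store: lock-free unless the line has ever seen an atomic op.
 * The window between the flag check and the plain store is the residual
 * race mentioned above. */
void store_access(uintptr_t paddr, void (*st)(uintptr_t))
{
    LineEntry *e = line_of(paddr);

    if (!__atomic_load_n(&e->ever_atomic, __ATOMIC_ACQUIRE)) {
        st(paddr);              /* fast path: no lock taken */
        return;
    }
    pthread_spin_lock(&e->lock);
    st(paddr);
    pthread_spin_unlock(&e->lock);
}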

E.



Re: [Qemu-devel] [PATCH 08/10] tcg: add memory barriers in page_find_alloc accesses

2015-08-12 Thread Emilio G. Cota
On Wed, Aug 12, 2015 at 18:41:00 +0200, Paolo Bonzini wrote:
> page_find is reading the radix tree outside all locks, so it has to
> use the RCU primitives.  It does not need RCU critical sections
> because the PageDescs are never removed, so there is never a need
> to wait for the end of code sections that use a PageDesc.

Note that page_find_alloc might end up writing to the tree, see below.

BTW the fact that there are no removals makes the use of RCU unnecessary.

> Signed-off-by: Paolo Bonzini 
> ---
>  translate-all.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/translate-all.c b/translate-all.c
> index 7727091..78a787d 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -437,14 +437,14 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>  
>  /* Level 2..N-1.  */
>  for (i = V_L1_SHIFT / V_L2_BITS - 1; i > 0; i--) {
> -void **p = *lp;
> +void **p = atomic_rcu_read(lp);
>  
>  if (p == NULL) {
>  if (!alloc) {
>  return NULL;
>  }
>  p = g_new0(void *, V_L2_SIZE);
> -*lp = p;
> +atomic_rcu_set(lp, p);
>  }
>  
>  lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
> @@ -456,7 +456,7 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>  return NULL;
>  }
>  pd = g_new0(PageDesc, V_L2_SIZE);
> -*lp = pd;
> +atomic_rcu_set(lp, pd);

rcu_set is not enough; a cmpxchg with a fail path would be needed here, since
if page_find_alloc is called without any locks held (as described in the
commit message), several threads could concurrently write to the same node,
corrupting the tree.
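
A sketch of what that fail path could look like, using GCC atomic builtins
in place of QEMU's atomic_* macros (level_get is a hypothetical helper, not
part of the patch):

#include <glib.h>

#define V_L2_SIZE 1024

/* return the next-level node at *lp, allocating it if needed; safe to
 * call concurrently without locks */
static void *level_get(void **lp)
{
    void *p = __atomic_load_n(lp, __ATOMIC_CONSUME);

    if (p == NULL) {
        void *expected = NULL;
        void *node = g_new0(void *, V_L2_SIZE);

        if (__atomic_compare_exchange_n(lp, &expected, node, false,
                                        __ATOMIC_RELEASE, __ATOMIC_CONSUME)) {
            p = node;       /* we won the race: our node is published */
        } else {
            g_free(node);   /* we lost: drop ours, use the winner's */
            p = expected;
        }
    }
    return p;
}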

I argue however that it is better to call page_find/_alloc with a mutex held,
since otherwise we'd have to add per-PageDesc locks (it's very common to call
page_find and then update the PageDesc). I have an RFC patchset for
multithreaded TCG that deals with this; I will submit it once I bring it up
to date with upstream.

Emilio



Re: [Qemu-devel] [PATCH 08/10] tcg: add memory barriers in page_find_alloc accesses

2015-08-13 Thread Emilio G. Cota
On Thu, Aug 13, 2015 at 10:13:32 +0200, Paolo Bonzini wrote:
> On 12/08/2015 22:37, Emilio G. Cota wrote:
> > > page_find is reading the radix tree outside all locks, so it has to
> > > use the RCU primitives.  It does not need RCU critical sections
> > > because the PageDescs are never removed, so there is never a need
> > > to wait for the end of code sections that use a PageDesc.
> >
> > Note that page_find_alloc might end up writing to the tree, see below.
> 
> Yes, but in that case it's always called with the mmap_lock held, see
> patch 7.

Oh I see. Didn't have much time to take a deep look at the patchset; the
commit message got me confused.

> page_find_alloc is only called by tb_alloc_page (called by tb_link_page
> which takes mmap_lock), or by page_set_flags (called with mmap_lock held
> by linux-user/mmap.c).
> 
> > BTW the fact that there are no removals makes the use of RCU unnecessary.
> 
> It only makes it not use the RCU synchronization primitives.  You still
> need the memory barriers.

Yes. Personally I find it confusing to use the RCU macros just for the
convenience of the barriers they bring in; I'd prefer to have the barriers
explicitly in the code.
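
For a single pointer, the explicit-barrier version of what the patch does
with atomic_rcu_set/atomic_rcu_read would look roughly like this (a sketch
assuming QEMU's barrier macros from include/qemu/atomic.h):

#include "qemu/atomic.h"

static void *slot;

/* writer: make the node's contents visible before publishing the pointer */
static void publish(void *node)
{
    smp_wmb();
    atomic_set(&slot, node);
}

/* reader: order the pointer load before any dereference of it */
static void *lookup(void)
{
    void *p = atomic_read(&slot);

    smp_read_barrier_depends();
    return p;
}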

> > I argue however that it is better to call page_find/_alloc with a mutex
> > held, since otherwise we'd have to add per-PageDesc locks (it's very
> > common to call page_find and then update the PageDesc).
> 
> The fields are protected by either the mmap_lock (e.g. the flags, see
> page_unprotect and tb_alloc_page) or the tb_lock (e.g. the tb lists).
> 
> The code is complicated and could definitely use more documentation,
> especially for struct PageDesc, but it seems correct to me apart from
> the lock inversion fixed in patch 10.

OK. I have a bit of time today and tomorrow so I'll rebase my work on
top of this patchset and then submit it.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork

2015-09-08 Thread Emilio G. Cota
On Tue, Sep 08, 2015 at 18:34:38 +0100, Alex Bennée wrote:
> Emilio G. Cota  writes:
(snip)
> > +static void rcu_init_child(void)
> > +{
> > +qemu_mutex_init(&rcu_registry_lock);
> > +}
> >  #endif
> >  
> >  void rcu_after_fork(void)
> > @@ -346,7 +351,7 @@ void rcu_after_fork(void)
> >  static void __attribute__((__constructor__)) rcu_init(void)
> >  {
> >  #ifdef CONFIG_POSIX
> > -pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
> > +pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
> >  #endif
> 
> Hmm previously we unlocked both rcu_sync_lock and rcu_registry_lock, is
> it somehow different in its locking rules? If I'm reading the
> pthread_atfork man page right couldn't we just do:
> 
> pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_lock);

That'd cause the child to deadlock.

Before forking the parent locks those mutexes; then it forks. The child sees
those mutexes as 'locked', and whatever memory the parent writes to _after_
fork is not seen by the child. So trying to 'lock' those mutexes from the child
is never going to succeed, because they're seen as already locked.

The original idea ("unlock" from the child) is OK, but doesn't work when
using PTHREAD_MUTEX_ERRORCHECK. Yes, this was removed (24fa90499f) but
it's too useful to not cherry-pick it when developing MTTCG.
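
A standalone sketch of that discipline (not QEMU code): the parent locks in
prepare() and unlocks after fork; the child re-initializes instead of
unlocking, which also keeps ERRORCHECK mutexes happy:

#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t registry_lock;
static pthread_mutexattr_t attr;

static void prepare(void)   { pthread_mutex_lock(&registry_lock); }
static void in_parent(void) { pthread_mutex_unlock(&registry_lock); }
static void in_child(void)  { pthread_mutex_init(&registry_lock, &attr); }

int main(void)
{
    pthread_mutexattr_init(&attr);
    /* with ERRORCHECK, a child unlocking the parent's mutex fails (EPERM) */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&registry_lock, &attr);
    pthread_atfork(prepare, in_parent, in_child);

    if (fork() == 0) {
        /* the re-initialized lock behaves normally in the child */
        pthread_mutex_lock(&registry_lock);
        pthread_mutex_unlock(&registry_lock);
        _exit(0);
    }
    wait(NULL);
    return 0;
}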

What I think remains to be fixed is the possibility of corruption when
any call_rcu calls are pending while forking. The child should not mess
with these; only the parent should. See call_rcu_after_fork_{parent,child}
from liburcu for details.

Emilio



Re: [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups

2015-09-10 Thread Emilio G. Cota
On Thu, Sep 10, 2015 at 14:22:49 +0100, Alex Bennée wrote:
> Emilio G. Cota  writes:
> 
> > Signed-off-by: Emilio G. Cota 
> > ---
> >  util/qemu-thread-posix.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> > index 04dae0f..3760e27 100644
> > --- a/util/qemu-thread-posix.c
> > +++ b/util/qemu-thread-posix.c
> > @@ -303,7 +303,16 @@ static inline void futex_wake(QemuEvent *ev, int n)
> >  
> >  static inline void futex_wait(QemuEvent *ev, unsigned val)
> >  {
> > -futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0);
> > +while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
> > +switch (errno) {
> > +case EWOULDBLOCK:
> > +return;
> > +case EINTR:
> > +break; /* get out of switch and retry */
> > +default:
> > +abort();
> 
> I'd be tempted to error_exit with the errno in this case so additional
> information is reported before we bail out.

Yes that's a good suggestion.

> The man pages seems to indicate other errnos are possible for FUTUX_WAIT
> although they may be unlikely:
> 
>EACCES No read access to futex memory.
>EFAULT Error retrieving timeout information from user space.
> 
>I guess things would have gone very wrong for these
> 
>EINVAL Invalid argument.
> 
>Hard to get wrong
> 
>ENFILE The system limit on the total number of open files has
>been reached.
> 
>Might happen under system load?
> 
>ENOSYS Invalid operation specified in op.
> 
>Hardcoded op so no
> 
>ETIMEDOUT
>   Timeout during the FUTEX_WAIT operation.
> 
>No timeout specified so we shouldn't hit it

Of these I'd say all would be bugs in our code except for ENFILE, so
it might be worth adding.
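
Something like this would capture both points -- a sketch only: error_exit
is the static reporting helper already in util/qemu-thread-posix.c, and
retrying on ENFILE is just the option floated above:

static inline void futex_wait(QemuEvent *ev, unsigned val)
{
    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
        switch (errno) {
        case EWOULDBLOCK:
            return;
        case EINTR:
        case ENFILE:    /* transient resource exhaustion: retry */
            break;
        default:
            /* report the errno before bailing out */
            error_exit(errno, __func__);
        }
    }
}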

Thanks,

Emilio



Re: [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry

2015-09-10 Thread Emilio G. Cota
On Thu, Sep 10, 2015 at 14:49:07 +0100, Alex Bennée wrote:
> Emilio G. Cota  writes:
> 
> > Having the physical address in the TLB entry will allow us
> > to portably obtain the physical address of a memory access,
> > which will prove useful when implementing a scalable emulation
> > of atomic instructions.
> >
> > Signed-off-by: Emilio G. Cota 
> > ---
> >  cputlb.c| 1 +
> >  include/exec/cpu-defs.h | 7 ---
> >  2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/cputlb.c b/cputlb.c
> > index d1ad8e8..1b3673e 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -409,6 +409,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
> >  } else {
> >  te->addr_write = -1;
> >  }
> > +te->addr_phys = paddr;
> >  }
> >  
> >  /* Add a new TLB entry, but without specifying the memory
> > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> > index 5093be2..ca9c85c 100644
> > --- a/include/exec/cpu-defs.h
> > +++ b/include/exec/cpu-defs.h
> > @@ -60,10 +60,10 @@ typedef uint64_t target_ulong;
> >  /* use a fully associative victim tlb of 8 entries */
> >  #define CPU_VTLB_SIZE 8
> >  
> > -#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
> > -#define CPU_TLB_ENTRY_BITS 4
> > -#else
> > +#if TARGET_LONG_BITS == 32
> >  #define CPU_TLB_ENTRY_BITS 5
> > +#else
> > +#define CPU_TLB_ENTRY_BITS 6
> >  #endif
> >  
> >  /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
> > @@ -110,6 +110,7 @@ typedef struct CPUTLBEntry {
> >  target_ulong addr_read;
> >  target_ulong addr_write;
> >  target_ulong addr_code;
> > +target_ulong addr_phys;
> >  /* Addend to virtual address to get host address.  IO accesses
> > use the corresponding iotlb value.  */
> >  uintptr_t addend;
> 
> So this ends up expanding the TLB entry size and either pushing the
> overall TLB up in size or reducing the number of entries per-TLB so I
> think we would need some numbers on the impact on performance this has.

My tests show little to no perf impact, but note that I work on fairly big
machines that have large caches. I could run some benchmarks, although
this will take a while--will be on vacation for a couple of weeks.

> As far as I can see you never use this value in this patch series so
> maybe this is worth deferring for now?

This is used for atomic instruction emulation in full-system mode; see
aie-helper.c which has calls to helpers added to softmmu_template.h.

Emilio



Re: [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module

2015-09-10 Thread Emilio G. Cota
On Thu, Sep 10, 2015 at 15:25:50 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota  writes:
> 
> > This will be used by atomic instruction emulation code.
> 
> If we are adding utility functions into the code base like this (which I
> can see being useful) we should at least add some documentation with
> example calling conventions to docs/

Ack.

> Have you any performance numbers comparing the efficiency of the radix
> approach to a "dumb" implementation?

I tried a few lazy solutions (meaning having to write less code, which
isn't that dumb), such as having a g_hash_table protected by a mutex. But
this means potential contention on that mutex, so that's not a better option.

Whatever the solution might be, this is a slow path (only invoked
on atomic ops or regular stores that affect lines previously accessed
with atomic ops) so scalability is a more important goal than absolute speed.

The big perf impact comes on x86 when instrumenting stores, which is
orthogonal to this; invoking a helper on each store is a huge perf
killer for some benchmarks. It's OK for others though, but I still think
we should be able to do better.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb()

2015-09-10 Thread Emilio G. Cota
On Thu, Sep 10, 2015 at 17:01:14 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota  writes:
> 
> > Signed-off-by: Emilio G. Cota 
> > ---
> >  tcg/tcg-op.h | 10 ++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> > index 52482c0..3ec9f13 100644
> > --- a/tcg/tcg-op.h
> > +++ b/tcg/tcg-op.h
> > @@ -716,6 +716,16 @@ static inline void tcg_gen_fence_full(void)
> >  tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
> >  }
> >  
> > +#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
> > +static inline void tcg_gen_smp_rmb(void)
> > +{ }
> > +#else
> > +static inline void tcg_gen_smp_rmb(void)
> > +{
> > +tcg_gen_fence_load();
> > +}
> > +#endif
> 
> This seems a little pointless wrapping up tcg_gen_fence_load. Could the
> magic dealing with the backend not be done with something like
> TCG_TARGET_HAS_fence_load. On the x86/x86_64 backends this could then
> NOP away.

This patch made sense at the time as a companion to this other patch:

cpu: add barriers around cpu->tcg_exit_req
(snip)
+++ b/include/exec/gen-icount.h
@@ -16,6 +16,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
 exitreq_label = gen_new_label();
 flag = tcg_temp_new_i32();
+tcg_gen_smp_rmb();
 tcg_gen_ld_i32(flag, cpu_env,
offsetof(CPUState, tcg_exit_req) - ENV_OFFSET);
 tcg_gen_brcondi_i32(TCG_COND_NE, flag, 0, exitreq_label);

Paolo had the better idea of calling smp_rmb() once we've
exited TCG, which renders tcg_gen_smp_rmb() unnecessary.

Emilio



[Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map

2015-08-23 Thread Emilio G. Cota
l1_map is based on physical addresses in full-system mode, as pointed
out in an earlier comment. Said comment also mentions that virtual
addresses are only used in l1_map in user-only mode.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 31239db..b873d5c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -122,8 +122,7 @@ uintptr_t qemu_real_host_page_mask;
 uintptr_t qemu_host_page_size;
 uintptr_t qemu_host_page_mask;
 
-/* This is a multi-level map on the virtual address space.
-   The bottom level has pointers to PageDesc.  */
+/* The bottom level has pointers to PageDesc */
 static void *l1_map[V_L1_SIZE];
 
 /* code generation context */
-- 
1.9.1




[Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow

2015-08-23 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index f53475c..b8a11e1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -330,6 +330,7 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 if (!tb) {
 tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
 }
+mmap_unlock();
 }
 
 /* we add the TB in the virtual pc hash table */
-- 
1.9.1




[Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry

2015-08-23 Thread Emilio G. Cota
Having the physical address in the TLB entry will allow us
to portably obtain the physical address of a memory access,
which will prove useful when implementing a scalable emulation
of atomic instructions.

Signed-off-by: Emilio G. Cota 
---
 cputlb.c| 1 +
 include/exec/cpu-defs.h | 7 ---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index d1ad8e8..1b3673e 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -409,6 +409,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
 } else {
 te->addr_write = -1;
 }
+te->addr_phys = paddr;
 }
 
 /* Add a new TLB entry, but without specifying the memory
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 5093be2..ca9c85c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -60,10 +60,10 @@ typedef uint64_t target_ulong;
 /* use a fully associative victim tlb of 8 entries */
 #define CPU_VTLB_SIZE 8
 
-#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
-#define CPU_TLB_ENTRY_BITS 4
-#else
+#if TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 5
+#else
+#define CPU_TLB_ENTRY_BITS 6
 #endif
 
 /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
@@ -110,6 +110,7 @@ typedef struct CPUTLBEntry {
 target_ulong addr_read;
 target_ulong addr_write;
 target_ulong addr_code;
+target_ulong addr_phys;
 /* Addend to virtual address to get host address.  IO accesses
use the corresponding iotlb value.  */
 uintptr_t addend;
-- 
1.9.1




[Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry

2015-08-23 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 include/qemu/seqlock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 3ff118a..f1256f5 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -62,7 +62,7 @@ static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
 return ret;
 }
 
-static int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
+static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
 {
 /* Read other fields before reading final sequence.  */
 smp_rmb();
-- 
1.9.1




[Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork

2015-08-23 Thread Emilio G. Cota
We were unlocking this lock after fork, which is wrong since
only the thread that holds a mutex is allowed to unlock it.

Signed-off-by: Emilio G. Cota 
---
 util/rcu.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/util/rcu.c b/util/rcu.c
index 8ba304d..47c2bce 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -335,6 +335,11 @@ static void rcu_init_unlock(void)
 qemu_mutex_unlock(&rcu_registry_lock);
 qemu_mutex_unlock(&rcu_sync_lock);
 }
+
+static void rcu_init_child(void)
+{
+qemu_mutex_init(&rcu_registry_lock);
+}
 #endif
 
 void rcu_after_fork(void)
@@ -346,7 +351,7 @@ void rcu_after_fork(void)
 static void __attribute__((__constructor__)) rcu_init(void)
 {
 #ifdef CONFIG_POSIX
-pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
+pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_child);
 #endif
 rcu_init_complete();
 }
-- 
1.9.1




[Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically

2015-08-23 Thread Emilio G. Cota
With this change we make sure that the compiler will not
optimise the read of the sequence number in any way.

Signed-off-by: Emilio G. Cota 
---
 include/qemu/seqlock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index f1256f5..70b01fd 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -55,18 +55,18 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
 {
 /* Always fail if a write is in progress.  */
-unsigned ret = sl->sequence & ~1;
+unsigned ret = atomic_read(&sl->sequence);
 
 /* Read sequence before reading other fields.  */
 smp_rmb();
-return ret;
+return ret & ~1;
 }
 
 static inline int seqlock_read_retry(const QemuSeqLock *sl, unsigned start)
 {
 /* Read other fields before reading final sequence.  */
 smp_rmb();
-return unlikely(sl->sequence != start);
+return unlikely(atomic_read(&sl->sequence) != start);
 }
 
 #endif
-- 
1.9.1




[Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock

2015-08-23 Thread Emilio G. Cota
This is a thread-local variable and therefore all changes
to it will be seen in order by the owning thread. There is
no need for it to be volatile.
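
For illustration (a standalone example, not part of the patch): each thread
gets its own copy, so only the owning thread's program order matters and
volatile buys nothing:

#include <pthread.h>
#include <stdio.h>

static __thread int have_tb_lock;   /* one private copy per thread */

static void *worker(void *arg)
{
    have_tb_lock++;     /* plain access; always seen in order by this thread */
    printf("depth in worker: %d\n", have_tb_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    printf("depth in main: %d\n", have_tb_lock);    /* still 0: separate copy */
    return 0;
}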

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/translate-all.c b/translate-all.c
index 901a35e..31239db 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -130,7 +130,7 @@ static void *l1_map[V_L1_SIZE];
 TCGContext tcg_ctx;
 
 /* translation block context */
-__thread volatile int have_tb_lock;
+__thread int have_tb_lock;
 
 void tb_lock(void)
 {
-- 
1.9.1




[Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec()

2015-08-23 Thread Emilio G. Cota
So that it applies to usermode as well.

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c | 2 ++
 cpus.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index b8a11e1..2b9a447 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -386,6 +386,8 @@ int cpu_exec(CPUState *cpu)
 uintptr_t next_tb;
 SyncClocks sc;
 
+current_cpu = cpu;
+
 #ifndef CONFIG_USER_ONLY
 /* FIXME: user-mode emulation probably needs a similar mechanism as well,
  * for example for tb_flush.
diff --git a/cpus.c b/cpus.c
index 5484ce6..0fe6576 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1079,7 +1079,6 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 cpu->thread_id = qemu_get_thread_id();
 cpu->created = true;
 cpu->can_do_io = 1;
-current_cpu = cpu;
 
 qemu_cond_signal(&qemu_cpu_cond);
 
-- 
1.9.1




[Qemu-devel] [RFC 18/38] tcg: add fences

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 tcg/tcg-op.c  |  5 +
 tcg/tcg-op.h  | 18 ++
 tcg/tcg-opc.h |  5 +
 3 files changed, 28 insertions(+)

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 45098c3..6d5b1df 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -57,6 +57,11 @@ static void tcg_emit_op(TCGContext *ctx, TCGOpcode opc, int args)
 };
 }
 
+void tcg_gen_op0(TCGContext *ctx, TCGOpcode opc)
+{
+tcg_emit_op(ctx, opc, -1);
+}
+
 void tcg_gen_op1(TCGContext *ctx, TCGOpcode opc, TCGArg a1)
 {
 int pi = ctx->gen_next_parm_idx;
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index d1d763f..52482c0 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -28,6 +28,7 @@
 
 /* Basic output routines.  Not for general consumption.  */
 
+void tcg_gen_op0(TCGContext *, TCGOpcode);
 void tcg_gen_op1(TCGContext *, TCGOpcode, TCGArg);
 void tcg_gen_op2(TCGContext *, TCGOpcode, TCGArg, TCGArg);
 void tcg_gen_op3(TCGContext *, TCGOpcode, TCGArg, TCGArg, TCGArg);
@@ -698,6 +699,23 @@ static inline void tcg_gen_trunc_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
 tcg_gen_trunc_shr_i64_i32(ret, arg, 0);
 }
 
+/* fences */
+
+static inline void tcg_gen_fence_load(void)
+{
+tcg_gen_op0(&tcg_ctx, INDEX_op_fence_load);
+}
+
+static inline void tcg_gen_fence_store(void)
+{
+tcg_gen_op0(&tcg_ctx, INDEX_op_fence_store);
+}
+
+static inline void tcg_gen_fence_full(void)
+{
+tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
+}
+
 /* QEMU specific operations.  */
 
 #ifndef TARGET_LONG_BITS
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 13ccb60..85de953 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
DEF(muls2_i64, 2, 2, 0, IMPL64 | IMPL(TCG_TARGET_HAS_muls2_i64))
 DEF(muluh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_muluh_i64))
 DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))
 
+/* fences */
+DEF(fence_load, 0, 0, 0, 0)
+DEF(fence_store, 0, 0, 0, 0)
+DEF(fence_full, 0, 0, 0, 0)
+
 /* QEMU specific */
 #if TARGET_LONG_BITS > TCG_TARGET_REG_BITS
 DEF(debug_insn_start, 0, 0, 2, TCG_OPF_NOT_PRESENT)
-- 
1.9.1




[Qemu-devel] [RFC 29/38] tcg: export have_tb_lock

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 tcg/tcg.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 8d30d61..9a873ac 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -599,6 +599,7 @@ void tb_lock(void);
 void tb_unlock(void);
 bool tb_lock_recursive(void);
 void tb_lock_reset(void);
+extern __thread int have_tb_lock;
 
 /* Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
-- 
1.9.1




[Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions

2015-08-24 Thread Emilio G. Cota
On some parallel workloads this gives up to a 15% speed improvement.

Signed-off-by: Emilio G. Cota 
---
 include/qemu/thread-posix.h | 47 ++
 include/qemu/thread.h   |  6 --
 util/qemu-thread-posix.c| 50 +
 3 files changed, 52 insertions(+), 51 deletions(-)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index 8ce8f01..7d3a9f1 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -37,4 +37,51 @@ struct QemuThread {
 pthread_t thread;
 };
 
+void qemu_spin_error_exit(int err, const char *msg);
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+int err;
+
+err = pthread_spin_init(&spin->lock, 0);
+if (err) {
+qemu_spin_error_exit(err, __func__);
+}
+}
+
+static inline void qemu_spin_destroy(QemuSpin *spin)
+{
+int err;
+
+err = pthread_spin_destroy(&spin->lock);
+if (err) {
+qemu_spin_error_exit(err, __func__);
+}
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+int err;
+
+err = pthread_spin_lock(&spin->lock);
+if (err) {
+qemu_spin_error_exit(err, __func__);
+}
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+return pthread_spin_trylock(&spin->lock);
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+int err;
+
+err = pthread_spin_unlock(&spin->lock);
+if (err) {
+qemu_spin_error_exit(err, __func__);
+}
+}
+
 #endif
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f5d1259..003daab 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -26,12 +26,6 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
-void qemu_spin_init(QemuSpin *spin);
-void qemu_spin_destroy(QemuSpin *spin);
-void qemu_spin_lock(QemuSpin *spin);
-int qemu_spin_trylock(QemuSpin *spin);
-void qemu_spin_unlock(QemuSpin *spin);
-
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 224bacc..04dae0f 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -48,6 +48,11 @@ static void error_exit(int err, const char *msg)
 abort();
 }
 
+void qemu_spin_error_exit(int err, const char *msg)
+{
+error_exit(err, msg);
+}
+
 void qemu_mutex_init(QemuMutex *mutex)
 {
 int err;
@@ -89,51 +94,6 @@ void qemu_mutex_unlock(QemuMutex *mutex)
 error_exit(err, __func__);
 }
 
-void qemu_spin_init(QemuSpin *spin)
-{
-int err;
-
-err = pthread_spin_init(&spin->lock, 0);
-if (err) {
-error_exit(err, __func__);
-}
-}
-
-void qemu_spin_destroy(QemuSpin *spin)
-{
-int err;
-
-err = pthread_spin_destroy(&spin->lock);
-if (err) {
-error_exit(err, __func__);
-}
-}
-
-void qemu_spin_lock(QemuSpin *spin)
-{
-int err;
-
-err = pthread_spin_lock(&spin->lock);
-if (err) {
-error_exit(err, __func__);
-}
-}
-
-int qemu_spin_trylock(QemuSpin *spin)
-{
-return pthread_spin_trylock(&spin->lock);
-}
-
-void qemu_spin_unlock(QemuSpin *spin)
-{
-int err;
-
-err = pthread_spin_unlock(&spin->lock);
-if (err) {
-error_exit(err, __func__);
-}
-}
-
 void qemu_cond_init(QemuCond *cond)
 {
 int err;
-- 
1.9.1




[Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h

2015-08-24 Thread Emilio G. Cota
So that the declaration of tb_lock can be found.

Signed-off-by: Emilio G. Cota 
---
 hw/i386/kvmvapic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 1c3b5b6..a9a33fd 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -13,6 +13,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "hw/sysbus.h"
+#include "tcg/tcg.h"
 
 #define VAPIC_IO_PORT   0x7e
 
-- 
1.9.1




[Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb()

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 tcg/tcg-op.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 52482c0..3ec9f13 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -716,6 +716,16 @@ static inline void tcg_gen_fence_full(void)
 tcg_gen_op0(&tcg_ctx, INDEX_op_fence_full);
 }
 
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
+static inline void tcg_gen_smp_rmb(void)
+{ }
+#else
+static inline void tcg_gen_smp_rmb(void)
+{
+tcg_gen_fence_load();
+}
+#endif
+
 /* QEMU specific operations.  */
 
 #ifndef TARGET_LONG_BITS
-- 
1.9.1




[Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create)

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 linux-user/syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index f62c698..732936f 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -4513,6 +4513,7 @@ static void *clone_func(void *arg)
 CPUState *cpu;
 TaskState *ts;
 
+rcu_register_thread();
 env = info->env;
 cpu = ENV_GET_CPU(env);
 thread_cpu = cpu;
@@ -5614,6 +5615,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 thread_cpu = NULL;
 object_unref(OBJECT(cpu));
 g_free(ts);
+rcu_unregister_thread();
 pthread_exit(NULL);
 }
 #ifdef TARGET_GPROF
-- 
1.9.1




[Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses

2015-08-24 Thread Emilio G. Cota
This will be used by the atomic instruction emulation code.

Signed-off-by: Emilio G. Cota 
---
 softmmu_template.h | 48 
 tcg/tcg.h  |  5 +
 2 files changed, 53 insertions(+)

diff --git a/softmmu_template.h b/softmmu_template.h
index b66eaf8..6496a8a 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -480,6 +480,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 #endif
 }
 
+#if DATA_SIZE == 1
+
+/* get a load's physical address */
+hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
+  int mmu_idx, uintptr_t retaddr)
+{
+int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
+target_ulong taddr;
+target_ulong phys_addr;
+
+retaddr -= GETPC_ADJ;
+taddr = te->addr_read & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
+if (taddr != (addr & TARGET_PAGE_MASK)) {
+if (!VICTIM_TLB_HIT(addr_read)) {
+CPUState *cs = ENV_GET_CPU(env);
+
+tlb_fill(cs, addr, MMU_DATA_LOAD, mmu_idx, retaddr);
+}
+}
+phys_addr = te->addr_phys;
+return phys_addr | (addr & ~TARGET_PAGE_MASK);
+}
+
+/* get a store's physical address */
+hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
+  int mmu_idx, uintptr_t retaddr)
+{
+int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+CPUTLBEntry *te = &env->tlb_table[mmu_idx][index];
+target_ulong taddr;
+target_ulong phys_addr;
+
+retaddr -= GETPC_ADJ;
+taddr = te->addr_write & (TARGET_PAGE_MASK | TLB_INVALID_MASK);
+if (taddr != (addr & TARGET_PAGE_MASK)) {
+if (!VICTIM_TLB_HIT(addr_write)) {
+CPUState *cs = ENV_GET_CPU(env);
+
+tlb_fill(cs, addr, MMU_DATA_STORE, mmu_idx, retaddr);
+}
+}
+phys_addr = te->addr_phys;
+return phys_addr | (addr & ~TARGET_PAGE_MASK);
+}
+
+#endif /* DATA_SIZE == 1 */
+
 #if DATA_SIZE > 1
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 66b36f2..8d30d61 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
TCGMemOpIdx oi, uintptr_t retaddr);
 
+hwaddr helper_ret_get_ld_phys(CPUArchState *env, target_ulong addr,
+  int mmu_idx, uintptr_t retaddr);
+hwaddr helper_ret_get_st_phys(CPUArchState *env, target_ulong addr,
+  int mmu_idx, uintptr_t retaddr);
+
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
 # define helper_ret_ldsw_mmu  helper_be_ldsw_mmu
-- 
1.9.1




[Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 util/qemu-thread-posix.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 04dae0f..3760e27 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -303,7 +303,16 @@ static inline void futex_wake(QemuEvent *ev, int n)
 
 static inline void futex_wait(QemuEvent *ev, unsigned val)
 {
-futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0);
+while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
+switch (errno) {
+case EWOULDBLOCK:
+return;
+case EINTR:
+break; /* get out of switch and retry */
+default:
+abort();
+}
+}
 }
 #else
 static inline void futex_wake(QemuEvent *ev, int n)
-- 
1.9.1




[Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode

2015-08-24 Thread Emilio G. Cota
Note that user-only uses mmap_lock for this.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index e7b4a31..8f8c402 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1203,8 +1203,9 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
  * Called with mmap_lock held for user-mode emulation
  * If called from generated code, iothread mutex must not be held.
  */
-void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
-   int is_cpu_write_access)
+static void
+tb_invalidate_phys_page_range_locked(tb_page_addr_t start, tb_page_addr_t end,
+ int is_cpu_write_access)
 {
 TranslationBlock *tb, *tb_next, *saved_tb;
 CPUState *cpu = current_cpu;
@@ -1236,7 +1237,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 /* we remove all the TBs in the range [start, end[ */
 /* XXX: see if in some cases it could be faster to invalidate all
the code */
-tb_lock();
 tb = p->first_tb;
 while (tb != NULL) {
 n = (uintptr_t)tb & 3;
@@ -1310,14 +1310,19 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 cpu_resume_from_signal(cpu, NULL);
 }
 #endif
+}
+
+void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
+   int is_cpu_write_access)
+{
+tb_lock();
+tb_invalidate_phys_page_range_locked(start, end, is_cpu_write_access);
 tb_unlock();
 }
 
 #ifdef CONFIG_SOFTMMU
-/* len must be <= 8 and start must be a multiple of len.
- * Called via softmmu_template.h, with iothread mutex not held.
- */
-void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
+
+static void tb_invalidate_phys_page_fast_locked(tb_page_addr_t start, int len)
 {
 PageDesc *p;
 
@@ -1352,9 +1357,19 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 }
 } else {
 do_invalidate:
-tb_invalidate_phys_page_range(start, start + len, 1);
+tb_invalidate_phys_page_range_locked(start, start + len, 1);
 }
 }
+
+/* len must be <= 8 and start must be a multiple of len.
+ * Called via softmmu_template.h, with iothread mutex not held.
+ */
+void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
+{
+tb_lock();
+tb_invalidate_phys_page_fast_locked(start, len);
+tb_unlock();
+}
 #else
 /* Called with mmap_lock held.  */
 static void tb_invalidate_phys_page(tb_page_addr_t addr,
-- 
1.9.1




[Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module

2015-08-24 Thread Emilio G. Cota
This will be used by atomic instruction emulation code.
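
As a usage illustration, a minimal sketch against the API added below;
the Leaf type and both callbacks are hypothetical, not part of the patch:

#include <glib.h>

#include "qemu/radix-tree.h"

typedef struct Leaf {
    unsigned long index;
} Leaf;

static void *leaf_create(unsigned long index)
{
    Leaf *leaf = g_new0(Leaf, 1);

    leaf->index = index;
    return leaf;
}

/* Called on the loser of an insertion race to discard its allocation. */
static void leaf_delete(void *p)
{
    g_free(p);
}

static QemuRadixTree tree;

/* At startup: qemu_radix_tree_init(&tree, 21, 8), i.e. 21-bit indices
 * consumed 8 bits (256 slots) per level. */
static Leaf *leaf_get(unsigned long index)
{
    return qemu_radix_tree_find_alloc(&tree, index, leaf_create, leaf_delete);
}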

Signed-off-by: Emilio G. Cota 
---
 include/qemu/radix-tree.h | 29 ++
 util/Makefile.objs|  2 +-
 util/radix-tree.c | 75 +++
 3 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/radix-tree.h
 create mode 100644 util/radix-tree.c

diff --git a/include/qemu/radix-tree.h b/include/qemu/radix-tree.h
new file mode 100644
index 000..a4e1f97
--- /dev/null
+++ b/include/qemu/radix-tree.h
@@ -0,0 +1,29 @@
+#ifndef RADIX_TREE_H
+#define RADIX_TREE_H
+
+#include 
+
+typedef struct QemuRadixNode QemuRadixNode;
+typedef struct QemuRadixTree QemuRadixTree;
+
+struct QemuRadixNode {
+void *slots[0];
+};
+
+struct QemuRadixTree {
+QemuRadixNode *root;
+int radix;
+int max_height;
+};
+
+void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix);
+void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
+ void *(*create)(unsigned long),
+ void (*delete)(void *));
+
+static inline void *qemu_radix_tree_find(QemuRadixTree *t, unsigned long index)
+{
+return qemu_radix_tree_find_alloc(t, index, NULL, NULL);
+}
+
+#endif /* RADIX_TREE_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 114d657..6b18d3d 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o radix-tree.o
 util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o event_notifier-win32.o
 util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o event_notifier-posix.o qemu-openpty.o
 util-obj-y += envlist.o path.o module.o
diff --git a/util/radix-tree.c b/util/radix-tree.c
new file mode 100644
index 000..69eff29
--- /dev/null
+++ b/util/radix-tree.c
@@ -0,0 +1,75 @@
+/*
+ * radix-tree.c
+ * Non-blocking radix tree.
+ *
+ * Features:
+ * - Concurrent lookups and inserts.
+ * - No support for deletions.
+ *
+ * Conventions:
+ * - Height is counted starting from 0 at the bottom.
+ * - The index is used from left to right, i.e. MSBs are used first. This way
+ *   nearby addresses land in nearby slots, minimising cache/TLB misses.
+ */
+#include 
+
+#include "qemu/radix-tree.h"
+#include "qemu/atomic.h"
+#include "qemu/bitops.h"
+#include "qemu/osdep.h"
+
+typedef struct QemuRadixNode QemuRadixNode;
+
+void *qemu_radix_tree_find_alloc(QemuRadixTree *tree, unsigned long index,
+ void *(*create)(unsigned long),
+ void (*delete)(void *))
+{
+QemuRadixNode *parent;
+QemuRadixNode *node = tree->root;
+void **slot;
+int n_slots = BIT(tree->radix);
+int level = tree->max_height - 1;
+int shift = (level - 1) * tree->radix;
+
+do {
+parent = node;
+slot = parent->slots + ((index >> shift) & (n_slots - 1));
+node = atomic_read(slot);
+smp_read_barrier_depends();
+if (node == NULL) {
+void *old;
+void *new;
+
+if (!create) {
+return NULL;
+}
+
+if (level == 1) {
+node = create(index);
+} else {
+node = g_malloc0(sizeof(*node) + sizeof(void *) * n_slots);
+}
+new = node;
+/* atomic_cmpxchg is type-safe so we cannot use 'node' here */
+old = atomic_cmpxchg(slot, NULL, new);
+if (old) {
+if (level == 1) {
+delete(node);
+} else {
+g_free(node);
+}
+node = old;
+}
+}
+shift -= tree->radix;
+level--;
+} while (level > 0);
+return node;
+}
+
+void qemu_radix_tree_init(QemuRadixTree *tree, int bits, int radix)
+{
+tree->radix = radix;
+tree->max_height = 1 + DIV_ROUND_UP(bits, radix);
+tree->root = g_malloc0(sizeof(*tree->root) + sizeof(void *) * BIT(radix));
+}
-- 
1.9.1




[Qemu-devel] [RFC 17/38] aie: add target helpers

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 aie-helper.c  | 112 ++
 include/exec/cpu-defs.h   |   5 +++
 include/qemu/aie-helper.h |   6 +++
 3 files changed, 123 insertions(+)
 create mode 100644 aie-helper.c
 create mode 100644 include/qemu/aie-helper.h

diff --git a/aie-helper.c b/aie-helper.c
new file mode 100644
index 000..7521150
--- /dev/null
+++ b/aie-helper.c
@@ -0,0 +1,112 @@
+/*
+ * To be included directly from the target's helper.c
+ */
+#include "qemu/aie.h"
+
+#ifdef CONFIG_USER_ONLY
+static inline
+hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+return vaddr;
+}
+
+static inline
+hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+return vaddr;
+}
+#else
+static inline
+hwaddr h_get_ld_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+return helper_ret_get_ld_phys(env, vaddr, cpu_mmu_index(env), retaddr);
+}
+
+static inline
+hwaddr h_get_st_phys(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+return helper_ret_get_st_phys(env, vaddr, cpu_mmu_index(env), retaddr);
+}
+#endif /* CONFIG_USER_ONLY */
+
+static inline void h_aie_lock(CPUArchState *env, hwaddr paddr)
+{
+AIEEntry *entry = aie_entry_get_lock(paddr);
+
+env->aie_entry = entry;
+env->aie_locked = true;
+}
+
+static inline void h_aie_unlock(CPUArchState *env)
+{
+assert(env->aie_entry && env->aie_locked);
+qemu_spin_unlock(&env->aie_entry->lock);
+env->aie_locked = false;
+}
+
+static inline void h_aie_unlock__done(CPUArchState *env)
+{
+h_aie_unlock(env);
+env->aie_entry = NULL;
+}
+
+static inline
+void aie_ld_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+hwaddr paddr;
+
+assert(!env->aie_locked);
+paddr = h_get_ld_phys(env, vaddr, retaddr);
+h_aie_lock(env, paddr);
+}
+
+void HELPER(aie_ld_lock)(CPUArchState *env, target_ulong vaddr)
+{
+aie_ld_lock_ret(env, vaddr, GETRA());
+}
+
+static inline
+void aie_st_lock_ret(CPUArchState *env, target_ulong vaddr, uintptr_t retaddr)
+{
+hwaddr paddr;
+
+assert(!env->aie_locked);
+paddr = h_get_st_phys(env, vaddr, retaddr);
+h_aie_lock(env, paddr);
+}
+
+void HELPER(aie_unlock__done)(CPUArchState *env)
+{
+h_aie_unlock__done(env);
+}
+
+void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
+{
+if (likely(!env->aie_lock_enabled) || env->aie_locked) {
+return;
+}
+aie_ld_lock_ret(env, vaddr, GETRA());
+}
+
+void HELPER(aie_st_pre)(CPUArchState *env, target_ulong vaddr)
+{
+if (unlikely(env->aie_lock_enabled)) {
+if (env->aie_locked) {
+return;
+}
+aie_st_lock_ret(env, vaddr, GETRA());
+} else {
+hwaddr paddr = h_get_st_phys(env, vaddr, GETRA());
+
+if (unlikely(aie_entry_exists(paddr))) {
+h_aie_lock(env, paddr);
+}
+}
+}
+
+void HELPER(aie_st_post)(CPUArchState *env, target_ulong vaddr)
+{
+if (unlikely(!env->aie_lock_enabled && env->aie_locked)) {
+h_aie_unlock__done(env);
+}
+}
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index ca9c85c..e6e4568 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -28,6 +28,7 @@
 #include "qemu/osdep.h"
 #include "qemu/queue.h"
 #include "tcg-target.h"
+#include "qemu/aie.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
 #endif
@@ -152,5 +153,9 @@ typedef struct CPUIOTLBEntry {
 #define CPU_COMMON  \
 /* soft mmu support */  \
 CPU_COMMON_TLB  \
+AIEEntry *aie_entry;\
+bool aie_locked;\
+bool aie_lock_enabled;  \
+bool aie_llsc_st_tracking;  \
 
 #endif
diff --git a/include/qemu/aie-helper.h b/include/qemu/aie-helper.h
new file mode 100644
index 000..86a786a
--- /dev/null
+++ b/include/qemu/aie-helper.h
@@ -0,0 +1,6 @@
+DEF_HELPER_2(aie_ld_pre, void, env, tl)
+DEF_HELPER_2(aie_st_pre, void, env, tl)
+DEF_HELPER_2(aie_st_post, void, env, tl)
+
+DEF_HELPER_2(aie_ld_lock, void, env, tl)
+DEF_HELPER_1(aie_unlock__done, void, env)
-- 
1.9.1




[Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode

2015-08-24 Thread Emilio G. Cota
Hi all,

Here is MTTCG code I've been working on out-of-tree for the last few months.

The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg

The highlights of the patchset are as follows:

- The first 5 patches are direct fixes to bugs only in the mttcg
  branch.

- Patches 6-12 fix issues in the master branch.

- The remaining patches are really the meat of this patchset.
  The main features are:

  * Support of MTTCG for both user and system mode.

  * Design: per-CPU TB jump list protected by a seqlock. If the TB is not
found there, check the global, RCU-protected 'hash table' (i.e. fixed
number of buckets); if it's not there either, grab the lock, check again,
and if it's still not there, generate the code and add the TB to the hash
table. A standalone sketch of this lookup order follows below.

It makes sense that Paolo's recent work on the mttcg branch ended up
being almost identical to this--it's simple and it scales well.
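
For illustration, a self-contained C11 sketch of the shared-table part of
that lookup (lock-free walk, then lock/re-check/insert); the per-CPU jump
cache in front of it is the seqlock-protected cache of patch 26. All names
here are made up, this is not QEMU code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NBUCKETS 1024

typedef struct Entry {
    uintptr_t key;
    _Atomic(struct Entry *) next;
} Entry;

static _Atomic(Entry *) table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lock-free walk of one bucket; safe because entries are never removed. */
static Entry *bucket_find(_Atomic(Entry *) *bucket, uintptr_t key)
{
    Entry *e;

    for (e = atomic_load(bucket); e; e = atomic_load(&e->next)) {
        if (e->key == key) {
            return e;
        }
    }
    return NULL;
}

static Entry *lookup_or_insert(uintptr_t key, Entry *(*create)(uintptr_t))
{
    _Atomic(Entry *) *bucket = &table[key % NBUCKETS];
    Entry *e = bucket_find(bucket, key);

    if (e) {
        return e;                      /* fast path: no lock taken */
    }
    pthread_mutex_lock(&table_lock);
    e = bucket_find(bucket, key);      /* re-check under the lock */
    if (!e) {
        e = create(key);               /* e.g. translate the block */
        atomic_store(&e->next, atomic_load(bucket));
        atomic_store(bucket, e);       /* publish the initialized entry */
    }
    pthread_mutex_unlock(&table_lock);
    return e;
}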

  * tb_lock must be held every time code is generated. The rationale is
that most of the time QEMU is executing code, not generating it.

  * tb_flush: do it once all other CPUs have been put to sleep by calling
synchronize_rcu().
We also instrument tb_lock to make sure that only one tb_flush request can
happen at a given time.  For this a mechanism to schedule work is added to
supersede cpu_sched_safe_work, which cannot work in usermode.  Here I've
toyed with an alternative version that doesn't force the flushing CPU to
exit, but to make this work we have to save/restore the RCU read
lock while tb_lock is held in order to avoid deadlocks. This isn't too
pretty but it's good to know that the option is there.

  * I focused on x86 since it is a complex ISA and we support many cores via 
-smp.
I work on a 64-core machine so concurrency bugs show up relatively easily.

Atomics are modeled using spinlocks, i.e. one host lock per guest cache 
line.
Note that spinlocks are way better than mutexes for this--perf on 64-cores
is 2X with spinlocks on highly concurrent workloads (synchrobench, see 
below).
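
To make "one host lock per guest cache line" concrete, here is a standalone
sketch; the series itself maps lines to locks through the radix tree of
patches 15-16 rather than the fixed array used here, and all names are
made up:

#include <pthread.h>
#include <stdint.h>

#define LINE_BITS 6                     /* 64-byte guest cache lines */
#define NLOCKS    (1u << 16)            /* fixed table, for the sketch only */

static pthread_spinlock_t line_locks[NLOCKS];

static void line_locks_init(void)
{
    unsigned i;

    for (i = 0; i < NLOCKS; i++) {
        pthread_spin_init(&line_locks[i], PTHREAD_PROCESS_PRIVATE);
    }
}

/* Discard the low bits so all accesses within a line share one lock. */
static pthread_spinlock_t *lock_for(uint64_t guest_paddr)
{
    return &line_locks[(guest_paddr >> LINE_BITS) % NLOCKS];
}

/* An emulated guest atomic increment then becomes: */
static void guest_atomic_inc(uint32_t *host_ptr, uint64_t guest_paddr)
{
    pthread_spinlock_t *lock = lock_for(guest_paddr);

    pthread_spin_lock(lock);
    (*host_ptr)++;
    pthread_spin_unlock(lock);
}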

Advantages:

+ Scalability. No unrelated atomics (e.g. atomics on the same page)
  can interfere with each other. Of course if the guest code
  has false sharing (i.e. atomics on the same cache line), then
  there's not much the host can do about that.
  This is an improved version over what I sent in May:
https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
  Performance numbers are below.

+ No requirements on the capabilities of the host machine, e.g.
  no need for a host cmpxchg instruction. That is, we'd have no problem
  running x86 code on a weaker host (say ARM/PPC) although of course we'd
  have to sprinkle quite a few memory barriers.  Note that the current
  MTTCG relies on cmpxchg(), which would be insufficient to run x86 code
  on ARM/PPC since that cmpxchg could very well race with a regular store
  (whereas in x86 it cannot).

+ Works unchanged for both system and user modes. As far as I can
  tell the TLB-based approach that Alvise is working on couldn't
  be used without the TLB--correct me if I'm wrong, it's been
  quite some time since I looked at that work.

Disadvantages:
- Overhead is added to every guest store. Depending on how frequent
  stores are, this can end up being significant single-threaded
  overhead (I've measured anywhere from a few percent up to ~50%).

  Note that this overhead applies to strong memory models such
  as x86, since the ISA can deal with concurrent stores and atomic
  instructions. Weaker memory models such as ARM/PPC's wouldn't have this
  overhead.

  * Performance
I've used four C/C++ benchmarks from synchrobench:
  https://github.com/gramoli/synchrobench
I'm running them with these arguments: -u 0 -f 1 -d 1 -t $n_threads
Here are two comparisons:
* usermode vs. native http://imgur.com/RggzgyU
* qemu-system vs qemu-KVM http://imgur.com/H9iH06B
(full-system is run with -m 4096).

Throughput is normalised for each of the four configurations over their
throughput with 1 thread.

To measure the single-thread overhead of instrumenting writes I used
two apps from PARSEC, both with the 'large' input:

[Note that for the multithreaded tests I did not use PARSEC; it doesn't
 scale at all on large systems]

blackscholes 1 thread, ~8% of stores per instruction:
pbonzini/mttcg+Patches1-5:  62.922099012 seconds ( +-  0.05% )
+entire patchset:   67.680987626 seconds ( +-  0.35% )
That's about an 8% perf overhead.

swaptions 1 thread, ~7% of stores per instruction:
pbonzini/mttcg+Patches1-5:  144.542495834 seconds ( +-  0.49% )
+entire patchset:   157.673401200 seconds ( +-  0.25% )
That's about a 9% perf overhead.

[Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req

2015-08-24 Thread Emilio G. Cota
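The barriers pair up as the classic message-passing pattern: the writer
orders its payload stores before the flag store (smp_wmb), the reader
orders the flag load before its payload loads (smp_rmb). A standalone C11
sketch of the same pairing, with made-up names:

#include <stdatomic.h>

static unsigned interrupt_request;      /* payload */
static _Atomic int tcg_exit_req;        /* flag */

static void request_exit(unsigned mask)
{
    interrupt_request = mask;           /* payload store */
    /* release ordering plays the role of smp_wmb() here */
    atomic_store_explicit(&tcg_exit_req, 1, memory_order_release);
}

static int should_exit(unsigned *mask)
{
    /* acquire ordering plays the role of smp_rmb() here */
    if (atomic_load_explicit(&tcg_exit_req, memory_order_acquire)) {
        *mask = interrupt_request;      /* payload load */
        return 1;
    }
    return 0;
}
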
Signed-off-by: Emilio G. Cota 
---
 include/exec/gen-icount.h | 1 +
 translate-all.c   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 05d89d3..f429821 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -16,6 +16,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
 
 exitreq_label = gen_new_label();
 flag = tcg_temp_new_i32();
+tcg_gen_smp_rmb();
 tcg_gen_ld_i32(flag, cpu_env,
offsetof(CPUState, tcg_exit_req) - ENV_OFFSET);
 tcg_gen_brcondi_i32(TCG_COND_NE, flag, 0, exitreq_label);
diff --git a/translate-all.c b/translate-all.c
index 12eaed7..76a0be8 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1614,6 +1614,7 @@ static void tcg_handle_interrupt(CPUState *cpu, int mask)
 cpu_abort(cpu, "Raised interrupt while not in I/O function");
 }
 } else {
+smp_wmb();
 cpu->tcg_exit_req = 1;
 }
 }
@@ -1791,6 +1792,7 @@ void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
 void cpu_interrupt(CPUState *cpu, int mask)
 {
 atomic_or(&cpu->interrupt_request, mask);
+smp_wmb();
 cpu->tcg_exit_req = 1;
 }
 
-- 
1.9.1




[Qemu-devel] [RFC 20/38] tcg/i386: implement fences

2015-08-24 Thread Emilio G. Cota
The three fences share the two-byte opcode 0f ae; the third byte emitted
is the ModRM value selecting LFENCE (e8), MFENCE (f0) or SFENCE (f8).

Signed-off-by: Emilio G. Cota 
---
 tcg/i386/tcg-target.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 887f22f..6600c45 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1123,6 +1123,13 @@ static void tcg_out_jmp(TCGContext *s, tcg_insn_unit *dest)
 tcg_out_branch(s, 0, dest);
 }
 
+static inline void tcg_out_fence(TCGContext *s, uint8_t op)
+{
+tcg_out8(s, 0x0f);
+tcg_out8(s, 0xae);
+tcg_out8(s, op);
+}
+
 #if defined(CONFIG_SOFTMMU)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * int mmu_idx, uintptr_t ra)
@@ -2088,6 +2095,16 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
 }
 break;
 
+case INDEX_op_fence_load:
+tcg_out_fence(s, 0xe8);
+break;
+case INDEX_op_fence_full:
+tcg_out_fence(s, 0xf0);
+break;
+case INDEX_op_fence_store:
+tcg_out_fence(s, 0xf8);
+break;
+
 case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
 case INDEX_op_mov_i64:
 case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
@@ -2226,6 +2243,9 @@ static const TCGTargetOpDef x86_op_defs[] = {
 { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
 { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
 #endif
+{ INDEX_op_fence_load, { } },
+{ INDEX_op_fence_store, { } },
+{ INDEX_op_fence_full, { } },
 { -1 },
 };
 
-- 
1.9.1




[Qemu-devel] [RFC 16/38] aie: add module for Atomic Instruction Emulation

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 Makefile.target|  1 +
 aie.c  | 57 ++
 include/qemu/aie.h | 49 ++
 translate-all.c|  2 ++
 4 files changed, 109 insertions(+)
 create mode 100644 aie.c
 create mode 100644 include/qemu/aie.h

diff --git a/Makefile.target b/Makefile.target
index 3e7aafd..840e257 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -85,6 +85,7 @@ all: $(PROGS) stap
 #
 # cpu emulator library
 obj-y = exec.o translate-all.o cpu-exec.o
+obj-y += aie.o
 obj-y += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
 obj-$(CONFIG_TCG_INTERPRETER) += tci.o
 obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
diff --git a/aie.c b/aie.c
new file mode 100644
index 000..588c02b
--- /dev/null
+++ b/aie.c
@@ -0,0 +1,57 @@
+/*
+ * Atomic instruction emulation (AIE).
+ * This applies to LL/SC and higher-order atomic instructions.
+ * More info:
+ *   http://en.wikipedia.org/wiki/Load-link/store-conditional
+ */
+#include "qemu-common.h"
+#include "qemu/radix-tree.h"
+#include "qemu/thread.h"
+#include "qemu/aie.h"
+
+#if defined(CONFIG_USER_ONLY)
+# define AIE_FULL_ADDR_BITS  TARGET_VIRT_ADDR_SPACE_BITS
+#else
+#if HOST_LONG_BITS < TARGET_PHYS_ADDR_SPACE_BITS
+/* in this case QEMU restricts the maximum RAM size to fit in the host */
+# define AIE_FULL_ADDR_BITS  HOST_LONG_BITS
+#else
+# define AIE_FULL_ADDR_BITS  TARGET_PHYS_ADDR_SPACE_BITS
+#endif
+#endif /* CONFIG_USER_ONLY */
+
+#define AIE_ADDR_BITS  (AIE_FULL_ADDR_BITS - AIE_DISCARD_BITS)
+#define AIE_RADIX 8
+
+QemuRadixTree aie_rtree;
+unsigned long *aie_bm;
+
+static void *aie_entry_init(unsigned long index)
+{
+AIEEntry *entry;
+
+entry = qemu_memalign(64, sizeof(*entry));
+qemu_spin_init(&entry->lock);
+entry->bm_set = false;
+return entry;
+}
+
+AIEEntry *aie_entry_get_lock(hwaddr paddr)
+{
+aie_addr_t idx = to_aie(paddr);
+AIEEntry *e;
+
+e = qemu_radix_tree_find_alloc(&aie_rtree, idx, aie_entry_init, qemu_vfree);
+qemu_spin_lock(&e->lock);
+if (!e->bm_set) {
+set_bit_atomic(idx & (AIE_BM_NR_ITEMS - 1), aie_bm);
+e->bm_set = true;
+}
+return e;
+}
+
+void aie_init(void)
+{
+qemu_radix_tree_init(&aie_rtree, AIE_ADDR_BITS, AIE_RADIX);
+aie_bm = bitmap_new(AIE_BM_NR_ITEMS);
+}
diff --git a/include/qemu/aie.h b/include/qemu/aie.h
new file mode 100644
index 000..667f36c
--- /dev/null
+++ b/include/qemu/aie.h
@@ -0,0 +1,49 @@
+/*
+ * Atomic instruction emulation (AIE)
+ */
+#ifndef AIE_H
+#define AIE_H
+
+#include "qemu/radix-tree.h"
+#include "qemu/thread.h"
+#include "qemu/bitops.h"
+
+#include "exec/hwaddr.h"
+
+typedef hwaddr aie_addr_t;
+
+typedef struct AIEEntry AIEEntry;
+
+struct AIEEntry {
+union {
+struct {
+QemuSpin lock;
+bool bm_set;
+};
+uint8_t pad[64];
+};
+} __attribute((aligned(64)));
+
+#define AIE_DISCARD_BITS 6
+
+#define AIE_BM_BITS 21
+#define AIE_BM_NR_ITEMS BIT(AIE_BM_BITS)
+
+extern QemuRadixTree aie_rtree;
+extern unsigned long *aie_bm;
+
+static inline aie_addr_t to_aie(hwaddr paddr)
+{
+return paddr >> AIE_DISCARD_BITS;
+}
+
+void aie_init(void);
+
+AIEEntry *aie_entry_get_lock(hwaddr addr);
+
+static inline bool aie_entry_exists(hwaddr addr)
+{
+return test_bit(to_aie(addr) & (AIE_BM_NR_ITEMS - 1), aie_bm);
+}
+
+#endif /* AIE_H */
diff --git a/translate-all.c b/translate-all.c
index b873d5c..f07547e 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -62,6 +62,7 @@
 #include "translate-all.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
+#include "qemu/aie.h"
 
 //#define DEBUG_TB_INVALIDATE
 //#define DEBUG_FLUSH
@@ -730,6 +731,7 @@ void tcg_exec_init(unsigned long tb_size)
 tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
 tcg_register_jit(tcg_ctx.code_gen_buffer, tcg_ctx.code_gen_buffer_size);
 page_init();
+aie_init();
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
 /* There's no guest base to take into account, so go ahead and
initialize the prologue now.  */
-- 
1.9.1




[Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 include/qemu/rcu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 7df1e86..f6d1d56 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -71,7 +71,7 @@ struct rcu_reader_data {
 /* Data used by reader only */
 unsigned depth;
 
-/* Data used for registry, protected by rcu_gp_lock */
+/* Data used for registry, protected by rcu_registry_lock */
 QLIST_ENTRY(rcu_reader_data) node;
 };
 
-- 
1.9.1




[Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c| 34 --
 include/qom/cpu.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index fd57b9c..a1700ac 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -371,6 +371,29 @@ static void cpu_handle_debug_exception(CPUState *cpu)
 cc->debug_excp_handler(cpu);
 }
 
+#ifdef CONFIG_SOFTMMU
+static inline void cpu_exit_loop_lock(CPUState *cpu)
+{
+qemu_mutex_lock_iothread();
+cpu->cpu_loop_exit_locked = true;
+}
+
+static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
+{
+if (cpu->cpu_loop_exit_locked) {
+cpu->cpu_loop_exit_locked = false;
+qemu_mutex_unlock_iothread();
+}
+}
+
+#else
+static inline void cpu_exit_loop_lock(CPUState *cpu)
+{ }
+
+static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
+{ }
+#endif
+
 /* main execution loop */
 
 int cpu_exec(CPUState *cpu)
@@ -452,12 +475,8 @@ int cpu_exec(CPUState *cpu)
 for(;;) {
 interrupt_request = cpu->interrupt_request;
 if (unlikely(interrupt_request)) {
-/* FIXME: this needs to take the iothread lock.
- * For this we need to find all places in
- * cc->cpu_exec_interrupt that can call cpu_loop_exit,
- * and call qemu_unlock_iothread_mutex() there.  Else,
- * add a flag telling cpu_loop_exit() to unlock it.
- */
+cpu_exit_loop_lock(cpu);
+
 if (unlikely(cpu->singlestep_enabled & SSTEP_NOIRQ)) {
 /* Mask out external interrupts for this step. */
 interrupt_request &= ~CPU_INTERRUPT_SSTEP_MASK;
@@ -503,6 +522,8 @@ int cpu_exec(CPUState *cpu)
the program flow was changed */
 next_tb = 0;
 }
+
+cpu_exit_loop_lock_reset(cpu);
 }
 if (unlikely(cpu->exit_request)) {
 cpu->exception_index = EXCP_INTERRUPT;
@@ -609,6 +630,7 @@ int cpu_exec(CPUState *cpu)
 env = &x86_cpu->env;
 #endif
 tb_lock_reset();
+cpu_exit_loop_lock_reset(cpu);
 }
 } /* for(;;) */
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 1d97b63..dbe0438 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -270,6 +270,7 @@ struct CPUState {
 bool created;
 bool stop;
 bool stopped;
+bool cpu_loop_exit_locked;
 volatile sig_atomic_t exit_request;
 uint32_t interrupt_request;
 int singlestep_enabled;
-- 
1.9.1




[Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop

2015-08-24 Thread Emilio G. Cota
Otherwise after an exception we end up in a deadlock.

Signed-off-by: Emilio G. Cota 
---
 bsd-user/mmap.c | 12 
 cpu-exec.c  |  1 +
 include/exec/exec-all.h |  2 ++
 linux-user/mmap.c   |  8 
 4 files changed, 23 insertions(+)

diff --git a/bsd-user/mmap.c b/bsd-user/mmap.c
index 092bf7f..b37a8f5 100644
--- a/bsd-user/mmap.c
+++ b/bsd-user/mmap.c
@@ -48,6 +48,14 @@ void mmap_unlock(void)
 }
 }
 
+void mmap_lock_reset(void)
+{
+while (mmap_lock_count) {
+mmap_lock_count--;
+pthread_mutex_unlock(&mmap_mutex);
+}
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
@@ -72,6 +80,10 @@ void mmap_lock(void)
 void mmap_unlock(void)
 {
 }
+
+void mmap_lock_reset(void)
+{
+}
 #endif
 
 /* NOTE: all the constants are the HOST ones, but addresses are target. */
diff --git a/cpu-exec.c b/cpu-exec.c
index a1700ac..f758928 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -630,6 +630,7 @@ int cpu_exec(CPUState *cpu)
 env = &x86_cpu->env;
 #endif
 tb_lock_reset();
+mmap_lock_reset();
 cpu_exit_loop_lock_reset(cpu);
 }
 } /* for(;;) */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b1934bf..3b8399a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -334,6 +334,7 @@ void tlb_fill(CPUState *cpu, target_ulong addr, int is_write, int mmu_idx,
 #if defined(CONFIG_USER_ONLY)
 void mmap_lock(void);
 void mmap_unlock(void);
+void mmap_lock_reset(void);
 
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 {
@@ -342,6 +343,7 @@ static inline tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong
 #else
 static inline void mmap_lock(void) {}
 static inline void mmap_unlock(void) {}
+static inline void mmap_lock_reset(void) {}
 
 /* cputlb.c */
 tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr);
diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 78e1b2d..8ee80f5 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -51,6 +51,14 @@ void mmap_unlock(void)
 }
 }
 
+void mmap_lock_reset(void)
+{
+if (mmap_lock_count) {
+mmap_lock_count = 0;
+pthread_mutex_unlock(&mmap_mutex);
+}
+}
+
 /* Grab lock to make sure things are in a consistent state after fork().  */
 void mmap_fork_start(void)
 {
-- 
1.9.1




[Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock

2015-08-24 Thread Emilio G. Cota
This paves the way for a lockless tb_find_fast.
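
For reference, the read-retry/write-exclusive discipline a seqlock gives
us, as a simplified self-contained C11 sketch; this is not QEMU's
include/qemu/seqlock.h, and the protected data must itself be accessed
with atomic loads/stores:

#include <stdatomic.h>

typedef struct {
    atomic_uint sequence;               /* odd while a write is in flight */
} SeqLock;

static unsigned seq_read_begin(SeqLock *sl)
{
    unsigned s;

    while ((s = atomic_load(&sl->sequence)) & 1) {
        /* a writer is active: spin */
    }
    return s;
}

static int seq_read_retry(SeqLock *sl, unsigned start)
{
    return atomic_load(&sl->sequence) != start;
}

/* Writers must already be serialized, e.g. by tb_jmp_cache_lock. */
static void seq_write_lock(SeqLock *sl)
{
    atomic_fetch_add(&sl->sequence, 1); /* odd: readers spin or retry */
}

static void seq_write_unlock(SeqLock *sl)
{
    atomic_fetch_add(&sl->sequence, 1); /* even again: reads can settle */
}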

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c|  8 +++-
 exec.c|  2 ++
 include/qom/cpu.h | 15 +++
 qom/cpu.c |  2 +-
 translate-all.c   | 32 +++-
 5 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index f758928..5ad578d 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -334,7 +334,9 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
 }
 
 /* we add the TB in the virtual pc hash table */
+seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
 cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
+seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
 return tb;
 }
 
@@ -343,13 +345,17 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu)
 CPUArchState *env = (CPUArchState *)cpu->env_ptr;
 TranslationBlock *tb;
 target_ulong cs_base, pc;
+unsigned int version;
 int flags;
 
 /* we record a subset of the CPU state. It will
always be the same before a given translated block
is executed. */
 cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
+do {
+version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
+tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
+} while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
 if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
  tb->flags != flags)) {
 tb = tb_find_slow(cpu, pc, cs_base, flags);
diff --git a/exec.c b/exec.c
index edf2236..ae6f416 100644
--- a/exec.c
+++ b/exec.c
@@ -577,6 +577,8 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 int cpu_index;
 Error *local_err = NULL;
 
+qemu_mutex_init(&cpu->tb_jmp_cache_lock);
+seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
 #ifndef CONFIG_USER_ONLY
 cpu->as = &address_space_memory;
 cpu->thread_id = qemu_get_thread_id();
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index dbe0438..f383c24 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -27,6 +27,7 @@
 #include "exec/hwaddr.h"
 #include "exec/memattrs.h"
 #include "qemu/queue.h"
+#include "qemu/seqlock.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
 
@@ -287,6 +288,13 @@ struct CPUState {
 
 void *env_ptr; /* CPUArchState */
 struct TranslationBlock *current_tb;
+/*
+ * The seqlock here is needed because not all updates are to a single
+ * entry; sometimes we want to atomically clear all entries that belong to
+ * a given page, e.g. when flushing said page.
+ */
+QemuMutex tb_jmp_cache_lock;
+QemuSeqLock tb_jmp_cache_sequence;
 struct TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
 
 struct GDBRegisterState *gdb_regs;
@@ -342,6 +350,13 @@ extern struct CPUTailQ cpus;
 
 extern __thread CPUState *current_cpu;
 
+static inline void cpu_tb_jmp_cache_clear(CPUState *cpu)
+{
+seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
+memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
+seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
+}
+
 /**
  * cpu_paging_enabled:
  * @cpu: The CPU whose state is to be inspected.
diff --git a/qom/cpu.c b/qom/cpu.c
index ac19710..5e72e7a 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -251,7 +251,7 @@ static void cpu_common_reset(CPUState *cpu)
 cpu->icount_decr.u32 = 0;
 cpu->can_do_io = 1;
 cpu->exception_index = -1;
-memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
+cpu_tb_jmp_cache_clear(cpu);
 }
 
 static bool cpu_common_has_work(CPUState *cs)
diff --git a/translate-all.c b/translate-all.c
index 76a0be8..668b43a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -863,7 +863,7 @@ void tb_flush(CPUState *cpu)
 tcg_ctx.tb_ctx.nb_tbs = 0;
 
 CPU_FOREACH(cpu) {
-memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+cpu_tb_jmp_cache_clear(cpu);
 }
 
 memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, 
sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
@@ -988,6 +988,27 @@ static inline void tb_jmp_remove(TranslationBlock *tb, int n)
 }
 }
 
+static inline void tb_jmp_cache_entry_clear(CPUState *cpu, TranslationBlock *tb)
+{
+unsigned int version;
+unsigned int h;
+bool hit = false;
+
+h = tb_jmp_cache_hash_func(tb->pc);
+do {
+version = seqlock_read_begin(&cpu->tb_jmp_cache_sequence);
+hit = cpu->tb_jmp_cache[h] == tb;
+} while (seqlock_read_retry(&cpu->tb_jmp_cache_sequence, version));
+
+if (hit) {
+seqlock_write_lock(&cpu->tb_jmp_cache_sequence);
+if (likely(cpu->tb_jmp_cache[h] == tb)) {
+cpu->tb_jmp_cache[h] = NULL;
+}
+seqlock_write_unlock(&cpu->tb_jmp_cache_sequence);
+}
+}



[Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/translate-all.c b/translate-all.c
index df65c83..e7b4a31 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -252,6 +252,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
 int64_t ti;
 #endif
 
+assert(have_tb_lock);
+
 #ifdef CONFIG_PROFILER
 ti = profile_getclock();
 #endif
@@ -442,6 +444,10 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 void **lp;
 int i;
 
+#ifdef CONFIG_SOFTMMU
+assert(have_tb_lock);
+#endif
+
 /* Level 1.  Always allocated.  */
 lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
 
@@ -767,6 +773,8 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 {
 TranslationBlock *tb;
 
+assert(have_tb_lock);
+
 if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks ||
 (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) >=
  tcg_ctx.code_gen_buffer_max_size) {
@@ -781,6 +789,8 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 /* Called with tb_lock held.  */
 void tb_free(TranslationBlock *tb)
 {
+assert(have_tb_lock);
+
 /* In practice this is mostly used for single use temporary TB
Ignore the hard cases and just back up if this TB happens to
be the last one generated.  */
@@ -933,6 +943,8 @@ static void tb_page_check(void)
 TranslationBlock *tb;
 int i, flags1, flags2;
 
+assert(have_tb_lock);
+
 for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
 TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
 
@@ -1034,6 +1046,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 unsigned int n1;
 TranslationBlock *tb1, *tb2;
 
+assert(have_tb_lock);
+
 /* Now remove the TB from the hash list, so that tb_find_slow
  * cannot find it anymore.
  */
@@ -1120,6 +1134,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 target_ulong virt_page2;
 int code_gen_size;
 
+assert(have_tb_lock);
+
 phys_pc = get_page_addr_code(env, pc);
 if (use_icount) {
 cflags |= CF_USE_ICOUNT;
@@ -1428,6 +1444,10 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 bool page_already_protected;
 #endif
 
+#ifdef CONFIG_SOFTMMU
+assert(have_tb_lock);
+#endif
+
 tb->page_addr[n] = page_addr;
 p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
 tb->page_next[n] = p->first_tb;
@@ -1486,6 +1506,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 unsigned int h;
 TBPhysHashSlot *slot;
 
+#ifdef CONFIG_SOFTMMU
+assert(have_tb_lock);
+#endif
+
 /* add in the physical hash table */
 h = tb_phys_hash_func(phys_pc);
 slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
@@ -1527,6 +1551,8 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 uintptr_t v;
 TranslationBlock *tb;
 
+assert(have_tb_lock);
+
 if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
 return NULL;
 }
@@ -1579,6 +1605,8 @@ void tb_check_watchpoint(CPUState *cpu)
 {
 TranslationBlock *tb;
 
+assert(have_tb_lock);
+
 tb = tb_find_pc(cpu->mem_io_pc);
 if (tb) {
 /* We can use retranslation to find the PC.  */
-- 
1.9.1




[Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 aie-helper.c  |   3 +-
 linux-user/main.c |   4 +-
 target-i386/cpu.h |   3 -
 target-i386/excp_helper.c |   7 ++
 target-i386/helper.h  |   6 +-
 target-i386/mem_helper.c  |  39 +++--
 target-i386/translate.c   | 217 --
 7 files changed, 162 insertions(+), 117 deletions(-)

diff --git a/aie-helper.c b/aie-helper.c
index 7521150..a3faf04 100644
--- a/aie-helper.c
+++ b/aie-helper.c
@@ -82,7 +82,8 @@ void HELPER(aie_unlock__done)(CPUArchState *env)
 
 void HELPER(aie_ld_pre)(CPUArchState *env, target_ulong vaddr)
 {
-if (likely(!env->aie_lock_enabled) || env->aie_locked) {
+assert(env->aie_lock_enabled);
+if (env->aie_locked) {
 return;
 }
 aie_ld_lock_ret(env, vaddr, GETRA());
diff --git a/linux-user/main.c b/linux-user/main.c
index fd06ce9..98ebe19 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -279,9 +279,9 @@ void cpu_loop(CPUX86State *env)
 target_siginfo_t info;
 
 for(;;) {
-cpu_exec_start(cs);
+cs->running = true;
 trapnr = cpu_x86_exec(cs);
-cpu_exec_end(cs);
+cs->running = false;
 switch(trapnr) {
 case 0x80:
 /* linux syscall from int $0x80 */
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 3655ff3..ead2832 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1318,9 +1318,6 @@ static inline MemTxAttrs cpu_get_mem_attrs(CPUX86State *env)
 void cpu_set_mxcsr(CPUX86State *env, uint32_t val);
 void cpu_set_fpuc(CPUX86State *env, uint16_t val);
 
-/* mem_helper.c */
-void helper_lock_init(void);
-
 /* svm_helper.c */
 void cpu_svm_check_intercept_param(CPUX86State *env1, uint32_t type,
uint64_t param);
diff --git a/target-i386/excp_helper.c b/target-i386/excp_helper.c
index 99fca84..141cab4 100644
--- a/target-i386/excp_helper.c
+++ b/target-i386/excp_helper.c
@@ -96,6 +96,13 @@ static void QEMU_NORETURN raise_interrupt2(CPUX86State *env, int intno,
 {
 CPUState *cs = CPU(x86_env_get_cpu(env));
 
+if (unlikely(env->aie_locked)) {
+helper_aie_unlock__done(env);
+}
+if (unlikely(env->aie_lock_enabled)) {
+env->aie_lock_enabled = false;
+}
+
 if (!is_int) {
 cpu_svm_check_intercept_param(env, SVM_EXIT_EXCP_BASE + intno,
   error_code);
diff --git a/target-i386/helper.h b/target-i386/helper.h
index 74308f4..7d92140 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -1,8 +1,8 @@
 DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
-DEF_HELPER_0(lock, void)
-DEF_HELPER_0(unlock, void)
+DEF_HELPER_1(lock_enable, void, env)
+DEF_HELPER_1(lock_disable, void, env)
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
 DEF_HELPER_2(divb_AL, void, env, tl)
@@ -217,3 +217,5 @@ DEF_HELPER_3(rcrl, tl, env, tl, tl)
 DEF_HELPER_3(rclq, tl, env, tl, tl)
 DEF_HELPER_3(rcrq, tl, env, tl, tl)
 #endif
+
+#include "qemu/aie-helper.h"
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 8bf0da2..60abc8a 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -21,38 +21,21 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 
-/* broken thread support */
+#include "aie-helper.c"
 
-#if defined(CONFIG_USER_ONLY)
-QemuMutex global_cpu_lock;
-
-void helper_lock(void)
-{
-qemu_mutex_lock(&global_cpu_lock);
-}
-
-void helper_unlock(void)
-{
-qemu_mutex_unlock(&global_cpu_lock);
-}
-
-void helper_lock_init(void)
-{
-qemu_mutex_init(&global_cpu_lock);
-}
-#else
-void helper_lock(void)
+void helper_lock_enable(CPUX86State *env)
 {
+env->aie_lock_enabled = true;
 }
 
-void helper_unlock(void)
-{
-}
-
-void helper_lock_init(void)
+void helper_lock_disable(CPUX86State *env)
 {
+assert(env->aie_lock_enabled);
+if (env->aie_locked) {
+h_aie_unlock__done(env);
+}
+env->aie_lock_enabled = false;
 }
-#endif
 
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
@@ -60,6 +43,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 int eflags;
 
 eflags = cpu_cc_compute_all(env, CC_OP);
+aie_ld_lock_ret(env, a0, GETRA());
 d = cpu_ldq_data(env, a0);
 if (d == (((uint64_t)env->regs[R_EDX] << 32) | 
(uint32_t)env->regs[R_EAX])) {
 cpu_stq_data(env, a0, ((uint64_t)env->regs[R_ECX] << 32) | 
(uint32_t)env->regs[R_EBX]);
@@ -71,6 +55,7 @@ void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 env->regs[R_EAX] = (uint32_t)d;
 eflags &= ~CC_Z;
 }
+helper_aie_unlock__done(env);
 CC_SRC = eflags;
 }
 
@@ -84,6 +69,7 @@ void helper_cmpxchg16b(CPUX86State *env, target_ulong a0

[Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush

2015-08-24 Thread Emilio G. Cota
While at it, add an assertion in tb_flush to check for tb_lock
being held.

Signed-off-by: Emilio G. Cota 
---
 translate-all.c | 40 +++-
 1 file changed, 11 insertions(+), 29 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index f3f7fb2..378517d 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -869,30 +869,24 @@ static void page_flush_tb(void)
 }
 }
 
-#ifdef CONFIG_USER_ONLY
-void tb_flush_safe(CPUState *cpu)
+static void tb_flush_work(void *arg)
 {
-tb_flush(cpu);
-}
-#else
-static void tb_flush_work(void *opaque)
-{
-CPUState *cpu = opaque;
-tb_flush(cpu);
-}
+CPUState *cpu = arg;
 
-void tb_flush_safe(CPUState *cpu)
-{
+/*
+ * Really no need to acquire tb_lock since all other threads are
+ * asleep; let's just acquire it to pass the assertion in
+ * tb_flush and for the wmb when unlocking.
+ */
+tb_lock_nocheck();
 tb_flush(cpu);
-async_run_safe_work_on_cpu(cpu, tb_flush_work, cpu);
+tb_unlock();
 }
-#endif
 
 /* flush all the translation blocks */
-/* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUState *cpu)
 {
-tb_lock();
+assert(have_tb_lock);
 
 #if defined(DEBUG_FLUSH)
 printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
@@ -919,8 +913,6 @@ void tb_flush(CPUState *cpu)
 /* XXX: flush processor icache at this point if cache flush is
expensive */
 tcg_ctx.tb_ctx.tb_flush_count++;
-
-tb_unlock();
 }
 
 #ifdef DEBUG_TB_CHECK
@@ -1164,17 +1156,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 tb = tb_alloc(pc);
 if (!tb) {
 /* flush must be done */
-#ifdef CONFIG_USER_ONLY
-/* FIXME: kick all other CPUs out also for user-mode emulation.  */
-tb_flush(cpu);
-mmap_unlock();
-#else
-tb_flush_safe(cpu);
-#endif
-cpu_loop_exit(cpu);
-tb_flush(cpu);
-/* cannot fail at this point */
-tb = tb_alloc(pc);
+cpu_tcg_sched_work(cpu, tb_flush_work, cpu);
 }
 
 tb->tc_ptr = tcg_ctx.code_gen_ptr;
-- 
1.9.1




[Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cputlb.c | 39 +++
 1 file changed, 11 insertions(+), 28 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index d81a4eb..717a856 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -145,41 +145,24 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
 tb_flush_jmp_cache(cpu, addr);
 }
 
-struct TLBFlushPageParams {
-CPUState *cpu;
-target_ulong addr;
-};
-
-static void tlb_flush_page_async_work(void *opaque)
+static void __tlb_flush_page_all(void *arg)
 {
-struct TLBFlushPageParams *params = opaque;
+target_ulong addr = *(target_ulong *)arg;
+CPUState *cpu;
 
-tlb_flush_page(params->cpu, params->addr);
-g_free(params);
+CPU_FOREACH(cpu) {
+tlb_flush_page(cpu, addr);
+}
+g_free(arg);
 }
 
 void tlb_flush_page_all(target_ulong addr)
 {
-CPUState *cpu;
-struct TLBFlushPageParams *params;
+target_ulong *arg = g_malloc(sizeof(*arg));
 
-CPU_FOREACH(cpu) {
-#if 0 /* !MTTCG */
-tlb_flush_page(cpu, addr);
-#else
-if (qemu_cpu_is_self(cpu)) {
-/* async_run_on_cpu handle this case but this just avoid a malloc
- * here.
- */
-tlb_flush_page(cpu, addr);
-} else {
-params = g_malloc(sizeof(struct TLBFlushPageParams));
-params->cpu = cpu;
-params->addr = addr;
-async_run_on_cpu(cpu, tlb_flush_page_async_work, params);
-}
-#endif /* MTTCG */
-}
+*arg = addr;
+tb_lock();
+cpu_tcg_sched_work(current_cpu, __tlb_flush_page_all, arg);
 }
 
 /* update the TLBs so that writes to code in the virtual page 'addr'
-- 
1.9.1




[Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST

2015-08-24 Thread Emilio G. Cota
This avoids the chance of reading a corrupted list of CPUs in usermode.

Note: this breaks hw/ppc/spapr due to the removal of CPU_FOREACH_REVERSE.

Signed-off-by: Emilio G. Cota 
---
 exec.c   | 16 ++--
 include/qom/cpu.h| 15 +++
 linux-user/main.c|  2 +-
 linux-user/syscall.c |  2 +-
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/exec.c b/exec.c
index ae6f416..58cd096 100644
--- a/exec.c
+++ b/exec.c
@@ -87,7 +87,7 @@ static MemoryRegion io_mem_unassigned;
 
 #endif
 
-struct CPUTailQ cpus = QTAILQ_HEAD_INITIALIZER(cpus);
+struct CPUTailQ cpus = QLIST_HEAD_INITIALIZER(cpus);
 /* current CPU in the current thread. It is only valid inside
cpu_exec() */
 __thread CPUState *current_cpu;
@@ -596,7 +596,19 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 #endif
 return;
 }
-QTAILQ_INSERT_TAIL(&cpus, cpu, node);
+/* poor man's QLIST_INSERT_TAIL_RCU */
+if (QLIST_EMPTY_RCU(&cpus)) {
+QLIST_INSERT_HEAD_RCU(&cpus, cpu, node);
+} else {
+CPUState *some_cpu;
+
+CPU_FOREACH(some_cpu) {
+if (QLIST_NEXT_RCU(some_cpu, node) == NULL) {
+QLIST_INSERT_AFTER_RCU(some_cpu, cpu, node);
+break;
+}
+}
+}
 #if defined(CONFIG_USER_ONLY)
 cpu_list_unlock();
 #endif
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index f383c24..ab484be 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -30,6 +30,7 @@
 #include "qemu/seqlock.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
+#include "qemu/rcu_queue.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
  void *opaque);
@@ -300,7 +301,7 @@ struct CPUState {
 struct GDBRegisterState *gdb_regs;
 int gdb_num_regs;
 int gdb_num_g_regs;
-QTAILQ_ENTRY(CPUState) node;
+QLIST_ENTRY(CPUState) node;
 
 /* ice debug support */
 QTAILQ_HEAD(breakpoints_head, CPUBreakpoint) breakpoints;
@@ -338,15 +339,13 @@ struct CPUState {
 volatile sig_atomic_t tcg_exit_req;
 };
 
-QTAILQ_HEAD(CPUTailQ, CPUState);
+QLIST_HEAD(CPUTailQ, CPUState);
 extern struct CPUTailQ cpus;
-#define CPU_NEXT(cpu) QTAILQ_NEXT(cpu, node)
-#define CPU_FOREACH(cpu) QTAILQ_FOREACH(cpu, &cpus, node)
+#define CPU_NEXT(cpu) QLIST_NEXT_RCU(cpu, node)
+#define CPU_FOREACH(cpu) QLIST_FOREACH_RCU(cpu, &cpus, node)
 #define CPU_FOREACH_SAFE(cpu, next_cpu) \
-QTAILQ_FOREACH_SAFE(cpu, &cpus, node, next_cpu)
-#define CPU_FOREACH_REVERSE(cpu) \
-QTAILQ_FOREACH_REVERSE(cpu, &cpus, CPUTailQ, node)
-#define first_cpu QTAILQ_FIRST(&cpus)
+QLIST_FOREACH_SAFE_RCU(cpu, &cpus, node, next_cpu)
+#define first_cpu QLIST_FIRST_RCU(&cpus)
 
 extern __thread CPUState *current_cpu;
 
diff --git a/linux-user/main.c b/linux-user/main.c
index 98ebe19..3e10bd8 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -121,7 +121,7 @@ void fork_end(int child)
Discard information about the parent threads.  */
 CPU_FOREACH_SAFE(cpu, next_cpu) {
 if (cpu != thread_cpu) {
-QTAILQ_REMOVE(&cpus, thread_cpu, node);
+QLIST_REMOVE_RCU(thread_cpu, node);
 }
 }
 pending_cpus = 0;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 732936f..e166313 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5604,7 +5604,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 
 cpu_list_lock();
 /* Remove the CPU from the list.  */
-QTAILQ_REMOVE(&cpus, cpu, node);
+QLIST_REMOVE(cpu, node);
 cpu_list_unlock();
 ts = cpu->opaque;
 if (ts->child_tidptr) {
-- 
1.9.1




[Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cputlb.c | 41 +++--
 1 file changed, 11 insertions(+), 30 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 1b3673e..d81a4eb 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -73,43 +73,24 @@ void tlb_flush(CPUState *cpu, int flush_global)
 tlb_flush_count++;
 }
 
-struct TLBFlushParams {
-CPUState *cpu;
-int flush_global;
-};
-
-static void tlb_flush_async_work(void *opaque)
+static void __tlb_flush_all(void *arg)
 {
-struct TLBFlushParams *params = opaque;
+CPUState *cpu;
+int flush_global = *(int *)arg;
 
-tlb_flush(params->cpu, params->flush_global);
-g_free(params);
+CPU_FOREACH(cpu) {
+tlb_flush(cpu, flush_global);
+}
+g_free(arg);
 }
 
 void tlb_flush_all(int flush_global)
 {
-CPUState *cpu;
-struct TLBFlushParams *params;
+int *arg = g_malloc(sizeof(*arg));
 
-#if 0 /* MTTCG */
-CPU_FOREACH(cpu) {
-tlb_flush(cpu, flush_global);
-}
-#else
-CPU_FOREACH(cpu) {
-if (qemu_cpu_is_self(cpu)) {
-/* async_run_on_cpu handle this case but this just avoid a malloc
- * here.
- */
-tlb_flush(cpu, flush_global);
-} else {
-params = g_malloc(sizeof(struct TLBFlushParams));
-params->cpu = cpu;
-params->flush_global = flush_global;
-async_run_on_cpu(cpu, tlb_flush_async_work, params);
-}
-}
-#endif /* MTTCG */
+*arg = flush_global;
+tb_lock();
+cpu_tcg_sched_work(current_cpu, __tlb_flush_all, arg);
 }
 
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
-- 
1.9.1




[Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically

2015-08-24 Thread Emilio G. Cota
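Once every VCPU runs in its own thread, interrupt_request is
read-modify-written concurrently, and a plain "flags |= mask" can lose a
bit set by another thread between its load and its store. A standalone
C11 sketch of the before/after, with made-up names:

#include <stdatomic.h>

static _Atomic unsigned interrupt_request;

/* was: cpu->interrupt_request |= mask (a racy load + or + store) */
static void set_mask(unsigned mask)
{
    atomic_fetch_or(&interrupt_request, mask);
}

/* was: cpu->interrupt_request &= ~mask */
static void reset_mask(unsigned mask)
{
    atomic_fetch_and(&interrupt_request, ~mask);
}
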
Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c |  9 ++---
 exec.c |  2 +-
 hw/openrisc/cputimer.c |  2 +-
 qom/cpu.c  |  4 ++--
 target-arm/helper-a64.c|  2 +-
 target-arm/helper.c|  2 +-
 target-i386/helper.c   |  2 +-
 target-i386/seg_helper.c   | 14 +++---
 target-i386/svm_helper.c   |  4 ++--
 target-openrisc/interrupt_helper.c |  2 +-
 target-openrisc/sys_helper.c   |  2 +-
 target-ppc/excp_helper.c   |  8 
 target-ppc/helper_regs.h   |  2 +-
 target-s390x/helper.c  |  2 +-
 target-unicore32/softmmu.c |  2 +-
 translate-all.c|  4 ++--
 16 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 2b9a447..fd57b9c 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -463,12 +463,14 @@ int cpu_exec(CPUState *cpu)
 interrupt_request &= ~CPU_INTERRUPT_SSTEP_MASK;
 }
 if (interrupt_request & CPU_INTERRUPT_DEBUG) {
-cpu->interrupt_request &= ~CPU_INTERRUPT_DEBUG;
+atomic_and(&cpu->interrupt_request,
+   ~CPU_INTERRUPT_DEBUG);
 cpu->exception_index = EXCP_DEBUG;
 cpu_loop_exit(cpu);
 }
 if (interrupt_request & CPU_INTERRUPT_HALT) {
-cpu->interrupt_request &= ~CPU_INTERRUPT_HALT;
+atomic_and(&cpu->interrupt_request,
+   ~CPU_INTERRUPT_HALT);
 cpu->halted = 1;
 cpu->exception_index = EXCP_HLT;
 cpu_loop_exit(cpu);
@@ -495,7 +497,8 @@ int cpu_exec(CPUState *cpu)
 /* Don't use the cached interrupt_request value,
do_interrupt may have updated the EXITTB flag. */
 if (cpu->interrupt_request & CPU_INTERRUPT_EXITTB) {
-cpu->interrupt_request &= ~CPU_INTERRUPT_EXITTB;
+atomic_and(&cpu->interrupt_request,
+   ~CPU_INTERRUPT_EXITTB);
 /* ensure that no TB jump will be modified as
the program flow was changed */
 next_tb = 0;
diff --git a/exec.c b/exec.c
index ff79c3d..edf2236 100644
--- a/exec.c
+++ b/exec.c
@@ -445,7 +445,7 @@ static int cpu_common_post_load(void *opaque, int version_id)
 
 /* 0x01 was CPU_INTERRUPT_EXIT. This line can be removed when the
version_id is increased. */
-cpu->interrupt_request &= ~0x01;
+atomic_and(&cpu->interrupt_request, ~0x01);
 tlb_flush(cpu, 1);
 
 return 0;
diff --git a/hw/openrisc/cputimer.c b/hw/openrisc/cputimer.c
index 9c54945..2451698 100644
--- a/hw/openrisc/cputimer.c
+++ b/hw/openrisc/cputimer.c
@@ -85,7 +85,7 @@ static void openrisc_timer_cb(void *opaque)
 CPUState *cs = CPU(cpu);
 
 cpu->env.ttmr |= TTMR_IP;
-cs->interrupt_request |= CPU_INTERRUPT_TIMER;
+atomic_or(&cs->interrupt_request, CPU_INTERRUPT_TIMER);
 }
 
 switch (cpu->env.ttmr & TTMR_M) {
diff --git a/qom/cpu.c b/qom/cpu.c
index 3841f0d..ac19710 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -108,7 +108,7 @@ static void cpu_common_get_memory_mapping(CPUState *cpu,
 
 void cpu_reset_interrupt(CPUState *cpu, int mask)
 {
-cpu->interrupt_request &= ~mask;
+atomic_and(&cpu->interrupt_request, ~mask);
 }
 
 void cpu_exit(CPUState *cpu)
@@ -242,7 +242,7 @@ static void cpu_common_reset(CPUState *cpu)
 log_cpu_state(cpu, cc->reset_dump_flags);
 }
 
-cpu->interrupt_request = 0;
+atomic_set(&cpu->interrupt_request, 0);
 cpu->current_tb = NULL;
 cpu->halted = 0;
 cpu->mem_io_pc = 0;
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 08c95a3..05c4ab6 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -541,6 +541,6 @@ void aarch64_cpu_do_interrupt(CPUState *cs)
 aarch64_restore_sp(env, new_el);
 
 env->pc = addr;
-cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index aea5a4b..fc8f5b3 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -4980,7 +4980,7 @@ void arm_cpu_do_interrupt(CPUState *cs)
 }
 env->regs[14] = env->regs[15] + offset;
 env->regs[15] = addr;
-cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+atomic_or(&cs->interrupt_request, CPU_INTERRUPT_EXITTB);
 }
 
 
diff --git a/target-i386/helper.c

[Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE"

2015-08-24 Thread Emilio G. Cota
This reverts commit 81f3053b77f7d3a4d9100c425cd8cec99ee7a3d4.

The interrupt raised by the change in the commit above
kills performance when running many idling VCPUs. For example,
on my 64-core host when running a workload where cores are
idling often (e.g. blackscholes), performance drops significantly
because threads are most of the time just exiting the CPU loop,
thereby causing great contention on the BQL.

Fix it by reverting to the old behaviour by which no
interrupt is raised, which shouldn't be an issue given that
we now have one thread per VCPU.

Signed-off-by: Emilio G. Cota 

Conflicts:
target-i386/misc_helper.c
---
 target-i386/helper.h  |  1 -
 target-i386/misc_helper.c | 22 ++
 target-i386/translate.c   |  5 +
 3 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/target-i386/helper.h b/target-i386/helper.h
index 7d92140..495d9f8 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -56,7 +56,6 @@ DEF_HELPER_2(sysret, void, env, int)
 DEF_HELPER_2(hlt, void, env, int)
 DEF_HELPER_2(monitor, void, env, tl)
 DEF_HELPER_2(mwait, void, env, int)
-DEF_HELPER_2(pause, void, env, int)
 DEF_HELPER_1(debug, void, env)
 DEF_HELPER_1(reset_rf, void, env)
 DEF_HELPER_3(raise_interrupt, void, env, int, int)
diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c
index 52c5d65..0389df2 100644
--- a/target-i386/misc_helper.c
+++ b/target-i386/misc_helper.c
@@ -556,15 +556,6 @@ void helper_rdmsr(CPUX86State *env)
 }
 #endif
 
-static void do_pause(X86CPU *cpu)
-{
-CPUState *cs = CPU(cpu);
-
-/* Just let another CPU run.  */
-cs->exception_index = EXCP_INTERRUPT;
-cpu_loop_exit(cs);
-}
-
 static void do_hlt(X86CPU *cpu)
 {
 CPUState *cs = CPU(cpu);
@@ -610,22 +601,13 @@ void helper_mwait(CPUX86State *env, int next_eip_addend)
 cs = CPU(cpu);
 /* XXX: not complete but not completely erroneous */
 if (cs->cpu_index != 0 || CPU_NEXT(cs) != NULL) {
-do_pause(cpu);
+/* more than one CPU: do not sleep because another CPU may
+   wake this one */
 } else {
 do_hlt(cpu);
 }
 }
 
-void helper_pause(CPUX86State *env, int next_eip_addend)
-{
-X86CPU *cpu = x86_env_get_cpu(env);
-
-cpu_svm_check_intercept_param(env, SVM_EXIT_PAUSE, 0);
-env->eip += next_eip_addend;
-
-do_pause(cpu);
-}
-
 void helper_debug(CPUX86State *env)
 {
 CPUState *cs = CPU(x86_env_get_cpu(env));
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 4d6030f..3b68660 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -6934,10 +6934,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
 goto do_xchg_reg_eax;
 }
 if (prefixes & PREFIX_REPZ) {
-gen_update_cc_op(s);
-gen_jmp_im(pc_start - s->cs_base);
-gen_helper_pause(cpu_env, tcg_const_i32(s->pc - pc_start));
-s->is_jmp = DISAS_TB_JUMP;
+gen_svm_check_intercept(s, pc_start, SVM_EXIT_PAUSE);
 }
 break;
 case 0x9b: /* fwait */
-- 
1.9.1




[Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag

2015-08-24 Thread Emilio G. Cota
This will allow us to safely look up TBs without taking any locks.
Note, however, that tb_lock protects the valid field, so if chaining
is an option we'll still have to acquire the lock.

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c  | 23 +++---
 include/exec/exec-all.h |  3 +--
 translate-all.c | 51 +
 3 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 5ad578d..826ec25 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -239,9 +239,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 tb_lock();
 tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base, orig_tb->flags,
  max_cycles | CF_NOCACHE);
-tb->orig_tb = (atomic_mb_read(&tcg_ctx.tb_ctx.tb_invalidated_flag)
-   ? NULL
-   : orig_tb);
+tb->orig_tb = orig_tb->valid ? orig_tb : NULL;
 cpu->current_tb = tb;
 tb_unlock();
 
@@ -268,8 +266,6 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 tb_page_addr_t phys_pc, phys_page1;
 target_ulong virt_page2;
 
-atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 0);
-
 /* find translated block using physical mappings */
 phys_pc = get_page_addr_code(env, pc);
 phys_page1 = phys_pc & TARGET_PAGE_MASK;
@@ -536,15 +532,6 @@ int cpu_exec(CPUState *cpu)
 cpu_loop_exit(cpu);
 }
 tb = tb_find_fast(cpu);
-/* Note: we do it here to avoid a gcc bug on Mac OS X when
-   doing it in tb_find_slow */
-if (atomic_mb_read(&tcg_ctx.tb_ctx.tb_invalidated_flag)) {
-/* as some TB could have been invalidated because
-   of memory exceptions while generating the code, we
-   must recompute the hash index here */
-next_tb = 0;
-atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 0);
-}
 if (qemu_loglevel_mask(CPU_LOG_EXEC)) {
 qemu_log("Trace %p [" TARGET_FMT_lx "] %s\n",
  tb->tc_ptr, tb->pc, lookup_symbol(tb->pc));
@@ -553,9 +540,13 @@ int cpu_exec(CPUState *cpu)
spans two pages, we cannot safely do a direct
jump. */
 if (next_tb != 0 && tb->page_addr[1] == -1) {
+TranslationBlock *next;
+
 tb_lock_recursive();
-tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
-next_tb & TB_EXIT_MASK, tb);
+next = (TranslationBlock *)(next_tb & ~TB_EXIT_MASK);
+if (tb->valid && next->valid) {
+tb_add_jump(next, next_tb & TB_EXIT_MASK, tb);
+}
 }
 /* The lock may not be taken if we went through the
  * fast lookup path and did not have to do any patching.
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 3b8399a..7e4aea7 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -178,6 +178,7 @@ struct TranslationBlock {
jmp_first */
 struct TranslationBlock *jmp_next[2];
 struct TranslationBlock *jmp_first;
+bool valid; /* protected by tb_lock */
 };
 
 #include "qemu/thread.h"
@@ -195,8 +196,6 @@ struct TBContext {
 /* statistics */
 int tb_flush_count;
 int tb_phys_invalidate_count;
-
-int tb_invalidated_flag;
 };
 
 void tb_free(TranslationBlock *tb);
diff --git a/translate-all.c b/translate-all.c
index 668b43a..94adcd0 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -791,6 +791,17 @@ static inline void invalidate_page_bitmap(PageDesc *p)
 #endif
 }
 
+static void tb_invalidate_all(void)
+{
+int i;
+
+for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
+TranslationBlock *tb = &tcg_ctx.tb_ctx.tbs[i];
+
+tb->valid = false;
+}
+}
+
 /* Set to NULL all the 'first_tb' fields in all PageDescs. */
 static void page_flush_tb_1(int level, void **lp)
 {
@@ -866,6 +877,7 @@ void tb_flush(CPUState *cpu)
 cpu_tb_jmp_cache_clear(cpu);
 }
 
+tb_invalidate_all();
 memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, 
sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
 page_flush_tb();
 
@@ -1021,11 +1033,6 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 tb_page_addr_t phys_pc;
 TranslationBlock *tb1, *tb2;
 
-/* Set the invalidated_flag first, to block patching a
- * jump to tb.  FIXME: invalidated_flag should be per TB.
- */
-atomic_mb_set(&tcg_ctx.tb_ctx.tb_invalidated_flag, 1);
-
 /* Now remove the TB from the hash list, so that tb_find_slow
  * cannot

[Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu

2015-08-24 Thread Emilio G. Cota
It has no callers left.

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c| 10 -
 cpus.c| 64 +--
 include/qom/cpu.h | 24 +
 3 files changed, 2 insertions(+), 96 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 378ce52..6d7bcc0 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -499,16 +499,6 @@ int cpu_exec(CPUState *cpu)
 }
 qemu_mutex_unlock(cpu->tcg_work_lock);
 
-#ifndef CONFIG_USER_ONLY
-/* FIXME: user-mode emulation probably needs a similar mechanism as well,
- * for example for tb_flush.
- */
-if (async_safe_work_pending()) {
-cpu->exit_request = 1;
-return 0;
-}
-#endif
-
 if (cpu->halted) {
 if (!cpu_has_work(cpu)) {
 return EXCP_HALTED;
diff --git a/cpus.c b/cpus.c
index 0fe6576..e1033e1 100644
--- a/cpus.c
+++ b/cpus.c
@@ -68,8 +68,6 @@
 int64_t max_delay;
 int64_t max_advance;
 
-int safe_work_pending; /* Number of safe work pending for all VCPUs. */
-
 bool cpu_is_stopped(CPUState *cpu)
 {
 return cpu->stopped || !runstate_is_running();
@@ -77,7 +75,7 @@ bool cpu_is_stopped(CPUState *cpu)
 
 static bool cpu_thread_is_idle(CPUState *cpu)
 {
-if (cpu->stop || cpu->queued_work_first || cpu->queued_safe_work_first) {
+if (cpu->stop || cpu->queued_work_first) {
 return false;
 }
 if (cpu_is_stopped(cpu)) {
@@ -860,63 +858,6 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
 qemu_cpu_kick(cpu);
 }
 
-void async_run_safe_work_on_cpu(CPUState *cpu, void (*func)(void *data),
-void *data)
-{
-struct qemu_work_item *wi;
-
-wi = g_malloc0(sizeof(struct qemu_work_item));
-wi->func = func;
-wi->data = data;
-wi->free = true;
-
-atomic_inc(&safe_work_pending);
-qemu_mutex_lock(&cpu->work_mutex);
-if (cpu->queued_safe_work_first == NULL) {
-cpu->queued_safe_work_first = wi;
-} else {
-cpu->queued_safe_work_last->next = wi;
-}
-cpu->queued_safe_work_last = wi;
-wi->next = NULL;
-wi->done = false;
-qemu_mutex_unlock(&cpu->work_mutex);
-
-CPU_FOREACH(cpu) {
-qemu_cpu_kick(cpu);
-}
-}
-
-static void flush_queued_safe_work(CPUState *cpu)
-{
-struct qemu_work_item *wi;
-
-if (cpu->queued_safe_work_first == NULL) {
-return;
-}
-
-qemu_mutex_lock(&cpu->work_mutex);
-while ((wi = cpu->queued_safe_work_first)) {
-cpu->queued_safe_work_first = wi->next;
-qemu_mutex_unlock(&cpu->work_mutex);
-wi->func(wi->data);
-qemu_mutex_lock(&cpu->work_mutex);
-wi->done = true;
-if (wi->free) {
-g_free(wi);
-}
-atomic_dec(&safe_work_pending);
-}
-cpu->queued_safe_work_last = NULL;
-qemu_mutex_unlock(&cpu->work_mutex);
-qemu_cond_broadcast(&qemu_work_cond);
-}
-
-bool async_safe_work_pending(void)
-{
-return safe_work_pending != 0;
-}
-
 static void flush_queued_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
@@ -953,9 +894,6 @@ static void qemu_wait_io_event_common(CPUState *cpu)
 cpu->stopped = true;
 qemu_cond_signal(&qemu_pause_cond);
 }
-qemu_mutex_unlock_iothread();
-flush_queued_safe_work(cpu);
-qemu_mutex_lock_iothread();
 flush_queued_work(cpu);
 }
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index aba7edb..79045b4 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -245,9 +245,8 @@ struct kvm_run;
  * @mem_io_pc: Host Program Counter at which the memory was accessed.
  * @mem_io_vaddr: Target virtual address at which the memory was accessed.
  * @kvm_fd: vCPU file descriptor for KVM.
- * @work_mutex: Lock to prevent multiple access to queued_* qemu_work_item.
+ * @work_mutex: Lock to prevent multiple access to queued_work_*.
  * @queued_work_first: First asynchronous work pending.
- * @queued_safe_work_first: First item of safe work pending.
  *
  * State of one CPU core or thread.
  */
@@ -290,7 +289,6 @@ struct CPUState {
 
 QemuMutex work_mutex;
 struct qemu_work_item *queued_work_first, *queued_work_last;
-struct qemu_work_item *queued_safe_work_first, *queued_safe_work_last;
 
 AddressSpace *as;
 struct AddressSpaceDispatch *memory_dispatch;
@@ -571,26 +569,6 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
- * async_run_safe_work_on_cpu:
- * @cpu: The vCPU to run on.
- * @func: The function to be executed.
- * @data: Data to pass to the function.
- *
- * Schedules the function @func for execution on the vCPU @cpu asynchronously
- * when all the VCPUs are outside their loop.
- */
-void async_run_safe_work_on_cpu(CPUState *cpu, v

[Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep

2015-08-24 Thread Emilio G. Cota
This is similar in intent to the async_safe_work mechanism. The main
differences are:

- Work is run on a single CPU thread *after* all others are put to sleep

- Sleeping threads are woken up by the worker thread upon completing its job

- A flag has been added to tcg_ctx so that only one thread can schedule
  work at a time. The flag is checked every time tb_lock is acquired.

- Handles the possibility of CPU threads being created after the existing
  CPUs are put to sleep. This is easily triggered with many threads on
  a many-core host in usermode.

- Works for both softmmu and usermode

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c  | 89 +
 exec.c  |  4 +++
 include/exec/exec-all.h |  5 +++
 include/qom/cpu.h   | 20 +++
 tcg/tcg.h   |  1 +
 translate-all.c | 23 -
 6 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index ff08da8..378ce52 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -393,6 +393,57 @@ static inline void cpu_exit_loop_lock_reset(CPUState *cpu)
 { }
 #endif
 
+static inline void cpu_sleep_other(CPUState *cpu, CPUState *curr)
+{
+assert(cpu->tcg_sleep_owner == NULL);
+qemu_mutex_lock(cpu->tcg_work_lock);
+cpu->tcg_sleep_requests++;
+cpu->tcg_sleep_owner = curr;
+qemu_mutex_unlock(cpu->tcg_work_lock);
+#ifdef CONFIG_SOFTMMU
+cpu_exit(cpu);
+#else
+/* cannot call cpu_exit(); cpu->exit_request is not for usermode */
+smp_wmb();
+cpu->tcg_exit_req = 1;
+#endif
+}
+
+/* call with no locks held */
+static inline void cpu_sleep_others(CPUState *curr)
+{
+CPUState *cpu;
+
+CPU_FOREACH(cpu) {
+if (cpu == curr) {
+continue;
+}
+cpu_sleep_other(cpu, curr);
+}
+/* wait until all other threads are out of the execution loop */
+synchronize_rcu();
+}
+
+static inline void cpu_wake_others(CPUState *curr)
+{
+CPUState *cpu;
+
+CPU_FOREACH(cpu) {
+if (cpu == curr) {
+continue;
+}
+if (cpu->tcg_sleep_owner != curr) {
+assert(!cpu->inited);
+continue;
+}
+qemu_mutex_lock(cpu->tcg_work_lock);
+cpu->tcg_sleep_requests--;
+cpu->tcg_sleep_owner = NULL;
+qemu_cond_signal(cpu->tcg_work_cond);
+qemu_mutex_unlock(cpu->tcg_work_lock);
+}
+}
+
 /* main execution loop */
 
 int cpu_exec(CPUState *cpu)
@@ -410,6 +461,44 @@ int cpu_exec(CPUState *cpu)
 
 current_cpu = cpu;
 
+/*
+ * Prevent threads that were created during a TCG work critical section
+ * (and that therefore didn't have cpu->tcg_work_owner set) from executing.
+ * We keep them out of the CPU loop until the tcg_work_pending flag
+ * is cleared.
+ */
+if (unlikely(!cpu->inited)) {
+tb_lock();
+tb_unlock();
+cpu->inited = true;
+}
+
+if (cpu->tcg_work_func) {
+cpu_sleep_others(cpu);
+/*
+ * At this point all existing threads are sleeping.
+ * With the check above we make sure that threads that might be
+ * concurrently added at this point won't execute until the end of the
+ * work window, so we can safely call the work function.
+ */
+cpu->tcg_work_func(cpu->tcg_work_arg);
+cpu->tcg_work_func = NULL;
+cpu->tcg_work_arg = NULL;
+
+/* mark the end of the TCG work critical section */
+tb_lock_nocheck();
+tcg_ctx.tb_ctx.work_pending = false;
+tb_unlock();
+cpu_wake_others(cpu);
+}
+
+qemu_mutex_lock(cpu->tcg_work_lock);
+assert(cpu->tcg_sleep_requests >= 0);
+while (unlikely(cpu->tcg_sleep_requests)) {
+qemu_cond_wait(cpu->tcg_work_cond, cpu->tcg_work_lock);
+}
+qemu_mutex_unlock(cpu->tcg_work_lock);
+
 #ifndef CONFIG_USER_ONLY
 /* FIXME: user-mode emulation probably needs a similar mechanism as well,
  * for example for tb_flush.
diff --git a/exec.c b/exec.c
index 58cd096..45a9761 100644
--- a/exec.c
+++ b/exec.c
@@ -579,6 +579,10 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 
 qemu_mutex_init(&cpu->tb_jmp_cache_lock);
 seqlock_init(&cpu->tb_jmp_cache_sequence, &cpu->tb_jmp_cache_lock);
+cpu->tcg_work_cond = g_malloc(sizeof(*cpu->tcg_work_cond));
+qemu_cond_init(cpu->tcg_work_cond);
+cpu->tcg_work_lock = g_malloc(sizeof(*cpu->tcg_work_lock));
+qemu_mutex_init(cpu->tcg_work_lock);
 #ifndef CONFIG_USER_ONLY
 cpu->as = &address_space_memory;
 cpu->thread_id = qemu_get_thread_id();
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 050e820..be8315c 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -198,6 +198,11 @@ stru

[Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups

2015-08-24 Thread Emilio G. Cota
Only grab tb_lock when new code has to be generated.

Note that due to the RCU usage we lose the ability to move
recently-found TBs to the beginning of the slot's list.
We could in theory try to do something smart about this,
but given that each CPU has a private tb_jmp_cache, it
might be OK to just leave it alone.

Signed-off-by: Emilio G. Cota 
---
 cpu-exec.c  | 21 +---
 include/exec/exec-all.h | 12 +---
 translate-all.c | 52 -
 3 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 826ec25..ff08da8 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -24,6 +24,7 @@
 #include "qemu/atomic.h"
 #include "qemu/timer.h"
 #include "exec/tb-hash.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/rcu.h"
 
 #if !defined(CONFIG_USER_ONLY)
@@ -261,7 +262,8 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
   uint64_t flags)
 {
 CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-TranslationBlock *tb, **ptb1;
+TBPhysHashSlot *slot;
+TranslationBlock *tb;
 unsigned int h;
 tb_page_addr_t phys_pc, phys_page1;
 target_ulong virt_page2;
@@ -270,12 +272,9 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 phys_pc = get_page_addr_code(env, pc);
 phys_page1 = phys_pc & TARGET_PAGE_MASK;
 h = tb_phys_hash_func(phys_pc);
-ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-for(;;) {
-tb = atomic_rcu_read(ptb1);
-if (!tb) {
-return NULL;
-}
+slot = &tcg_ctx.tb_ctx.tb_phys_hash[h];
+
+QLIST_FOREACH_RCU(tb, &slot->list, slot_node) {
 if (tb->pc == pc &&
 tb->page_addr[0] == phys_page1 &&
 tb->cs_base == cs_base &&
@@ -288,16 +287,14 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 TARGET_PAGE_SIZE;
 phys_page2 = get_page_addr_code(env, virt_page2);
 if (tb->page_addr[1] == phys_page2) {
-break;
+return tb;
 }
 } else {
-break;
+return tb;
 }
 }
-ptb1 = &tb->phys_hash_next;
 }
-
-return tb;
+return NULL;
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7e4aea7..050e820 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -155,8 +155,8 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x2
 
 void *tc_ptr;/* pointer to the translated code */
-/* next matching tb for physical address. */
-struct TranslationBlock *phys_hash_next;
+/* list node in slot of physically-indexed hash of translation blocks */
+QLIST_ENTRY(TranslationBlock) slot_node;
 /* original tb when cflags has CF_NOCACHE */
 struct TranslationBlock *orig_tb;
 /* first and second physical page containing code. The lower bit
@@ -183,12 +183,18 @@ struct TranslationBlock {
 
 #include "qemu/thread.h"
 
+typedef struct TBPhysHashSlot TBPhysHashSlot;
+
+struct TBPhysHashSlot {
+QLIST_HEAD(, TranslationBlock) list;
+};
+
 typedef struct TBContext TBContext;
 
 struct TBContext {
 
 TranslationBlock *tbs;
-TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+TBPhysHashSlot tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
 int nb_tbs;
 /* any access to the tbs or the page table must use this lock */
 QemuMutex tb_lock;
diff --git a/translate-all.c b/translate-all.c
index 94adcd0..df65c83 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -60,6 +60,7 @@
 #include "exec/cputlb.h"
 #include "exec/tb-hash.h"
 #include "translate-all.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "qemu/aie.h"
@@ -721,6 +722,17 @@ static inline void code_gen_alloc(size_t tb_size)
 qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_ctx_init(void)
+{
+int i;
+
+for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
+TBPhysHashSlot *slot = &tcg_ctx.tb_ctx.tb_phys_hash[i];
+
+QLIST_INIT(&slot->list);
+}
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
(in bytes) allocated to the translation buffer. Zero means default
size. */
@@ -731,6 +743,7 @@ void tcg_exec_init(unsigned long tb_size)
 tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
 tcg_register_jit(tcg_ctx.code_gen_buffer, tcg_ctx.code_gen_buffer_size);
 page_init();
+tb_ctx_init();
 aie_init();
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
 /* There's no guest base to take into account, so go ahead an

Re: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode

2015-08-24 Thread Emilio G. Cota
On Mon, Aug 24, 2015 at 18:08:37 +0200, Artyom Tarasenko wrote:
> On Mon, Aug 24, 2015 at 2:23 AM, Emilio G. Cota  wrote:
> >   * tb_lock must be held every time code is generated. The rationale is
> > that most of the time QEMU is executing code, not generating it.
> 
> While this is indeed true for an ideal case, currently there are
> situations where it's not:
>  running a g++ process under qemu-system-sparc64, comparable amounts
> of time are spent on executing and generating the code [1].
> Does this lock imply the translation performance won't gain anything
> when emulating a single core machine on a multi-core one?
>
> 1. https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg02194.html

AFAICT we can't say that's the desired TCG behavior, right? It seems we might
be translating more often than we should for sparc64:
  https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg02531.html

I'll run multi-programmed workloads (i.e. several instances running
at the same time, for instance doing a 'make -j' kernel build) on x86
to see how far up translation can go--in general I'd expect any multi-programmed
workload in full-system mode to require more translations than a
multi-threaded one, since in the latter the code is the same for all threads.

But really I'd only expect self-modifying code to be slow/non-scalable--and
I don't think we should worry about it too much.

If you can think of other workloads that might trigger more translations
than usual, please let me know.

> Does this lock imply the translation performance won't gain anything
> when emulating a single core machine on a multi-core one?

The goal so far has been to emulate each VCPU on its own thread; as you
can see from the perf results in this thread, this provides huge perf
gains when emulating multi-core guests on large enough hosts.

Code generation is done by the VCPU threads as they need it, and for
that they need to hold a lock to prevent corrupting TCG data structures
--for instance there's a single hash of TBs, and a single code_gen_buffer.
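
Schematically, the discipline is (a sketch, not verbatim code from the
series; tb_find_fast/tb_gen_code are the names used here):

    /* lockless lookup first; generate code under tb_lock only on a miss */
    tb = tb_find_fast(cpu);
    if (tb == NULL) {
        tb_lock();
        /* tb_gen_code() writes to the shared TB hash and code_gen_buffer */
        tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
        tb_unlock();
    }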

So to answer your question: speeding up a single-core guest on a multi-core
host is not something we're trying to do. If you think about it,
*if* the premise that QEMU is mostly executing (and not translating) code
holds true (and I'd say it holds for most workloads), then one host thread
per VCPU is the right design.

Thanks,

Emilio




[Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion

2015-08-24 Thread Emilio G. Cota
Note that the right place to call rcu_register_thread() is
do_cpu_loop() and not just in clone_func(), since the
original 'main' thread needs to call rcu_register_thread()
as well.

Signed-off-by: Emilio G. Cota 
---
 linux-user/qemu.h| 1 +
 linux-user/syscall.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 8af5e01..08e6609 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -203,6 +203,7 @@ void fork_end(int child);
 
 static inline void do_cpu_loop(CPUArchState *env)
 {
+rcu_register_thread();
 current_cpu = ENV_GET_CPU(env);
 cpu_loop(env);
 }
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 701c8fa..84909b4 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5614,6 +5614,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 thread_cpu = NULL;
 object_unref(OBJECT(cpu));
 g_free(ts);
+rcu_unregister_thread();
 pthread_exit(NULL);
 }
 #ifdef TARGET_GPROF
-- 
1.9.1




[Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop()

2015-08-24 Thread Emilio G. Cota
There are as many versions of cpu_loop as there are supported
architectures, so introduce a helper that is common to all of them.

Signed-off-by: Emilio G. Cota 
---
 linux-user/main.c| 2 +-
 linux-user/qemu.h| 6 ++
 linux-user/syscall.c | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 3e10bd8..24c53ad 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4405,7 +4405,7 @@ int main(int argc, char **argv, char **envp)
 }
 gdb_handlesig(cpu, 0);
 }
-cpu_loop(env);
+do_cpu_loop(env);
 /* never exits */
 return 0;
 }
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index e8606b2..8af5e01 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -201,6 +201,12 @@ void init_qemu_uname_release(void);
 void fork_start(void);
 void fork_end(int child);
 
+static inline void do_cpu_loop(CPUArchState *env)
+{
+current_cpu = ENV_GET_CPU(env);
+cpu_loop(env);
+}
+
 /* Creates the initial guest address space in the host memory space using
  * the given host start address hint and size.  The guest_start parameter
  * specifies the start address of the guest space.  guest_base will be the
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index c7062ab..701c8fa 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -4533,7 +4533,7 @@ static void *clone_func(void *arg)
 /* Wait until the parent has finshed initializing the tls state.  */
 pthread_mutex_lock(&clone_lock);
 pthread_mutex_unlock(&clone_lock);
-cpu_loop(env);
+do_cpu_loop(env);
 /* never exits */
 return NULL;
 }
-- 
1.9.1




[Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop()

2015-08-24 Thread Emilio G. Cota
Note: cannot compile bsd-user here (linux), please compile-test.

Signed-off-by: Emilio G. Cota 
---
 bsd-user/main.c | 2 +-
 bsd-user/qemu.h | 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/bsd-user/main.c b/bsd-user/main.c
index ee68daa..0bea358 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -1133,7 +1133,7 @@ int main(int argc, char **argv)
 gdbserver_start (gdbstub_port);
 gdb_handlesig(cpu, 0);
 }
-cpu_loop(env);
+do_cpu_loop(env);
 /* never exits */
 return 0;
 }
diff --git a/bsd-user/qemu.h b/bsd-user/qemu.h
index 5902614..751efd5 100644
--- a/bsd-user/qemu.h
+++ b/bsd-user/qemu.h
@@ -163,6 +163,12 @@ int get_osversion(void);
 void fork_start(void);
 void fork_end(int child);
 
+static inline void do_cpu_loop(CPUArchState *env)
+{
+current_cpu = ENV_GET_CPU(env);
+cpu_loop(env);
+}
+
 #include "qemu/log.h"
 
 /* strace.c */
-- 
1.9.1




[Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication

2015-08-24 Thread Emilio G. Cota
Signed-off-by: Emilio G. Cota 
---
 cpus.c | 32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/cpus.c b/cpus.c
index 81dda93..fd9e903 100644
--- a/cpus.c
+++ b/cpus.c
@@ -922,18 +922,23 @@ static void qemu_kvm_wait_io_event(CPUState *cpu)
 qemu_wait_io_event_common(cpu);
 }
 
+/* call with BQL held */
+static void qemu_cpu_thread_init_common(CPUState *cpu)
+{
+rcu_register_thread();
+qemu_thread_get_self(cpu->thread);
+cpu->thread_id = qemu_get_thread_id();
+cpu->can_do_io = 1;
+current_cpu = cpu;
+}
+
 static void *qemu_kvm_cpu_thread_fn(void *arg)
 {
 CPUState *cpu = arg;
 int r;
 
-rcu_register_thread();
-
 qemu_mutex_lock_iothread();
-qemu_thread_get_self(cpu->thread);
-cpu->thread_id = qemu_get_thread_id();
-cpu->can_do_io = 1;
-current_cpu = cpu;
+qemu_cpu_thread_init_common(cpu);
 
 r = kvm_init_vcpu(cpu);
 if (r < 0) {
@@ -970,13 +975,8 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
 sigset_t waitset;
 int r;
 
-rcu_register_thread();
-
 qemu_mutex_lock_iothread();
-qemu_thread_get_self(cpu->thread);
-cpu->thread_id = qemu_get_thread_id();
-cpu->can_do_io = 1;
-current_cpu = cpu;
+qemu_cpu_thread_init_common(cpu);
 
 sigemptyset(&waitset);
 sigaddset(&waitset, SIG_IPI);
@@ -1009,15 +1009,9 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
 CPUState *cpu = arg;
 
-rcu_register_thread();
-
 qemu_mutex_lock_iothread();
-qemu_thread_get_self(cpu->thread);
-
-cpu->thread_id = qemu_get_thread_id();
+qemu_cpu_thread_init_common(cpu);
 cpu->created = true;
-cpu->can_do_io = 1;
-current_cpu = cpu;
 
 qemu_cond_signal(&qemu_cpu_cond);
 
-- 
1.9.1




Re: [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create)

2015-08-24 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 20:23:41 -0400, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota 
> ---
>  linux-user/syscall.c | 2 ++
>  1 file changed, 2 insertions(+)

Just noticed that this patch is incomplete, since the 'main' thread
doesn't get to call rcu_register_thread()--only its children call it.

This is fixed in patch 3/4 I sent as a reply to your review of patch 3/38,
so you might want to discard this patch from your queue.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions

2015-08-24 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > On some parallel workloads this gives up to a 15% speed improvement.
> > 
> > Signed-off-by: Emilio G. Cota 
> > ---
> >  include/qemu/thread-posix.h | 47 ++
> >  include/qemu/thread.h   |  6 --
> >  util/qemu-thread-posix.c| 50 
> > +
> >  3 files changed, 52 insertions(+), 51 deletions(-)
(snip)
> Applied, but in the end the spinlock will probably simply use a simple
> test-and-test-and-set lock, or an MCS lock.  There is no need to use
> pthreads for this.

Agreed.

In fact in my tests I sometimes use concurrencykit [http://concurrencykit.org/]
to test lock alternatives (would love to be able to just add ck as a submodule
of qemu, but they do not support as many architectures as qemu does).

Note that fair locks (such as MCS) for user-space programs are not
necessarily a good idea when preemption is considered--and for usermode
we'd be forced (if we allowed MCS's to nest) to use per-lock stack variables
given that the number of threads is unbounded, which is pretty ugly.

If contention is a problem, a simple, fast spinlock combined with an exponential
backoff is already pretty good. Fairness is not a requirement (the cache
substrate of a NUMA machine isn't necessarily fair, is it?); scalability is.
If the algorithm in the guest requires fairness, the guest must use a fair lock
(e.g. MCS), and that works as intended when run natively or under qemu.
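
To make that concrete, here is a minimal sketch of the kind of lock I
mean (my own, with GCC/Clang atomic builtins; the "pause" hint is
x86-only):

    typedef struct {
        int locked;
    } fas_lock;

    static inline void fas_lock_acquire(fas_lock *l)
    {
        unsigned backoff = 1;
        unsigned i;

        /* fetch-and-swap: greedy and unfair, but cheap when uncontended */
        while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE)) {
            for (i = 0; i < backoff; i++) {
                __asm__ __volatile__("pause");
            }
            if (backoff < 1024) {
                backoff <<= 1;    /* exponential backoff, capped */
            }
        }
    }

    static inline void fas_lock_release(fas_lock *l)
    {
        __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
    }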

I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
program that spawns N threads, each of which performs 2**M atomic increments
on the same variable. That is, a degenerate worst-case kind of contention.
N varies from 1 to 64, and M=15 on all runs, 5 runs per experiment:

  http://imgur.com/XpYctyT
  With backoff, the per-access latency grows roughly linearly with the number of
  cores, i.e. this is scalable. The other two are clearly superlinear.
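
For reference, the benchmark boils down to something like this (a
reconstruction, not the exact program I ran; build with -pthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long counter;

    static void *worker(void *arg)
    {
        unsigned long i;

        for (i = 0; i < (1UL << 15); i++) {    /* M = 15 */
            __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int i, n = argc > 1 ? atoi(argv[1]) : 1;    /* N = 1..64 */
        pthread_t *th = calloc(n, sizeof(*th));

        for (i = 0; i < n; i++) {
            pthread_create(&th[i], NULL, worker, NULL);
        }
        for (i = 0; i < n; i++) {
            pthread_join(th[i], NULL);
        }
        printf("%lu\n", counter);
        return 0;
    }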

The fastest spinlock as per ck's documentation (for uncontended cases) is
the fetch-and-swap lock. I just re-ran the usermode experiments from yesterday
with fas and fas+exp.backoff:

  http://imgur.com/OK2WZg8
  There really isn't much difference among the three candidates.
  
In light of these results I see very little against going for a solution
with exponential backoff.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses

2015-08-24 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 19:02:30 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > This will be used by the atomic instruction emulation code.
> 
> Is this a fast path?  If not, we can use the existing addend field and
> convert the host address to a ram_addr_t easily.

On x86 this is a fast path because the helper that goes before every
store (aie_st_pre, p.17/38) checks whether any atomic operations on
the store's cache line have been done before. The check is done using
a bitmap (p.16/38); the input to the bitmap is the physical address with
the last 6 bits shifted out (I'm assuming 2**6=64-byte cache lines).
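
Roughly, with hypothetical names (the real helpers are in the aie
patches), the check looks like:

    /* one bit per 64-byte cache line, indexed by physical address */
    #define AIE_LINE_SHIFT 6    /* assuming 2**6 = 64-byte lines */

    static inline bool aie_line_has_atomics(const unsigned long *bitmap,
                                            hwaddr paddr)
    {
        return test_bit(paddr >> AIE_LINE_SHIFT, bitmap);
    }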

That said: what would the conversion via the addend field look like?
If it's just an addition it might be cheap enough--and we wouldn't
bloat CPUTLBEntry.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 20/38] tcg/i386: implement fences

2015-08-24 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 18:32:51 -0700, Paolo Bonzini wrote:
> 
> 
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > +case INDEX_op_fence_load:
> > +tcg_out_fence(s, 0xe8);
> > +break;
> > +case INDEX_op_fence_full:
> > +tcg_out_fence(s, 0xf0);
> > +break;
> > +case INDEX_op_fence_store:
> > +tcg_out_fence(s, 0xf8);
> > +break;
> > +
> 
> lfence and sfence are not needed in generated code; all loads are
> acquires and all stores are release on x86.

lfence and sfence here serve two purposes:

1) Template for other architectures
2) x86 code does sometimes have lfence/sfence (e.g. movntq+sfence),
   so I guessed they should remain in the translated code.
   If on x86 we always ignore the Write-Combining from the
   guest, maybe we could claim the l/sfence pair here is really unnecessary.
   I'm no x86 expert so for a first stab I decided to put them there.
   
   I didn't intend to translate say *all* PPC/ARM load barriers
   into lfences when generating x86, which I think is your point.

> Also, on targets that do not have MFENCE you want to generate something
> like "lock addl $0, (%esp)".

Good point, will fix.
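
For the record, in host terms that fallback is something like this
(sketch; the TCG backend would emit the equivalent bytes rather than
inline asm):

    /* a locked RMW on the stack acts as a full memory barrier on x86 */
    static inline void full_barrier_fallback(void)
    {
    #if defined(__SSE2__)
        __asm__ __volatile__("mfence" ::: "memory");
    #else
        __asm__ __volatile__("lock; addl $0, (%%esp)" ::: "memory");
    #endif
    }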

Thanks,

Emilio



Re: [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop()

2015-08-25 Thread Emilio G. Cota
On Mon, Aug 24, 2015 at 20:41:10 -0400, Emilio G. Cota wrote:
> Note: cannot compile bsd-user here (linux), please compile-test.
> 
> Signed-off-by: Emilio G. Cota 
> ---
(snip)
> diff --git a/bsd-user/qemu.h b/bsd-user/qemu.h
> index 5902614..751efd5 100644
> --- a/bsd-user/qemu.h
> +++ b/bsd-user/qemu.h
> @@ -163,6 +163,12 @@ int get_osversion(void);
>  void fork_start(void);
>  void fork_end(int child);
>  
> +static inline void do_cpu_loop(CPUArchState *env)
> +{

Here we should also call rcu_register_thread().

> +current_cpu = ENV_GET_CPU(env);
> +cpu_loop(env);
> +}
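
I.e., matching what patch 3/4 does for linux-user:

    static inline void do_cpu_loop(CPUArchState *env)
    {
        rcu_register_thread();
        current_cpu = ENV_GET_CPU(env);
        cpu_loop(env);
    }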

Emilio



Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions

2015-08-25 Thread Emilio G. Cota
On Mon, Aug 24, 2015 at 22:30:03 -0400, Emilio G. Cota wrote:
> On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> > On 23/08/2015 17:23, Emilio G. Cota wrote:
> (snip)
> > Applied, but in the end the spinlock will probably simply use a simple
> > test-and-test-and-set lock, or an MCS lock.  There is no need to use
> > pthreads for this.
(snip)
> Note that fair locks (such as MCS) for user-space programs are not
> necessarily a good idea when preemption is considered--and for usermode
> we'd be forced (if we allowed MCS's to nest) to use per-lock stack variables
> given that the number of threads is unbounded, which is pretty ugly.
> 
> If contention is a problem, a simple, fast spinlock combined with an
> exponential backoff is already pretty good. Fairness is not a requirement
> (the cache substrate of a NUMA machine isn't necessarily fair, is it?);
> scalability is. If the algorithm in the guest requires fairness, the guest
> must use a fair lock (e.g. MCS), and that works as intended when run
> natively or under qemu.
> 
> I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
> program that spawns N threads, each of which performs 2**M atomic
> increments on the same variable. That is, a degenerate worst-case kind of
> contention. N varies from 1 to 64, and M=15 on all runs, 5 runs per
> experiment:
> 
>   http://imgur.com/XpYctyT
>   With backoff, the per-access latency grows roughly linearly with the
>   number of cores, i.e. this is scalable. The other two are clearly
>   superlinear.

Just tried MCS, CLH and ticket spinlocks (with and without backoff).

They take essentially forever for this (admittedly worst-case) test; this
is not suprising when we realise that we're essentially trying to do
in software what the coherence protocol in the cache does (in hardware)
for us when using greedy spinlocks. I'm not even showing numbers because
around N=9 each experiment starts taking waay too long for me to wait.
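
For the curious, a minimal ticket lock looks like this (a sketch with
GCC/Clang builtins): the strict FIFO handoff is exactly what makes it
fair, and also what lines every waiter up behind a possibly-preempted
thread:

    typedef struct {
        unsigned next;     /* next ticket to hand out */
        unsigned owner;    /* ticket currently being served */
    } ticket_lock;

    static inline void ticket_lock_acquire(ticket_lock *l)
    {
        unsigned me = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);

        while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != me) {
            /* spin: all waiters poll the same cache line, in strict order */
        }
    }

    static inline void ticket_lock_release(ticket_lock *l)
    {
        __atomic_store_n(&l->owner, l->owner + 1, __ATOMIC_RELEASE);
    }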

> In light of these results I see very little against going for a solution
> with exponential backoff.

The only reason I can think of against this option is that we'd be altering
the dynamic behaviour of the emulated code. For example, code that is
clearly not scalable (such as the example above) would be made
to "scale" by the use of the backoffs (as a result the CPU that gets a lock
is very likely to grab it again right after unlocking). This makes me
uneasy; it makes intuitive sense to me that unscalable code should
not scale under QEMU, instead of us playing tricks that would confuse
users.

Emilio



Re: [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically

2015-08-25 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 18:09:48 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > Signed-off-by: Emilio G. Cota 
> > ---
> >  cpu-exec.c |  9 ++---
> >  exec.c |  2 +-
> >  hw/openrisc/cputimer.c |  2 +-
> >  qom/cpu.c  |  4 ++--
> >  target-arm/helper-a64.c|  2 +-
> >  target-arm/helper.c|  2 +-
> >  target-i386/helper.c   |  2 +-
> >  target-i386/seg_helper.c   | 14 +++---
> >  target-i386/svm_helper.c   |  4 ++--
> >  target-openrisc/interrupt_helper.c |  2 +-
> >  target-openrisc/sys_helper.c   |  2 +-
> >  target-ppc/excp_helper.c   |  8 
> >  target-ppc/helper_regs.h   |  2 +-
> >  target-s390x/helper.c  |  2 +-
> >  target-unicore32/softmmu.c |  2 +-
> >  translate-all.c|  4 ++--
> >  16 files changed, 33 insertions(+), 30 deletions(-)
> 
> Is this needed if you have patch 23 anyway?

Sorry, this should have been in the commit log.

This patch is needed as is. One real risk this is protecting
against is the call of cpu_interrupt(cpu_foo) when the calling
thread is not cpu_foo's thread--this write to interrupt_request
might race with other writes, e.g. another call to cpu_interrupt
from another thread, or the clearing of interrupt_request by
cpu_foo.
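
In code terms it's the classic lost update on a plain read-modify-write;
a sketch of the before/after:

    /* before: a plain |= can lose bits written by a concurrent thread */
    cpu->interrupt_request |= mask;

    /* after: the update is a single atomic RMW */
    atomic_or(&cpu->interrupt_request, mask);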

Patch 23 fixes another issue--bootup hangs without it. The amount
of code wrapped by the iothread lock can be reduced, though.
Will fix.

Thanks,

Emilio



Re: [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop

2015-08-25 Thread Emilio G. Cota
On Sun, Aug 23, 2015 at 19:01:39 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > Otherwise after an exception we end up in a deadlock.
> 
> Can you explain better the path that exits cpu_exec with the lock taken?

In fact I cannot :-) So please ignore this patch.

I wrote this while rebasing my code on top of your mttcg branch,
and then it was needed. However, patch 01/38 (which I wrote after
writing this patch) was the right fix, and I forgot to check whether
this patch was still necessary--turns out it wasn't.

> Also, let's remove the recursive locking by introducing "mmap_lock()
> already taken" variants of target_mprotect and target_mmap.

Will do.

Thanks,

Emilio



  1   2   3   4   5   6   7   8   9   10   >