Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

Torvald Riegel Wed, 19 Jun 2013 06:16:42 -0700

On Wed, 2013-06-19 at 11:31 +0200, Paolo Bonzini wrote:
> Il 18/06/2013 18:38, Torvald Riegel ha scritto:
> > I don't think that this is the conclusion here.  I strongly suggest to
> > just go with the C11/C++11 model, instead of rolling your own or trying
> > to replicate the Java model.  That would also allow you to just point to
> > the C11 model and any information / tutorials about it instead of having
> > to document your own (see the patch), and you can make use of any
> > (future) tool support (e.g., race detectors).
> 
> I'm definitely not rolling my own, but I think there is some value in
> using the Java model.  Warning: the explanation came out quite
> verbose... tl;dr at the end.


That's fine -- it is a nontrivial topic after all.  My reply is equally
lengthy :)

> 
> 
> One reason is that implementing SC for POWER is quite expensive,

Sure, but you don't have to use SC fences or atomics if you don't want
them.  Note that C11/C++11 as well as the __atomic* builtins allow you
to specify a memory order.  It's perfectly fine to use acquire fences or
release fences.  There shouldn't be just one kind of barrier/fence.
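To illustrate the point about choosing a memory order, here is a minimal
sketch in C11 of publishing data with a release store and reading it with
an acquire load, plus a variant that uses a stand-alone release fence.
All names (payload, ready, publish, try_consume) are made up for this
example and come from neither QEMU nor the patch.  The walk-through runs
in one thread only to show the calling convention; in real use the
producer and the consumer run in different threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data, published through the flag */
static atomic_bool ready;  /* static storage, so it starts out false */

/* Producer: write the data, then publish it with a release store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Alternative: a stand-alone release fence followed by a relaxed store. */
static void publish_with_fence(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *out once the data is published. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;  /* ordered after the producer's write to payload */
    return true;
}

/* Single-threaded walk-through of the protocol. */
static void usage_example(void)
{
    int v = 0;
    publish(42);
    assert(try_consume(&v) && v == 42);
}
```

Neither variant asks for sequential consistency, so on POWER a compiler
is free to use something cheaper than a full sync here.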

> while
> this is not the case for Java volatile (which I'm still not convinced is
> acq-rel, because it also orders volatile stores and volatile loads).

But you have the same choice with C11/C++11.  You seemed to be fine with
using acquire/release in your code, so if that's what you want, just use
it :)

I vaguely remember that the Java model gives stronger guarantees than
C/C++ in the presence of data races; something about no values
out-of-thin-air.  Maybe that's where the stronger-than-acq/rel memory
orders for volatile stores come from.

> People working on QEMU are often used to manually placed barriers on
> Linux, and Linux barriers do not fully give you seq-cst semantics.  They
> give you something much more similar to the Java model.

Note that there is a reason why C11/C++11 don't just have barriers
combined with ordinary memory accesses: The compiler needs to be aware
which accesses are sequential code (so it can assume that they are
data-race-free) and which are potentially concurrent with other accesses
to the same data.  When you just use normal memory accesses with
barriers, you're relying on non-specified behavior in a compiler, and
you have no guarantee that the compiler reads a value just once, for
example; you can try to make this very likely be correct by careful
placement of asm compiler barriers, but this is likely to be more
difficult than just using atomics, which will do the right thing.
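A small sketch of the "reads a value just once" issue, with hypothetical
names (table, shared_index, lookup).  If shared_index were a plain int,
the compiler could assume that no other thread modifies it and would be
allowed to load it again between the bounds check and the array access.
Marking the access as atomic, even with relaxed ordering, tells the
compiler that the location may be modified concurrently, so the value
copied into the local variable is the one used for both steps.

```c
#include <assert.h>
#include <stdatomic.h>

#define TABLE_SIZE 4
static const int table[TABLE_SIZE] = {10, 20, 30, 40};
static atomic_int shared_index;  /* may be changed by another thread */

/* Returns the table entry selected by shared_index, or -1 if out of range. */
static int lookup(void)
{
    /* One atomic load into a local; check and use see the same value. */
    int idx = atomic_load_explicit(&shared_index, memory_order_relaxed);
    if (idx < 0 || idx >= TABLE_SIZE)
        return -1;
    return table[idx];
}

/* Single-threaded walk-through. */
static void usage_example(void)
{
    atomic_store_explicit(&shared_index, 2, memory_order_relaxed);
    assert(lookup() == 30);
}
```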

Linux folks seem to be doing fine in this area, but they also seem to
mostly use existing concurrency abstractions such as locks or RCU.
Thus, maybe it's not too good an indication of the ease of use of their
style of expressing low-level synchronization.

> The Java model gives good performance and is easier to understand than
> the non-seqcst modes of atomic builtins.  It is pretty much impossible
> to understand the latter without a formal model;

I'm certainly biased because I've been looking at this a lot.  But I
believe that there are common synchronization idioms which are
relatively straightforward to understand.  Acquire/release pairs should
be one such case.  Locks are essentially the same everywhere.
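As an example of the acquire/release idiom, a minimal test-and-set
spinlock can be sketched with C11's atomic_flag.  This is only an
illustration with made-up names (lock_flag, spin_lock, and so on), not a
lock anyone should use in production: taking the lock is an acquire
operation, releasing it is a release operation, and everything done
inside the critical section is ordered between the two.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static int protected_counter;  /* only touched while holding the lock */

/* Returns true if the lock was free and is now held by the caller. */
static bool spin_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&lock_flag,
                                              memory_order_acquire);
}

static void spin_lock(void)
{
    while (!spin_trylock()) {
        /* busy-wait until the holder releases the lock */
    }
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

/* Single-threaded walk-through of one critical section. */
static void usage_example(void)
{
    int before = protected_counter;
    spin_lock();
    protected_counter++;
    spin_unlock();
    assert(protected_counter == before + 1);
}
```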

> I see the importance of
> a formal model, but at the same time it is hard not to appreciate the
> detailed-but-practical style of the Linux documentation.

I don't think the inherent complexity programmers *need* to deal with is
different because in the end, if you have two ways to express the same
ordering guarantees, you have to reason about the very same ordering
guarantees.  This is what makes up most of the complexity.  For example,
if you want to use acq/rel widely, you need to understand what this
means for your concurrent code; this won't change significantly
depending on how you express it.

I think the C++11 model is pretty compact, meaning no unnecessary fluff
(maybe one would need fewer memory orders, but those aren't too hard to
ignore).  I believe that if you'd want to specify the Linux model
including any implicit assumptions on the compilers, you'd end up with
the same level of detail in the model.

Maybe the issue that you see with C11/C++11 is that it offers more than
you actually need.  If you can summarize what kind of synchronization /
concurrent code you are primarily looking at, I can try to help outline
a subset of it (i.e., something like code style but just for
synchronization).

> Second, the Java model has very good "practical" documentation from
> sources I trust.  Note the part about trust: I found way too many Java
> tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.
> 
> Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
> (especially now that Richard and Paul completed my understanding of
> them).  There are substantially fewer practical documents for C11/C++11
> that are similarly authoritative.

That's true.  But both models are younger in the sense of being
available to practitioners.  Boehm's libatomic-ops library was close,
but compiler support for them wasn't available until recently.  But
maybe it's useful to look 1 or 2 years ahead; I suspect that there will
be much more documentation in the near future.

> I obviously trust Cambridge for
> C11/C++11, but their material is very concise or just refers to the
> formal model.

Yes, their publications are really about the model.  It's not a
tutorial, but useful for reference.  BTW, have you read their C++ paper
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
or the POPL paper?  The former has more detail (no page limit).

If you haven't yet, I suggest giving their cppmem tool a try too:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
It's a browser-based tool that shows you all the possible interleavings
of small pieces of synchronization code, together with a graph with the
relationships as in their model.  Very helpful to explore whether your
synchronization code does the right thing.  At least, I'm not aware of
anything like that for Java.

> The formal model is not what I want when my question is
> simply "why is lwsync good for acquire and release, but not for
> seqcst?", for example.

But that's a question that is not about the programming-language-level
memory model, but about POWER's memory model.  I suppose that's not
something you'd have to deal with frequently, or are you indeed
primarily interested in emulating a particular architecture's model?
The Cambridge group also has a formal model for POWER's memory model,
perhaps this can help answer this question.

> And the papers sometime refer to "private
> communication" between the authors and other people, which can be
> annoying.  Hans Boehm and Herb Sutter have good poster and slide
> material, but they do not have the same level of completeness as Paul's
> Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...
> 
> 
> Third, we must support old GCC (even as old as 4.2), so we need
> hand-written assembly for atomics anyway.

Perhaps you could make use of GCC's libatomic, or at least take code
from it.  Boehm's libatomic-ops might be another choice.  But I agree
that you can't just switch to the newer atomics support; it will take
some more time until its availability is the norm.

> This again goes back to
> documentation and the JSR-133 cookbook.  It not only gives you
> instructions on how to implement the model (which is also true for the
> Cambridge web pages on C11/C++11), but is also a good base for writing
> our own documentation.  It helped me understanding existing code using
> barriers, optimizing it, and putting this knowledge in words.  I just
> couldn't find anything as useful for C11/C++11.
> 
> 
> In short, the C11/C++11 model is not what most developers are used to
> here

I guess so.  But you also have to consider the legacy that you create.
I do think the C11/C++11 model will be used widely, and more and more
people will get used to it.  Thus, the question is whether you want to
deviate from what will likely be the common approach in the not too
distant future.

> hardware is not 100% mature for it (for example ARMv8 has seqcst
> load/store; perhaps POWER will grow that in time),

I'm not quite sure what you mean, or at least how what you might mean is
related to the choice between the C/C++ and Java models.

> is harder to
> optimize, and has (as of 2013) less "practical" documentation from
> sources I trust.
> 
> Besides, since what I'm using is weaker than SC, there's always the
> possibility of switching to SC in the future when enough of these issues
> are solved.

C11/C++11 is not _equal to_ SC.  It includes SC (because it has to
anyway to enable certain concurrent algorithms), and C++ has SC as
default, but this doesn't mean that the C/C++ model equates to SC
semantics.

> In case you really need SC _now_, it is easy to do it using
> fetch-and-add (for loads) or xchg (for stores).

Not on POWER and other architectures with similarly weak memory
models :)  On x86 and such the atomic read-modify-write ops give you
strong ordering, but that's an artifact of those models.  You can get SC
on most of the archs without having to use atomic read-modify-write ops.

If the fetch-and-add abstractions in your model always give you SC, then
that's more an indication that they don't allow you the fine-grained
control that you want, especially on POWER :)
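For comparison, in C11 the ordering of a fetch-and-add is chosen per
call.  The sketch below uses two hypothetical counters: one that only
collects statistics and therefore uses relaxed ordering, and one that
asks for sequential consistency by using the default (non-_explicit)
form.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong hits;     /* statistics only, no ordering needed */
static atomic_ulong tickets;  /* here full SC ordering is requested */

/* Atomic increment that imposes no ordering on surrounding accesses. */
static void count_hit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Returns the previous value; each caller gets a distinct ticket. */
static unsigned long take_ticket(void)
{
    /* Without _explicit, the order defaults to memory_order_seq_cst. */
    return atomic_fetch_add(&tickets, 1);
}

/* Single-threaded walk-through. */
static void usage_example(void)
{
    unsigned long t = take_ticket();
    assert(take_ticket() == t + 1);
}
```

On x86 both functions will typically compile to the same locked
instruction; on POWER the relaxed version can avoid the extra barriers.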

> 
> >> I will just not use __atomic_load/__atomic_store to implement the 
> >> primitives, and always express them in terms of memory barriers.
> > 
> > Why?  (If there's some QEMU-specific reason, just let me know; I know
> > little about QEMU..)
> 
> I guess I mentioned the QEMU-specific reasons above.
> 
> > I would assume that using the __atomic* builtins is just fine if they're
> > available.
> 
> It would implement slightly different semantics based on the compiler
> version, so I think it's dangerous.

If __atomic is available, the semantics don't change.  The __sync
builtins do provide a certain subset of the semantics you get with the
__atomic builtins, however.
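A rough sketch of how a wrapper could pick between the two builtin
families at compile time.  The name my_fetch_add is hypothetical, and
the feature test assumes that the compiler predefines __ATOMIC_SEQ_CST
whenever the __atomic builtins exist, which is the case for GCC 4.7 and
later and for Clang.  Because the __sync builtins always act as a full
barrier, the wrapper requests seq_cst in the __atomic branch so that both
branches give the caller the same guarantee.

```c
#include <assert.h>

#if defined(__ATOMIC_SEQ_CST)
/* Newer compilers: the memory order is an explicit argument. */
#define my_fetch_add(ptr, n) \
    __atomic_fetch_add((ptr), (n), __ATOMIC_SEQ_CST)
#else
/* Older GCC: __sync builtins only offer full-barrier semantics. */
#define my_fetch_add(ptr, n) \
    __sync_fetch_and_add((ptr), (n))
#endif

/* Single-threaded walk-through; both variants return the old value. */
static void usage_example(void)
{
    int counter = 5;
    int old = my_fetch_add(&counter, 2);
    assert(old == 5 && counter == 7);
}
```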


Torvald
