On Wed, 2013-06-19 at 11:31 +0200, Paolo Bonzini wrote:
> On 18/06/2013 18:38, Torvald Riegel wrote:
> > I don't think that this is the conclusion here. I strongly suggest to
> > just go with the C11/C++11 model, instead of rolling your own or trying
> > to replicate the Java model. That would also allow you to just point to
> > the C11 model and any information / tutorials about it instead of having
> > to document your own (see the patch), and you can make use of any
> > (future) tool support (e.g., race detectors).
>
> I'm definitely not rolling my own, but I think there is some value in
> using the Java model. Warning: the explanation came out quite
> verbose... tl;dr at the end.
That's fine -- it is a nontrivial topic after all. My reply is equally
lengthy :)

>
> One reason is that implementing SC for POWER is quite expensive,

Sure, but you don't have to use SC fences or atomics if you don't want
them. Note that C11/C++11 as well as the __atomic* builtins allow you
to specify a memory order. It's perfectly fine to use acquire fences or
release fences. There shouldn't be just one kind of barrier/fence.

> while
> this is not the case for Java volatile (which I'm still not convinced is
> acq-rel, because it also orders volatile stores and volatile loads).

But you have the same choice with C11/C++11. You seemed to be fine with
using acquire/release in your code, so if that's what you want, just use
it :)

I vaguely remember that the Java model gives stronger guarantees than
C/C++ in the presence of data races; something about no values
out-of-thin-air. Maybe that's where the stronger-than-acq/rel memory
orders for volatile stores come from.

> People working on QEMU are often used to manually placed barriers on
> Linux, and Linux barriers do not fully give you seq-cst semantics. They
> give you something much more similar to the Java model.

Note that there is a reason why C11/C++11 don't just have barriers
combined with ordinary memory accesses: The compiler needs to be aware
which accesses are sequential code (so it can assume that they are
data-race-free) and which are potentially concurrent with other accesses
to the same data. When you just use normal memory accesses with
barriers, you're relying on non-specified behavior in a compiler, and
you have no guarantee that the compiler reads a value just once, for
example; you can try to make this very likely be correct by careful
placement of asm compiler barriers, but this is likely to be more
difficult than just using atomics, which will do the right thing.

Linux folks seem to be doing fine in this area, but they also seem to
mostly use existing concurrency abstractions such as locks or RCU.
Thus, maybe it's not too good an indication of the ease-of-use of their
style of expressing low-level synchronization.

> The Java model gives good performance and is easier to understand than
> the non-seqcst modes of atomic builtins. It is pretty much impossible
> to understand the latter without a formal model;

I'm certainly biased because I've been looking at this a lot. But I
believe that there are common synchronization idioms which are
relatively straightforward to understand. Acquire/release pairs should
be one such case. Locks are essentially the same everywhere.

> I see the importance of
> a formal model, but at the same time it is hard not to appreciate the
> detailed-but-practical style of the Linux documentation.

I don't think the inherent complexity programmers *need* to deal with is
different because in the end, if you have two ways to express the same
ordering guarantees, you have to reason about the very same ordering
guarantees. This is what makes up most of the complexity. For example,
if you want to use acq/rel widely, you need to understand what this
means for your concurrent code; this won't change significantly
depending on how you express it.

I think the C++11 model is pretty compact, meaning no unnecessary fluff
(maybe one would need fewer memory orders, but those aren't too hard to
ignore). I believe that if you'd want to specify the Linux model
including any implicit assumptions on the compilers, you'd end up with
the same level of detail in the model.

Maybe the issue that you see with C11/C++11 is that it offers more than
you actually need. If you can summarize what kind of synchronization /
concurrent code you are primarily looking at, I can try to help outline
a subset of it (i.e., something like code style but just for
synchronization).

> Second, the Java model has very good "practical" documentation from
> sources I trust.
> Note the part about trust: I found way too many Java
> tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.
>
> Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
> (especially now that Richard and Paul completed my understanding of
> them). There are substantially fewer practical documents for C11/C++11
> that are similarly authoritative.

That's true. But both models are younger in the sense of being
available to practitioners. Boehm's libatomic-ops library was close,
but compiler support for them wasn't available until recently. But
maybe it's useful to look 1 or 2 years ahead; I suspect that there will
be much more documentation in the near future.

> I obviously trust Cambridge for
> C11/C++11, but their material is very concise or just refers to the
> formal model.

Yes, their publications are really about the model. It's not a
tutorial, but useful for reference. BTW, have you read their C++ paper
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
or the POPL paper? The former has more detail (no page limit).

If you haven't yet, I suggest giving their cppmem tool a try too:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
It's a browser-based tool that shows you all the possible interleavings
of small pieces of synchronization code, together with a graph with the
relationships as in their model. Very helpful to explore whether your
synchronization code does the right thing. At least, I'm not aware of
anything like that for Java.

> The formal model is not what I want when my question is
> simply "why is lwsync good for acquire and release, but not for
> seqcst?", for example.

But that's a question that is not about the programming-language-level
memory model, but about POWER's memory model. I suppose that's not
something you'd have to deal with frequently, or are you indeed
primarily interested in emulating a particular architecture's model?
The Cambridge group also has a formal model for POWER's memory model;
perhaps this can help answer this question.

> And the papers sometime refer to "private
> communication" between the authors and other people, which can be
> annoying. Hans Boehm and Herb Sutter have good poster and slide
> material, but they do not have the same level of completeness as Paul's
> Linux documentation. Paul _really_ has spoiled us "pure practitioners"...
>
> Third, we must support old GCC (even as old as 4.2), so we need
> hand-written assembly for atomics anyway.

Perhaps you could make use of GCC's libatomic, or at least take code
from it. Boehm's libatomic-ops might be another choice. But I agree
that you can't just switch to the newer atomics support; it will take
some more time until it being available is the norm.

> This again goes back to
> documentation and the JSR-133 cookbook. It not only gives you
> instructions on how to implement the model (which is also true for the
> Cambridge web pages on C11/C++11), but is also a good base for writing
> our own documentation. It helped me understanding existing code using
> barriers, optimizing it, and putting this knowledge in words. I just
> couldn't find anything as useful for C11/C++11.
>
> In short, the C11/C++11 model is not what most developers are used to
> here

I guess so. But you also have to consider the legacy that you create.
I do think the C11/C++11 model will be used widely, and more and more
people will get used to it. Thus, it's the question of whether you want
to deviate from what will likely be the common approach in the not too
distant future.

> hardware is not 100% mature for it (for example ARMv8 has seqcst
> load/store; perhaps POWER will grow that in time),

I'm not quite sure what you mean, or at least how what you might mean is
related to the choice between the C/C++ and Java models.

> is harder to
> optimize, and has (as of 2013) less "practical" documentation from
> sources I trust.
>
> Besides, since what I'm using is weaker than SC, there's always the
> possibility of switching to SC in the future when enough of these issues
> are solved.

C11/C++11 is not _equal to_ SC. It includes SC (because it has to
anyway to enable certain concurrent algorithms), and C++ has SC as
default, but this doesn't mean that the C/C++ model equates to SC
semantics.

> In case you really need SC _now_, it is easy to do it using
> fetch-and-add (for loads) or xchg (for stores).

Not on POWER and other architectures with similarly weak memory
models :) On x86 and such the atomic read-modify-write ops give you
strong ordering, but that's an artifact of those models. You can get SC
on most of the archs without having to use atomic read-modify-write ops.
If the fetch-and-add abstractions in your model always give you SC, then
that's more an indication that they don't allow you the fine-grained
control that you want, especially on POWER :)

>
> >> I will just not use __atomic_load/__atomic_store to implement the
> >> primitives, and always express them in terms of memory barriers.
> >
> > Why? (If there's some QEMU-specific reason, just let me know; I know
> > little about QEMU..)
>
> I guess I mentioned the QEMU-specific reasons above.
>
> > I would assume that using the __atomic* builtins is just fine if they're
> > available.
>
> It would implement slightly different semantics based on the compiler
> version, so I think it's dangerous.

If __atomic is available, the semantics don't change. The __sync
builtins do provide a certain subset of the semantics you get with the
__atomic builtins, however.

Torvald
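
For readers following the thread, here is a minimal sketch of the
acquire/release pairing discussed above, assuming a compiler that
provides C11 <stdatomic.h> and a platform with POSIX threads. All names
(payload, ready, producer, consumer, run_message_passing) are
hypothetical and chosen only for illustration; nothing here is taken
from QEMU or from the patch under discussion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not QEMU code. */
static int payload;        /* ordinary, non-atomic data            */
static atomic_int ready;   /* atomic flag that publishes payload   */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;          /* plain store, sequenced before the release */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;

    /* The acquire load pairs with the release store in producer().
     * Because ready is atomic, the compiler must reload it on every
     * iteration instead of hoisting the read out of the loop. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin */
    }
    *out = payload;        /* no data race: ordered after the store of 42 */
    return NULL;
}

/* Runs one producer and one consumer thread.
 * Returns the value the consumer read, or -1 if a thread could not
 * be created. */
int run_message_passing(void)
{
    pthread_t p, c;
    int seen = 0;

    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);

    if (pthread_create(&p, NULL, producer, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&c, NULL, consumer, &seen) != 0) {
        pthread_join(p, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Calling run_message_passing() from a main() built with something like
-std=c11 -pthread should return 42. No SC fence is requested anywhere:
the only ordering asked for is the release on the store and the acquire
on the load, which is the kind of fine-grained choice the memory-order
argument is meant to allow.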