On Fri, Apr 15, 2011 at 15:04, Tobias Burnus <bur...@net-b.de> wrote:
> On 04/15/2011 11:52 AM, Janne Blomqvist wrote:
>>
>> Q1: Is __sync_synchronize() sufficient?
>> I don't think this is correct. __sync_synchronize() just issues a
>> hardware memory fence instruction. That is, it prevents loads and
>> stores from moving past the fence *on the processor that executes the
>> fence instruction*. There is no synchronization with other
>> processors.
>
> Well, I was thinking of (a) assumptions regarding the value for the compiler
> when doing optimizations. And (b) making sure that the variables are really
> loaded from memory and do not remain in a register. -- How the data ends up
> in memory is a different question; for the current library version, SYNC ALL
> would be a __sync_synchronize() followed by a (wrapped) call to MPI_Barrier
> - and possibly some additional actions.
>
>>> Q2: Can this be optimized in some way?
>>
>> Probably not. For general issues with the shared-memory model, perhaps
>> shared memory Co-arrays can piggyback on the work being done for the
>> C++0x memory model, see
>
> I think you try to solve a different problem than I want.
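To make sure we are talking about the same operation: I read the library-based
SYNC ALL you describe as roughly the sketch below. caf_sync_all and the stat
handling are made-up names, not the actual libcaf interface; only the
fence-plus-barrier sequence is the point.

/* Hypothetical shape of SYNC ALL in the MPI-based library version: a
   full memory barrier followed by a (wrapped) MPI_Barrier.  Purely an
   illustration.  */
#include <mpi.h>

void
caf_sync_all (int *stat)
{
  /* Compiler/hardware fence: loads and stores cannot be moved across
     this point on the image executing it.  */
  __sync_synchronize ();

  /* Synchronization with the other images happens in the barrier;
     MPI_Barrier itself has to issue whatever fences its transport
     needs.  */
  int ierr = MPI_Barrier (MPI_COMM_WORLD);

  if (stat)
    *stat = (ierr == MPI_SUCCESS ? 0 : 1);
}
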
Indeed, I assumed you were discussing how to implement CAF via shared memory.
If we use MPI, surely the implementation of MPI_Barrier should itself issue
any necessary memory fences (if it uses shared memory internally), so I don't
think __sync_synchronize() would be necessary? And, as Richi already
mentioned, the function call itself is an implicit compiler memory barrier for
all variables that might be accessed by the callee, which implies that any
such variables must be flushed to memory before the call and reloaded if they
are read after the call returns. So, in this case I don't think there is
anything to worry about.

FWIW, the optimization implications of the C++0x (and C1X) memory model I
referred to earlier mostly concern avoiding compiler-introduced data races for
potentially shared variables. That is, if the compiler cannot prove that a
variable is accessible only from the current thread (e.g. non-escaped stack
variables), some optimizations are forbidden. But to the extent that all
cross-image access happens via calls to MPI procedures, I don't think this
will affect the CAF implementation. However, if we ever make a shared-memory
CAF backend, then it might matter (and, one hopes, by that time the GCC C++
memory model work will be further along).

> * For ASYNCHRONOUS, one mostly does not need to do anything. Except that for
> the asynchronous version of the transfer function belonging to READ and
> WRITE, the data argument needs to be marked as escaping in the "fn spec"
> attribute. Similarly, for ASYNCHRONOUS dummy arguments, the "fn spec" must
> be such that the compiler knows that the address could be escaping. (I don't
> think there is currently a way to mark via "fn spec" a variable as escaping
> but only used for reading the value - or to restrict the scope of the
> escaping.)

Wrt AIO, as you know I have worked on an implementation now and then, though
by now it's probably more than half a year since I last looked into it. But,
to be honest, the more I think about it the less convinced I am of its
usefulness. As none of the alternatives is really satisfactory (at least if
one wants some semblance of portability), the implementation is sort of a
lowest common denominator, based on the thread pool and work queue pattern
(see the sketch in the P.S. below). That means there is overhead due to the
threads (context switching, locking, and so on).

OTOH, plain blocking I/O is in some sense "asynchronous" as well: a write()
is essentially a memcpy() from user space into the kernel page cache, and the
kernel takes care of writing the data out to permanent storage after write()
returns. For read() the situation is somewhat similar, except that the data
may not be found in the page cache and must then be read from disk. However,
if the reads are sequential the kernel will notice and prefetch data into the
page cache. So that basically leaves uncached random reads as the use case
for AIO. How common is that kind of performance bottleneck in Fortran
applications? It might be different if Fortran were widely used for
implementing event-based network servers, but it isn't.

This doesn't mean that the frontend support for ASYNCHRONOUS is useless;
MPI_Isend/MPI_Irecv are certainly a common existing use case.

--
Janne Blomqvist
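
P.S. Since the thread pool + work queue pattern came up above, here is
roughly the skeleton I mean. All names (aio_job, enqueue, worker, ...) are
made up for illustration; this is not the actual patch, just its general
shape.

/* Minimal thread-pool / work-queue skeleton: enqueue() is called by the
   thread running the Fortran program, worker() runs in the pool threads.  */
#include <pthread.h>
#include <stdlib.h>

typedef struct aio_job
{
  void (*fn) (void *);   /* e.g. the async variant of a transfer function */
  void *data;            /* its argument (unit, buffer, ...) */
  struct aio_job *next;
} aio_job;

static aio_job *queue_head, *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

/* Append a job to the queue and wake up one worker.  */
static void
enqueue (aio_job *job)
{
  pthread_mutex_lock (&queue_lock);
  job->next = NULL;
  if (queue_tail)
    queue_tail->next = job;
  else
    queue_head = job;
  queue_tail = job;
  pthread_cond_signal (&queue_cond);
  pthread_mutex_unlock (&queue_lock);
}

/* Worker thread: sleep until work is available, then run it.  The locking
   and context switches here are the overhead mentioned above.  */
static void *
worker (void *arg)
{
  (void) arg;
  for (;;)
    {
      pthread_mutex_lock (&queue_lock);
      while (queue_head == NULL)
        pthread_cond_wait (&queue_cond, &queue_lock);
      aio_job *job = queue_head;
      queue_head = job->next;
      if (queue_head == NULL)
        queue_tail = NULL;
      pthread_mutex_unlock (&queue_lock);

      job->fn (job->data);
      free (job);
    }
  return NULL;
}

/* Start the pool; presumably this would happen lazily on the first
   asynchronous OPEN or data transfer.  */
static void
start_pool (int nthreads)
{
  for (int i = 0; i < nthreads; i++)
    {
      pthread_t tid;
      pthread_create (&tid, NULL, worker, NULL);
    }
}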