On Fri, Apr 15, 2011 at 15:04, Tobias Burnus <bur...@net-b.de> wrote:
> On 04/15/2011 11:52 AM, Janne Blomqvist wrote:
>>
>> Q1: Is __sync_synchronize() sufficient?
>> I don't think this is correct. __sync_synchronize() just issues a
>> hardware memory fence instruction. That is, it prevents loads and
>> stores from moving past the fence *on the processor that executes the
>> fence instruction*. There is no synchronization with other
>> processors.
>
> Well, I was thinking of (a) assumptions regarding the value for the compiler
> when doing optimizations. And (b) making sure that the variables are really
> loaded from memory and do not remain in a register. -- How the data ends up
> in memory is a different question; for the current library version, SYNC ALL
> would be a __sync_synchronize() followed by a (wrapped) call to MPI_Barrier
> - and possibly some additional actions.
>
>>> Q2: Can this be optimized in some way?
>>
>> Probably not. For general issues with the shared-memory model, perhaps
>> shared memory Co-arrays can piggyback on the work being done for the
>> C++0x memory model, see
>
> I think you try to solve a different problem than I want.
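To make sure we are talking about the same operation: I read the library-based
SYNC ALL you describe as roughly the sketch below. caf_sync_all and the stat
handling are made-up names, not the actual libcaf interface; only the
fence-plus-barrier sequence is the point.

/* Hypothetical shape of SYNC ALL in the MPI-based library version: a
   full memory barrier followed by a (wrapped) MPI_Barrier.  Purely an
   illustration.  */
#include <mpi.h>

void
caf_sync_all (int *stat)
{
  /* Compiler/hardware fence: loads and stores cannot be moved across
     this point on the image executing it.  */
  __sync_synchronize ();

  /* Synchronization with the other images happens in the barrier;
     MPI_Barrier itself has to issue whatever fences its transport
     needs.  */
  int ierr = MPI_Barrier (MPI_COMM_WORLD);

  if (stat)
    *stat = (ierr == MPI_SUCCESS ? 0 : 1);
}
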
Indeed, I assumed you were discussing how to implement CAF via shared memory.
If we use MPI, surely the implementation of MPI_Barrier should itself issue
any necessary memory fences (if it uses shared memory internally), so I don't
think __sync_synchronize() would be necessary? And, as Richi already
mentioned, the function call itself is an implicit compiler memory barrier for
all variables that might be accessed by the callee, which implies that any
such variables must be flushed to memory before the call and reloaded if they
are read after the call returns. So, in this case I don't think there is
anything to worry about.

FWIW, the optimization implications of the C++0x (and C1X) memory model I
referred to earlier mostly concern avoiding compiler-introduced data races for
potentially shared variables. That is, if the compiler cannot prove that a
variable is accessible only from the current thread (e.g. non-escaped stack
variables), some optimizations are forbidden. But to the extent that all
cross-image access happens via calls to MPI procedures, I don't think this
will affect the CAF implementation. However, if we ever make a shared-memory
CAF backend, then it might matter (and, one hopes, by that time the GCC C++
memory model work will be further along).

> * For ASYNCHRONOUS, one mostly does not need to do anything. Except that for
> the asynchronous version of the transfer function belonging to READ and
> WRITE, the data argument needs to be marked as escaping in the "fn spec"
> attribute. Similarly, for ASYNCHRONOUS dummy arguments, the "fn spec" must
> be such that the compiler knows that the address could be escaping. (I don't
> think there is currently a way to mark via "fn spec" a variable as escaping
> but only used for reading the value - or to restrict the scope of the
> escaping.)

Wrt AIO, as you know I have worked on an implementation now and then, though
by now it's probably more than half a year since I last looked into it. But,
to be honest, the more I think about it the less convinced I am of its
usefulness. As none of the alternatives is really satisfactory (at least if
one wants some semblance of portability), the implementation is sort of a
lowest common denominator, based on the thread pool and work queue pattern
(see the sketch in the P.S. below). That means there is overhead due to the
threads (context switching, locking, and so on).

OTOH, plain blocking I/O is in some sense "asynchronous" as well: a write()
is essentially a memcpy() from user space into the kernel page cache, and the
kernel takes care of writing the data out to permanent storage after write()
returns. For read() the situation is somewhat similar, except that the data
may not be found in the page cache and must then be read from disk. However,
if the reads are sequential the kernel will notice and prefetch data into the
page cache. So that basically leaves uncached random reads as the use case
for AIO. How common is that kind of performance bottleneck in Fortran
applications? It might be different if Fortran were widely used for
implementing event-based network servers, but it isn't.

This doesn't mean that the frontend support for ASYNCHRONOUS is useless;
MPI_Isend/MPI_Irecv are certainly a common existing use case.

--
Janne Blomqvist
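
P.S. Since the thread pool + work queue pattern came up above, here is
roughly the skeleton I mean. All names (aio_job, enqueue, worker, ...) are
made up for illustration; this is not the actual patch, just its general
shape.

/* Minimal thread-pool / work-queue skeleton: enqueue() is called by the
   thread running the Fortran program, worker() runs in the pool threads.  */
#include <pthread.h>
#include <stdlib.h>

typedef struct aio_job
{
  void (*fn) (void *);   /* e.g. the async variant of a transfer function */
  void *data;            /* its argument (unit, buffer, ...) */
  struct aio_job *next;
} aio_job;

static aio_job *queue_head, *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

/* Append a job to the queue and wake up one worker.  */
static void
enqueue (aio_job *job)
{
  pthread_mutex_lock (&queue_lock);
  job->next = NULL;
  if (queue_tail)
    queue_tail->next = job;
  else
    queue_head = job;
  queue_tail = job;
  pthread_cond_signal (&queue_cond);
  pthread_mutex_unlock (&queue_lock);
}

/* Worker thread: sleep until work is available, then run it.  The locking
   and context switches here are the overhead mentioned above.  */
static void *
worker (void *arg)
{
  (void) arg;
  for (;;)
    {
      pthread_mutex_lock (&queue_lock);
      while (queue_head == NULL)
        pthread_cond_wait (&queue_cond, &queue_lock);
      aio_job *job = queue_head;
      queue_head = job->next;
      if (queue_head == NULL)
        queue_tail = NULL;
      pthread_mutex_unlock (&queue_lock);

      job->fn (job->data);
      free (job);
    }
  return NULL;
}

/* Start the pool; presumably this would happen lazily on the first
   asynchronous OPEN or data transfer.  */
static void
start_pool (int nthreads)
{
  for (int i = 0; i < nthreads; i++)
    {
      pthread_t tid;
      pthread_create (&tid, NULL, worker, NULL);
    }
}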