Re: RFC: Telling the middle end about asynchronous/single-sided memory access (Fortran related)

Richard Guenther Fri, 15 Apr 2011 03:03:08 -0700

On Fri, Apr 15, 2011 at 11:52 AM, Janne Blomqvist
<blomqvist.ja...@gmail.com> wrote:
> On Fri, Apr 15, 2011 at 12:02, Tobias Burnus <bur...@net-b.de> wrote:
>> Dear all,
>>
>> I have question how one can and should tell the middle end about
>> asynchonous/single-sided memory access; the goal is to produce fast but
>> race-free code. All the following is about Fortran 2003 (asynchronous) and
>> Fortran 2008 (coarrays), but the problem itself should occur with all(?)
>> supported languages. It definitely occurs when using C with MPI or using C
>> with the asynchronous I/O functions, though I do not know in how far one
>> currently relys on luck.
>>
>> There are two issues, which need to be solved by informing the middle end -
>> either by variable attributes or by inserted function calls:
>> - Prohibiting some code movements
>> - Making assumptions about the memory content
>>
>> a) ASYNCHRONOUS attribute and asynchronous I/O
>>
>> Fortran allows asynchronous I/O, which means for the programmer that between
>> initiating the asynchronous reading/writing and the finishing read/write,
>> the variable may not be accessed (for READ) or not be changed (for WRITE).
>> The compiler needs to make sure that it does not move code such that this
>> constraint is violated. All variables involved in asynchronous operations
>> are marked as ASYNCHRONOUS.
>>
>> Thus, for asynchronous operations, code movements involving opaque function
>> calls should not happen - but contrary to VOLATILE, there is no need to take
>> the value all time from the memory if it is still in the register.
>>
>> Example:
>>
>>   integer, ASYNCHRONOUS :: async_int
>>
>>   WRITE (unit, ASYNCHRONOUS='yes') async_int
>>   ! ...
>>   WAIT (unit)
>>   a = async_int
>>   do i = 1, 10
>>     b(i) = async_int + 1
>>   end do
>>
>> Here, "a = async_int" may not be moved before the WAIT line. However,
>> contrary to VOLATILE,  one can move the "async_int + 1" before the loop and
>> use the value from the registry in the loop. Note additionally that the
>> initiation of an asynchronous operation (WRITE statement above) is known at
>> compile time; however, it is not known when it ends - the
>> WAIT can be in a different translation unit. See also PR 25829.
>>
>> The Fortran 2008 standard is not very explicit about the ASYNCHRONOUS
>> attribute itself; it simply states that it is for asynchronous I/O.
>> (However, it describes then how async I/O works,
>> including WAIT, INQUIRE, and what a programmer may do until the async I/O is
>> finished.) The closed to an ASYNCHRONOUS definition is the non-normative
>> note 5.4 of Fortran 2008:
>>
>> "The ASYNCHRONOUS attribute specifies the variables that might be associated
>> with a pending input/output storage sequence (the actual memory locations on
>> which asynchronous input/output is being performed) while the scoping unit
>> is in execution. This information could be used by the compiler to disable
>> certain code motion optimizations."
>>
>> Seemingly intended, but not that clear in the F2003/F2008 standard, is to
>> allow for asynchronous user operations; this will presumbly refined in TR
>> 29113 which is currently being drafted - and/or in an interpretation
>> request. The main requestee for this feature is the MPI Forum, which works
>> on MPI3. In any case the following should work analogously and "buf" should
>> not be moved before the "MPI_Wait" line:
>>
>>  CALL MPI_Irecv(buf, rq)
>>  CALL MPI_Wait(rq)
>>  xnew=buf
>>
>> Hereby, "buf" and (maybe?) the first dummy argument of MPI_Irecv have the
>> ASYNCHRONOUS attribute.
>>
>> My question is now: How to properly tell this the middle end?
>>
>> VOLATILE seems to be wrong as it prevents way too many optimizations and I
>> think it does not completely prevent code moving. Using a call to some
>> built-in function does not work as in principle the end of an asynchronous
>> operation is not known. It could end with a WAIT - possibly also wrapped in
>> a function, which is in a different translation unit - or also with an
>> INQUIRE(..., PENDING=aio_pending) if "aio_pending" gets assigned a .false.
>>
>> (Frankly, I am not 100% sure about the exact semantics of ASYNCHRONOUS; I
>> think might be implemented by preventing all code movements which involve
>> swapping an ASYNCHRONOUS variable with a function call, which is not pure.
>> Otherwise, in terms of the variable value, it acts like a normal variable,
>> i.e. if one does: "a = 7" and does not set "a" afterwards (assignment or via
>> function calls), it remains 7. The changing of the variable is explicit -
>> even if it only becomes effective with some delay.)
>
> A compiler memory barrier in GCC is
>
> asm volatile("" ::: "memory");
>
> That being said, does the middle end really move loads and stores past
> function calls? If not, a call is effectively also a compiler memory
> barrier, no?


Yes, a function call is also a compiler memory barrier for all memory
that the function can possibly access.  There is currently no easy way
to restrict the barrierness to a certain kind of memory (like memory
marked with ASYNCHRONOUS), but it will simply act as a barrier
for all escaped memory (and variables for I/O escape as they are
passed by reference to the I/O functions).

>> B) COARRAYS
>>
>> The memory model of coarrays is that all memory is private to the image -
>> except for coarrays. Coarrays exists on all images. For "integer ::
>> coarray(:)[*]", local accesses are "coarray = ..." or "coarray(4) = ..."
>> while remote accesses are "coarray(:)[7] = ..." or "a = coarray(3)[2]",
>> where the data is set on image 7 or pulled from image 2.
>>
>> Let's start directly with an example:
>>
>>   module m
>>     integer, save :: caf_int[*]  ! Global variable
>>   end module m
>>
>>   subroutine foo()
>>     use m
>>     caf_int = 7  ! Set local variable to 7 (effectively: on image 1 only)
>>     SYNC ALL ! Memory barrier/fence
>>     SYNC ALL
>>     ! caf_int should now be 8, cf. below; thus the following if shall
>>     ! neither optimized way not be executed at run time.
>>     if (caf_int == 7) call abort()
>>   end subroutine foo
>>
>>   subroutine bar()
>>     use m
>>     SYNC ALL
>>     caf_int[1] = 8 ! Set variable on image 1 to 8
>>     SYNC ALL
>>   end subroutine bar
>>
>>   program caf_example
>>     if (this_image() == 1) CALL foo()
>>     if (this_image() == 2) CALL bar()
>>   end program caf_example
>>
>> Notes:
>> - The coarray "caf_int" will be registered in the communication library at
>> startup of the main program.
>> - For image 1 one always accesses "caf_int" in local memory. The variable
>> also does not alias with anything - except that the value might change via
>> single-sided communication.
>>
>> Thus: SYNC ALL acts as memory fence - for coarrays only. In principle, all
>> other variables might be moved across the fence. Besides preventing code
>> moves, the value of the variable cannot be assumed to be the same as before
>> the fence. I think a simple call to "__sync_synchronize()" (alias
>> BUILT_IN_SYNCHRONIZE) should take care of this, but I want to
>> confirm that it indeed does so. I assume I can still keep all coarrays as
>> restricted pointers - even though there is single-sided communication.
>>
>> Q1: Is __sync_synchronize() sufficient?
>
> I don't think this is correct. __sync_synchronize() just issues a
> hardware memory fence instruction. That is, it prevents loads and
> stores from moving past the fence *on the processor that executes the
> fence instruction*. There is no synchronization with other
> processors. SYNC ALL, OTOH is a full barrier; the implementation must
> be something like a counter that every co-image increments atomically,
> and then waits (blocking or spinning) until the counter equals the
> number of co-images.

Indeed, it looks like you need something heavier, like a mutex.  You
can probably piggy-back on the OpenMP support for barriers.

>> Q2: Can this be optimized in some way?

For simple types you could use atomic instructions for the modification
itself instead of two SYNC ALL calls.

> Probably not. For general issues with the shared-memory model, perhaps
> shared memory Co-arrays can piggyback on the work being done for the
> C++0x memory model, see
>
> http://gcc.gnu.org/wiki/Atomic/GCCMM
>
> Hans Boehm maintains a collection of links to various papers on the topic at
>
> http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/
>
>
> --
> Janne Blomqvist
>

Re: RFC: Telling the middle end about asynchronous/single-sided memory access (Fortran related)

Reply via email to