On Tue, 9 Feb 2021 16:37:31 +0100
Thomas Schwinge <tho...@codesourcery.com> wrote:

> Hi!
> 
> On 2021-02-09T13:05:22+0000, Julian Brown <jul...@codesourcery.com>
> wrote:
> > On Tue, 9 Feb 2021 13:45:36 +0100
> > Tobias Burnus <tob...@codesourcery.com> wrote:
> >  
> >> On 09.02.21 12:58, Thomas Schwinge wrote:  
> >> >> Granted. The array(:)%re access might update too much, but
> >> >> that's not different to array with strides or with contiguous
> >> >> arrays sections which contain component reference (and more
> >> >> than one component).    
> >> > (Is that indeed allowed to "update too much"?)    
> >> 
> >> Yes - that's the general problem with strides or bit sets;
> >> copying only a subset – and doing so atomically – is not
> >> always possible or feasible.
> >> 
> >> *OpenACC* 3.1 has for "2.14.4  Update Directive" the restriction:
> >> 
> >> "Noncontiguous subarrays may appear. It is implementation-specific
> >>   whether noncontiguous regions are updated by using one transfer
> >>   for each contiguous subregion, or whether the noncontiguous data
> >>   is packed, transferred once, and unpacked, or whether one or more
> >>   larger subarrays (no larger than the smallest contiguous region
> >>   that contains the specified subarray) are updated."
> >> 
> >> For map, I saw that that's the case – but I think Julian's
> >> patch does not handle this correctly for:
> >> 
> >> type t
> >>    integer :: i, j, k
> >> end type t
> >> type(t) :: A(100)
> >>    ... host(A(:)%j)  
> 
> So I understand the variants in the quoted OpenACC part as follows.
> However I don't claim that I necessarily got all the fine detail
> right at the language level (English as well as OpenACC)!  So, please
> verify.
> 
> "Using one transfer for each contiguous subregion":
> 
>     for n in 1..100: transfer A(n)%j individually
> 
> "Noncontiguous data is packed, transferred once, and unpacked":
> 
>     device:
>       integer :: tmp(100)
>       for n in 1..100: tmp(n) = A(n)%j
>     transfer tmp
>     host:
>       for n in 1..100: A(n)%j = tmp(n)
> 
> In this example here, I understand "subarrays (no larger than the
> smallest contiguous region that contains the specified subarray)" to
> again resolve to 'A(n)%j', so doesn't add other another variant.
> 
> This -- per my reading! -- would be different here:
> 
>     type t
>        integer :: i, j1, j2, k
>     end type t
>     type(t) :: A(100)
>        ... host(A(:)%j1, A(:)%j2)
> 
> ... where I understand this to mean that each 'A(n)%j1' plus 'A(:)%j2'
> may be transferred together: either "using one transfer for each
> contiguous subregion":

I think you're overthinking it. My reading is that "each contiguous
subregion" just means, e.g. each "j" in:

  !$acc update host(A(:)%j)

although your case with multiple updates to adjacent fields would be
possible too, I guess.

>     for n in 1..100: transfer memory region of A(n)%j1..A(n)%j2
> individually
> 
> ..., or "packed, transferred once, and unpacked":
> 
>     device:
>       integer :: tmp(2 * 100)
>       for n in 1..100:
>         tmp(2 * n) = A(n)%j1
>         tmp(2 * n + 1) = A(n)%j2
>     transfer tmp
>     host:
>       for n in 1..100:
>         A(n)%j1 = tmp(2 * n)
>         A(n)%j2 = tmp(2 * n + 1)

Yes, fine, but...

> I do however not read into this that the following would be valid:
> 
> >> I think instead of transferring A(1)%j to A(100)%j, it transfers
> >> all of A(:), i.e. also A(1)%i and A(100)%k :-(  
> 
> I don't think it's appropriate for an 'update' to alter anything else
> than the exact 'var's as specified.

I disagree here. I think the "one or more larger subarrays" clause is
meant to indicate that a single transfer covering all the data (with
gaps) is acceptable. In terms of implementation -- much of the time
that will likely be more efficient than either transferring lots of
tiny fragments, or copying via an intermediate contiguous buffer, so
your reading of the spec might be shooting ourselves in the foot
somewhat.

One could imagine a heuristic that chooses between one large ("gappy")
transfer and, say, copying via a temporary buffer though, using the
latter if say the density of data transferred is less than some
percentage of the full amount.

> > Yes it will -- but given that A(2)%i and A(99)%k (and all the
> > in-between values) can legitimately be transferred according to the
> > spec  
> 
> I don't read it that way, I'm afraid.  :-O
> 
> > how much
> > of a problem is that? In particular, are there situations where this
> > "over-updating" can lead to incorrect results in a conforming
> > program?  
> 
> In your reading indeed it wouldn't, because the user couldn't expect
> the following:
> 
> > Perhaps the question is, can a user legitimately expect the host and
> > offloaded versions of some given memory block to hold different
> > data, like maintaining different data in a cache than the storage
> > backing that cache? One use-case for that might be double buffering
> > a "single array" (i.e. the host and device versions of that array).
> > I don't think that's something we'd want to encourage, though.  
> 
> I find the wording in the spec rather explicitly, for example: OpenACC
> 3.1, 2.7 "Data Clauses":
> 
> | In all cases, the compiler will allocate and manage a copy of the
> 'var' | in the memory of the current device, creating a *visible
> device copy* | of that 'var', for data not in shared memory.
> 
> Emphasis mine; and I indeed understand this to mean that the user can
> "legitimately expect the host and offloaded versions of some given
> memory block to hold different data, [...]" (your words from above).

I think that's just describing that the copy of the data on the device
is distinct (for pragmatic reasons), not that the user should rely on
it being a separate copy of the data for algorithmic reasons. To do so
would mean a difference in semantics between a program with OpenACC
directives enabled and without, which is I think against the spirit of
the API.

Thanks,

Julian

Reply via email to