On 12/02/15 08:46, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer? See the OpenMP testcase I've posted in this thread.
There are two cases to consider:
1) The caller (and address taker) is already partitioned, so each thread
already has its own copy of the caller's frame. The caller takes the address
of the object in its own frame.
An example would be calling, say, __muldc3, where the return value location is
passed by pointer.
2) The caller is not partitioned and calls a function containing a partitioned
loop. The caller takes the address of its instance of the variable. As part of
RTL expansion we have to convert addresses (to be stored in registers) to
the generic address space. That conversion creates a pointer that may be used
by any thread (in the same CTA)[*]. The function call is executed by all
threads (they're partially un-neutered before the call). In the partitioned
loop, each thread ends up accessing the location in the frame of the thread
that was active when the call was made.
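
To make the two cases concrete, here is a rough sketch in C with OpenACC
directives. It is an illustration only, not a testcase from this thread:
cplx, cmul, square_all, fill and sum_filled are all made-up names, and cmul
merely stands in for a helper that returns its result through a pointer.

```c
/* Hypothetical stand-in for a helper that returns through a pointer.  */
typedef struct { double re, im; } cplx;

#pragma acc routine seq
void cmul(cplx *out, cplx a, cplx b)
{
  out->re = a.re * b.re - a.im * b.im;
  out->im = a.re * b.im + a.im * b.re;
}

/* Case 1: the caller is already partitioned, so each thread takes the
   address of 'tmp' in its own copy of the frame.  */
void square_all(cplx *v, int n)
{
#pragma acc parallel loop copy(v[0:n])
  for (int i = 0; i < n; i++)
    {
      cplx tmp;
      cmul(&tmp, v[i], v[i]);
      v[i] = tmp;
    }
}

/* Case 2: the callee contains the partitioned loop and works on memory
   that belongs to the caller's frame.  */
#pragma acc routine worker
void fill(int *buf, int n)
{
#pragma acc loop worker
  for (int i = 0; i < n; i++)
    buf[i] = 2 * i;
}

int sum_filled(void)
{
  int sum = 0;
#pragma acc parallel num_gangs(1) num_workers(8) copy(sum)
  {
    int local[32];     /* automatic variable in the unpartitioned caller */
    fill(local, 32);   /* all workers store through this one address */
    for (int i = 0; i < 32; i++)
      sum += local[i];
  }
  return sum;
}
```

Built without OpenACC enabled, the pragmas are ignored and both functions run
serially, which is enough to check the intended results; the interesting
behaviour described above only shows up when offloading.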
[*] Although .local is private to each thread, it's placed in memory that is
reachable from anywhere, provided a generic address is used. Essentially it's
like TLS, and genericization simply adds the thread pointer to the local
memory offset to create a generic address.
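
As a toy host-side model of that arithmetic (an assumption-laden sketch, not
how the hardware or the nvptx backend implements it), one can picture every
thread's .local window as a slice of one big array. The names local_mem,
thread_base and genericize, and the sizes, are invented for illustration.

```c
#include <stddef.h>

enum { THREADS = 4, LOCAL_INTS = 16 };

/* Toy backing store: one .local window per thread.  */
int local_mem[THREADS][LOCAL_INTS];

/* The "thread pointer": base of one thread's window.  */
int *thread_base(int tid)
{
  return local_mem[tid];
}

/* Genericization in this model: thread pointer plus local offset.  */
int *genericize(int tid, size_t local_offset)
{
  return thread_base(tid) + local_offset;
}
```

The same local offset yields a different generic address for each thread, and
once an address has been genericized anyone holding it reaches the owning
thread's slot.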
nathan