On 12/02/15 08:46, Jakub Jelinek wrote:

> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

There are two cases to consider:

1) The caller (and address taker) is already partitioned, so the callers' frames have already been copied. Each caller takes the address of the object in its own frame.

An example would be calling, say, __muldc3, where the return value location is passed by pointer.
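
To make case 1 concrete, here is a minimal sketch in plain C of a helper that returns its result through a pointer. The names cplx, mul_complex and square are hypothetical and only stand in for a libgcc-style complex multiply; this is not the actual libgcc interface. The point is that the address passed down refers to an automatic variable in the calling thread's own frame, so no other thread's frame is involved.

```c
/* Hypothetical complex type, for illustration only.  */
typedef struct { double re, im; } cplx;

/* Hypothetical helper: the result location is passed by pointer,
   much as a struct return value is passed by invisible reference.  */
static void mul_complex(const cplx *a, const cplx *b, cplx *out)
{
  out->re = a->re * b->re - a->im * b->im;
  out->im = a->re * b->im + a->im * b->re;
}

/* Stands for code run by one already-partitioned thread.
   'result' lives in this thread's own frame, and the address
   taken here is only ever used on behalf of this thread.  */
static cplx square(cplx z)
{
  cplx result;
  mul_complex(&z, &z, &result);
  return result;
}
```

If every partitioned thread runs square, each one hands mul_complex a pointer into its own copy of the frame.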

2) The caller is not partitioned and calls a function containing a partitioned loop. The caller takes the address of its instance of the variable. As part of RTL expansion we have to convert addresses (to be stored in registers) to the generic address space. That conversion creates a pointer that may be used by any thread (in the same CTA)[*]. The function call is executed by all threads (they're partially un-neutered before the call). In the partitioned loop, each thread ends up accessing the location in the frame of the original, active calling thread.

[*] Although .local is private to each thread, it's placed in memory that is reachable from anywhere, provided a generic address is used. Essentially it's like TLS, and genericization simply adds the thread pointer to the local memory offset to create a generic address.
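
The TLS analogy can be modelled on the host in plain C. This is only a sketch under stated assumptions: local_mem, to_generic, partitioned_fill and run_case2 are hypothetical names, the "threads" are simulated by an ordinary loop, and nothing here is the real nvptx backend code. Each thread gets its own window of local storage, and a generic address is formed as that thread's base plus the local offset, after which any thread can dereference it.

```c
#include <stddef.h>

enum { NUM_THREADS = 4, LOCAL_WORDS = 16 };

/* Hypothetical model of .local: one window of words per thread,
   all carved out of memory that is reachable from anywhere.  */
static int local_mem[NUM_THREADS][LOCAL_WORDS];

/* "Genericization": thread base + local offset gives an address
   that is meaningful to every thread, not just the owner.  */
static int *to_generic(int tid, size_t local_off)
{
  return &local_mem[tid][local_off];
}

/* Partitioned loop: the outer loop simulates the threads, each of
   which handles a strided slice through the same generic pointer.  */
static void partitioned_fill(int *generic_ptr, int n)
{
  for (int tid = 0; tid < NUM_THREADS; tid++)
    for (int i = tid; i < n; i += NUM_THREADS)
      generic_ptr[i] = i * 10;
}

/* Case 2: thread 0 is the single active caller.  Its array sits at
   local offset 0 of its own window; the callee's threads all write
   into that one frame.  */
static int run_case2(void)
{
  int *arr = to_generic(0, 0);
  int sum = 0;

  partitioned_fill(arr, 8);
  for (int i = 0; i < 8; i++)
    sum += arr[i];
  return sum;
}
```

In this model all stores land in thread 0's window and the other threads' windows stay untouched, which mirrors each thread accessing the frame of the original calling thread.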

nathan
