Hi!

I've filed <https://gcc.gnu.org/PR114690>
"OpenMP 'indirect' clause: dynamic image loading/unloading" for the
following issue:

On 2023-11-13T12:47:04+0100, Tobias Burnus <tob...@codesourcery.com> wrote:
> On 13.11.23 11:59, Thomas Schwinge wrote:
>>>> Also, for my understanding: why is 'build_indirect_map' done at kernel
>>>> invocation time (here) instead of at image load time?
>>> The splay_tree is generated on the device itself - and we currently do
>>> not start a kernel during GOMP_OFFLOAD_load_image. We could, the
>>> question is whether it makes sense. (Generating the splay_tree on the
>>> host for the device is a hassle and error prone as it needs to use
>>> device pointers at the end.)
>> Hmm.  It seems conceptually cleaner to me to set this up upfront, and
>> avoids potentially slowing down every device kernel invocation (at least
>> another function call, and 'gomp_mutex_lock' check).  Though, I agree
>> this may be "in the noise" with regards to all the other stuff going on
>> in 'gomp_gcn_enter_kernel' and elsewhere...
>
> I think the most common case is GOMP_INDIRECT_ADDR_MAP == NULL.
>
> The question is whether the lock should/could be moved inside  if 
> (!indirect_array)
> or not. Probably yes:
> * doing an atomic load for the outer '!indirect array', work on a local array 
> for
> the build up and only assign it at the end - and just after the lock check 
> again
> whether '!indirect array'.
>
> That way, it is lock free once build but when build there is no race.
>
>> What I just realize, what's also unclear to me is how the current
>> implementation works with regards to several images getting loaded --
>> don't we then overwrite 'GOMP_INDIRECT_ADDR_MAP' instead of
>> (conceptually) appending to it?
>
> Yes, I think that will happen - but it looks as if the same issue exists
> also the other code? I think that's not the first variable that has that
> issue?
>
> I think we should try to cleanup that handling, also to support calling
> a device function in a shared library from a target region in the main
> program, which currently also fails.
>
> All device routines that are in normal static libraries and in the
> object files of the main program should simply work thanks to offload
> LTO such that there is only a single GOMP_offload_register_ver call (per
> device type) and GOMP_OFFLOAD_load_image call (per device).
>
> Likewise if the offloading is only done via a single shared library. —
> Any mixing will currently fail, unfortunately. This patch just adds
> another item which does not handle it properly.
>
> (Not good but IMHO also not a showstopper for this patch.)
>
>> In the general case, additional images may also get loaded during
>> execution.  We thus need proper locking of the shared data structure, uh?
>> Or, can we have separate on-device data structures per image?  (I've not
>> yet thought about that in detail.)
>
> I think we could - but in the main-program 'omp target' case that calls
> a shared-library 'declare target' function means that we need to handle
> multiple GOMP_offload_register_ver / load_image calls such that they can
> work together.
>
> Obviously, it gets harder if the user keeps doing dlopen() / dlclose()
> of libraries containing offload code where a target/compute region is
> run before, between, and after those calls (but hopefully not running
> when calling dlopen/dlclose).
>
>> Relatedly then, when images are unloaded, we also need to remove stale
>> items from the table, and release resources (for example, the
>> 'GOMP_OFFLOAD_alloc' for 'map_target_addr').
>
> True. I think the general assumption is that images only get unloaded at
> the very end, which matches most but not all code. Yet another work item.
>
> I think we should open a new PR about this topic and collect work items
> there.


Grüße
 Thomas

Reply via email to