Hi! I've filed <https://gcc.gnu.org/PR114690> "OpenMP 'indirect' clause: dynamic image loading/unloading" for the following issue:
On 2023-11-13T12:47:04+0100, Tobias Burnus <tob...@codesourcery.com> wrote: > On 13.11.23 11:59, Thomas Schwinge wrote: >>>> Also, for my understanding: why is 'build_indirect_map' done at kernel >>>> invocation time (here) instead of at image load time? >>> The splay_tree is generated on the device itself - and we currently do >>> not start a kernel during GOMP_OFFLOAD_load_image. We could, the >>> question is whether it makes sense. (Generating the splay_tree on the >>> host for the device is a hassle and error prone as it needs to use >>> device pointers at the end.) >> Hmm. It seems conceptually cleaner to me to set this up upfront, and >> avoids potentially slowing down every device kernel invocation (at least >> another function call, and 'gomp_mutex_lock' check). Though, I agree >> this may be "in the noise" with regards to all the other stuff going on >> in 'gomp_gcn_enter_kernel' and elsewhere... > > I think the most common case is GOMP_INDIRECT_ADDR_MAP == NULL. > > The question is whether the lock should/could be moved inside if > (!indirect_array) > or not. Probably yes: > * doing an atomic load for the outer '!indirect array', work on a local array > for > the build up and only assign it at the end - and just after the lock check > again > whether '!indirect array'. > > That way, it is lock free once build but when build there is no race. > >> What I just realize, what's also unclear to me is how the current >> implementation works with regards to several images getting loaded -- >> don't we then overwrite 'GOMP_INDIRECT_ADDR_MAP' instead of >> (conceptually) appending to it? > > Yes, I think that will happen - but it looks as if the same issue exists > also the other code? I think that's not the first variable that has that > issue? > > I think we should try to cleanup that handling, also to support calling > a device function in a shared library from a target region in the main > program, which currently also fails. > > All device routines that are in normal static libraries and in the > object files of the main program should simply work thanks to offload > LTO such that there is only a single GOMP_offload_register_ver call (per > device type) and GOMP_OFFLOAD_load_image call (per device). > > Likewise if the offloading is only done via a single shared library. — > Any mixing will currently fail, unfortunately. This patch just adds > another item which does not handle it properly. > > (Not good but IMHO also not a showstopper for this patch.) > >> In the general case, additional images may also get loaded during >> execution. We thus need proper locking of the shared data structure, uh? >> Or, can we have separate on-device data structures per image? (I've not >> yet thought about that in detail.) > > I think we could - but in the main-program 'omp target' case that calls > a shared-library 'declare target' function means that we need to handle > multiple GOMP_offload_register_ver / load_image calls such that they can > work together. > > Obviously, it gets harder if the user keeps doing dlopen() / dlclose() > of libraries containing offload code where a target/compute region is > run before, between, and after those calls (but hopefully not running > when calling dlopen/dlclose). > >> Relatedly then, when images are unloaded, we also need to remove stale >> items from the table, and release resources (for example, the >> 'GOMP_OFFLOAD_alloc' for 'map_target_addr'). > > True. I think the general assumption is that images only get unloaded at > the very end, which matches most but not all code. Yet another work item. > > I think we should open a new PR about this topic and collect work items > there. Grüße Thomas