On Tue, Oct 20, 2015 at 01:17:45PM +0200, Thomas Schwinge wrote:
> Always creating (dummy) GOMP_offload_register_ver constructors has been
> another suggestion that I had voiced much earlier in this thread (months
> ago), but everyone (including me) taking part in the discussion agreed
> that it'd cause even higher compile-time overhead.

I'd prefer to just set a flag like "force creation of the GOMP offloading
sections" whenever you see one of the APIs or constructs used in the TU,
and if that flag is set, even when there are no offloaded vars or
functions/kernels, force creation of the corresponding data sections.
Either it can be standard offloading LTO sections, just not containing
anything, or, if you want to improve compile-time, it could be special too,
so that the linker plugin can quickly identify those that only need
offloading support, but don't have any offloaded vars or code.
But that can certainly be done as an incremental optimization.

For OpenMP that would be whenever a
#pragma omp target{, data, enter data, exit data} construct is seen
(e.g. during gimplification or, even better, OMP region nesting checking),
or for
omp_set_default_device
omp_get_default_device
omp_get_num_devices
omp_is_initial_device
omp_get_initial_device
omp_target_alloc
omp_target_free
omp_target_is_present
omp_target_memcpy
omp_target_memcpy_rect
omp_target_associate_ptr
omp_target_disassociate_ptr
calls.  Guess for OpenACC you have a similar set of calls.

The thing is, while the OpenACC standard is pretty much solely about
offloading, OpenMP is not, and in many cases programs just use host OpenMP
parallelization (at least right now, I bet such programs are a significantly
larger set than programs that use OpenACC or OpenMP offloading together).
Distributions and others will eventually configure the compilers they are
shipping to enable the offloading, and if that forces a constructor into
every TU or even every shared library just because it has been compiled with
-fopenmp, it is unacceptable overhead.

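The flag idea above can be sketched in a few lines of C.  This is only an
illustration under stated assumptions, not GCC code: the names
force_offload_sections, note_call, note_target_construct and
need_offload_sections are hypothetical, and where a real compiler would set
such a flag (gimplification, region nesting checks, ...) is left open.  Only
the list of omp_* entry points is taken from the mail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Runtime API entry points the mail lists as implying offloading support.  */
static const char *const offload_api_names[] = {
  "omp_set_default_device", "omp_get_default_device", "omp_get_num_devices",
  "omp_is_initial_device", "omp_get_initial_device", "omp_target_alloc",
  "omp_target_free", "omp_target_is_present", "omp_target_memcpy",
  "omp_target_memcpy_rect", "omp_target_associate_ptr",
  "omp_target_disassociate_ptr",
};

/* Hypothetical per-TU flag: set once, never cleared.  */
static bool force_offload_sections;

/* Return true if NAME is one of the offloading-related API calls.  */
static bool
is_offload_api_name (const char *name)
{
  size_t n = sizeof offload_api_names / sizeof offload_api_names[0];
  for (size_t i = 0; i < n; i++)
    if (strcmp (name, offload_api_names[i]) == 0)
      return true;
  return false;
}

/* Hypothetical hook: called for every call expression seen in the TU.  */
static void
note_call (const char *callee_name)
{
  if (is_offload_api_name (callee_name))
    force_offload_sections = true;
}

/* Hypothetical hook: called when a target{, data, enter data, exit data}
   construct is seen.  */
static void
note_target_construct (void)
{
  force_offload_sections = true;
}

/* Emit the offloading sections if there is anything to offload, or if the
   TU merely uses the offloading APIs/constructs.  */
static bool
need_offload_sections (size_t num_offloaded_syms)
{
  return num_offloaded_syms > 0 || force_offload_sections;
}
```

With this shape, a TU that only uses host parallelization (say, it calls
omp_get_num_threads) never sets the flag and pays nothing, while a TU that
calls omp_target_alloc gets the (possibly empty) sections.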
For the vendor shipped binary compilers, I'm envisioning the ideal would be
to be able to configure gcc for many offloading targets, then build such a
main compiler and offloading target compilers, but package them separately
(one package (or set of packages) for the base compiler, and then another
package (or set of them) for each offloading target).  What the -foffload=
actually will be in the end from the linked shared library or binary POV
would depend both on the configured offloading target, but also on whether
the mkoffload binaries are found (or whatever else is needed first from the
offloading target).  That would mean that we'd not issue a hard error or any
kind of diagnostics if mkoffload is missing.  Is that acceptable, or should
that e.g. be limited just to the compiled in configure default (i.e.
explicit -foffload= would error if the requested mkoffload is missing,
default -foffload= would silently skip unavailable ones; I guess this would
be my preference), or should we have two ways of configuring the offloading
targets, as hard requirements and as optional support?

> So, how to resolve our different opinions?  I mean, for any serious
> program code, there will be constructor calls into libgomp already; are
> you expecting that adding one more really will cause any noticeable
> overhead?

See above, that is really not the case.  Most OpenMP code doesn't have
any constructor calls into libgomp at all, the only exception is
GOMP_offload_register{,_ver} at this point.

> > What is HWM?  Is that OFFLOAD_TARGET_TYPE_LAST what you mean?
>
> Nathan has used this term before (libgomp/openacc.h:acc_device_t), and he
> told me this means "High Water Mark".  I have no strong opinion on the
> name to use, just want to mention that "*_LAST" sounds to me like that
> one still is part of the accepted set, whereas in this case it'd be the
> first enumerator outside of the accepted ones.
> (And I guess, we agree
> that "OFFLOAD_TARGET_TYPE_INTEL_LAST = 6" followed by
> "OFFLOAD_TARGET_TYPE_INTEL_MIC = OFFLOAD_TARGET_TYPE_INTEL_LAST" is
> ugly?)

*_LAST or *_last is actually what we use pretty much everywhere,
see e.g. lots of places in tree-core.h.

> Are you worried about the performance issues of a very short locking
> cycle that in the majority of all cases should happen without blocking,
> in comparison to performance issues related to host/device memory
> transfers or kernel launches that will follow after the call to
> gomp_offload_target_enabled_p?  I don't really think that is reasonable
> to worry about.

Yes, I'm worried about that.  The lock could be contended, and if you take
the lock many times for each construct, it can show up, I'm worried about
cache effects etc.  It is already bad enough that we take/release the locks
for the same device e.g. in each of:

  void *fn_addr = gomp_get_target_fn_addr (devicep, fn);

  struct target_mem_desc *tgt_vars
    = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
                     GOMP_MAP_VARS_TARGET);

	Jakub
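
The locking concern above can be illustrated with a small C sketch.  It is
not libgomp code: struct device, the helper names and the acquisition counter
are all hypothetical and exist only to make the number of lock round trips
per construct visible.  The "single lock" variant is one possible way to
reduce them, not something the mail proposes, and it trades fewer
acquisitions for a longer critical section.

```c
#include <pthread.h>

/* Hypothetical stand-in for a device descriptor with its own lock.  */
struct device
{
  pthread_mutex_t lock;
  unsigned long lock_acquisitions; /* instrumentation for this sketch only */
  int fn_looked_up;
  int vars_mapped;
};

static void
device_lock (struct device *d)
{
  pthread_mutex_lock (&d->lock);
  d->lock_acquisitions++;
}

static void
device_unlock (struct device *d)
{
  pthread_mutex_unlock (&d->lock);
}

/* Pattern the mail objects to: every helper takes the lock itself.  */
static void
lookup_fn_locking (struct device *d)
{
  device_lock (d);
  d->fn_looked_up = 1;
  device_unlock (d);
}

static void
map_vars_locking (struct device *d)
{
  device_lock (d);
  d->vars_mapped = 1;
  device_unlock (d);
}

/* One construct, two lock/unlock cycles on the same device.  */
static void
launch_per_helper (struct device *d)
{
  lookup_fn_locking (d);
  map_vars_locking (d);
}

/* Alternative: helpers assume the caller already holds the lock.  */
static void
lookup_fn_locked (struct device *d)
{
  d->fn_looked_up = 1;
}

static void
map_vars_locked (struct device *d)
{
  d->vars_mapped = 1;
}

/* One construct, one lock/unlock cycle.  */
static void
launch_single_lock (struct device *d)
{
  device_lock (d);
  lookup_fn_locked (d);
  map_vars_locked (d);
  device_unlock (d);
}
```

Adding one more locked query per construct (such as an "is offloading
enabled" check) would raise the per-helper count further, which is the
overhead being argued about.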