Josef Melcr wrote:
DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
- ATTR_NOTHROW_LIST)
+ ATTR_CALLBACK_OACC_LIST)
Thus, I wonder whether this should be skipped - and handled
in lock step with the OpenMP/omp target support.
Oh, I totally missed that, thank you. Since the kernel has noclone,
the extra edges shouldn't really disrupt it, but I agree it should be
handled at once. Sorry about that. Should I exclude it and/or resend?
For me, just excluding it is enough. (You might want to send an email
once you have committed this patch, and you could attach the final
commit to that email.)
* * *
From the other email thread:
The propagation is not considered profitable enough
OK, I missed that. I guess it will work for real-world code (and
possibly some tuning will make it work also without).
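
For illustration, here is a minimal sketch of the kind of code where
propagating a constant into the outlined region would matter. The
function name sum_upto and the values are made up for this example. As
far as I understand, with -fopenmp the region body is outlined into a
child function that is handed to a libgomp entry point, so the caller's
constant reaches the body only if the optimizer follows the callback
edge. Without -fopenmp the pragma is ignored and the loop simply runs on
the host.

```c
/* Hypothetical example, not taken from the patch or its testsuite.  */
int
sum_upto (int n)
{
  int sum = 0;

  /* With -fopenmp, this loop becomes the body of an outlined child
     function; 'n' is implicitly firstprivate and 'sum' is mapped back
     to the host.  */
#pragma omp target map(tofrom: sum)
  for (int i = 1; i <= n; i++)
    sum += i;

  return sum;
}
```

If every caller passes the same constant, e.g. sum_upto (10), that value
could in principle be propagated into the outlined body, subject to the
profitability heuristics mentioned above.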
* * *
could we bring the device code into LTO to optimize it further?
If you are talking about optimizations on the whole program: the host
LTO is already run with the to-be-offloaded functions, but the
device-function stream-out happens rather early.
If you are talking about device-side LTO: this requires some
reorganization of how it is handled and compiling the device-side
libraries as thick libraries (non-LTO and LTO code). Additionally, it
requires that the device-side linker supports the linker plugin. There
are fairly concrete plans on our side (BayLibre) to get this working (at
least for AMD GPUs/gcn), as mentioned during the offloading BoF, but it
will take a while.
Otherwise, the device side already sees all offload functions, as all
(host) TUs with offload LTO data are fed into the same device-side lto1
compiler; "just" the libraries (libgfortran, libgomp, libstdc++, …) are
missing the LTO data (and some data we might hide from LTO). We also
need to get rid of the force_output flag; not because those functions
shouldn't be output, but because when it is set, some legitimate
optimizations are disabled.
Tobias