Hi,

On 10/14/25 18:28, Tobias Burnus wrote:
Josef Melcr wrote:
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
                   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_VAR,
-                  ATTR_NOTHROW_LIST)
+                  ATTR_CALLBACK_OACC_LIST)

Thus, I wonder whether this should be skipped - and handled
in lock step with the OpenMP/omp target support.
Oh, I totally missed that, thank you.  Since the kernel has noclone, the extra edges shouldn't really disrupt it, but I agree it should be handled at once.  Sorry about that.  Should I exclude it and/or resend?

For me, just excluding it is enough. (You might want to send an email once you have committed this patch - and you could attach the final commit to that email.)
Okay.  I was going to send that email anyway, I am just a bit anxious since it's my first real commit, so I am just making sure. :)

* * *

From the other email thread:

The propagation is not considered profitable enough
OK, missed that. I guess for real-world code, it will work (and possibly some tuning will make it work also without).

* * *
It might not, I've had issues with getting the pass to propagate before, but I thought those were already fixed.  Jakub's idea from the other thread might make it work though.
could we bring the device code into LTO to optimize it further?
If you talk about optimizations on the whole program: the host LTO is already run with the to-be-offloaded functions, but the device-function stream out happens rather early.

If you are talking about device-side LTO: This requires some reorganization of how it is handled and compiling the device-side libraries as thick libraries (non-LTO and LTO code). Additionally, it requires that the device-side linker supports the linker plugin. — There are rather explicit plans on our side (BayLibre) to get this working (at least for AMD GPUs/gcn) – as mentioned during the offloading BoF, but it will take a while.

Otherwise, the device side already always sees all offload functions, as all (host) TUs with offload LTO data are fed into the same device-side lto1 compiler; "just" the libraries (libgfortran, libgomp, libstdc++, …) are missing the LTO data (and some data we might hide from LTO). We also need to get rid of the force_output flag; not because they shouldn't be output, but because when it is set, some legit optimizations are disabled.
Oh I see.  So till then, delaying the output does seem like the best option.  Thank you for clearing that up :)
Tobias

Best regards,

Josef
