On Sun, Oct 16, 2022 at 07:23:05PM -0600, Sandra Loosemore wrote:
> My sense is that the first approach would be more straightforward than the
> second one, and I am willing to continue to work on that.  However, I think
> I need some direction to get started, as I presently know nothing about
> cgraph and I was unable to find any useful overview or interface
> documentation in the GCC internals manual.  Is this as simple as inserting
> an existing pass into the passlist to clean up after vectorization, or does
> it involve writing something more or less from scratch?

We (as I've discovered during the work on assumptions) have
TODO_discard_function, which, when returned from an execute pass, throws
away a function completely (except, for now, assumption functions, for which
it doesn't release the body; this could be done in some pass shortly after
IPA, or alternatively before expansion).

But another thing that needs to be done is for the non-public declare simd
clones (both explicit and implicit from your patch) to be ordered in cgraph
after anything that has a cgraph edge to their original function.
I don't know how to do that; you should talk to Honza, Richi or Martin
about it.
I think the current behavior is that callees are processed before callers
if possible (unless there are cycles), which is certainly what we want for,
say, assume functions, or IPA RA etc.  But in the case of non-public simd
clones we want to do it the other way around (at the expense of IPA RA), so
that we can throw away functions which aren't needed.

> > I admit I don't remember where exactly the simd clone happens wrt. other
> > IPA passes, but I think it is late pass; so, does it happen for GCN
> > offloading only in the lto1 offloading compiler?
> > Shouldn't the auto optimization be then done only in the offloading
> > lto1 for GCN then (say guard on targetm boolean)?
>
> I'm afraid I don't know much about offloading, but I was under the
> impression it all goes through the same compilation process, just with a
> different target?

I've looked at it today and it seems late IPA passes are executed after LTO
bytecode is streamed back in.
If you, say, try:

#pragma omp declare simd
int foo (int x) { return x; }

int
main ()
{
  int a[64] = {};
#pragma omp target map(a)
#pragma omp simd
  for (int i = 0; i < 64; i++)
    a[i] = foo (a[i]);
}

with
gcc -foffload-options='-fdump-tree-all -fdump-ipa-all' -fdump-tree-all -fdump-ipa-all -O2 -fopenmp a.c -o a
you ought to see the simdclone dump both as
a.c.*i.simdclone
and
a.x*.mkoffload.*i.simdclone
where the former is what is done for the host code (and host fallback),
while the latter is what is done in the offloading lto.

I can't verify it 100%, because I have only nvptx-none offloading configured,
and in that case pass_omp_simd_clone::gate is disabled in the offloading lto
because targetm.simd_clone.compute_vecsize_and_simdlen is NULL for nvptx.
But it is non-NULL for gcn.

Thus, IMHO it is exactly the pass_omp_simd_clone pass where you want to
implement this auto-simdization discovery, guarded with
#ifdef ACCEL_COMPILER and the new option (which means it will be done only
for gcn and not on the host right now).

And do it at the start of ipa_omp_simd_clone, before the
  FOR_EACH_FUNCTION (node)
    expand_simd_clones (node);
loop, or, if it is a purely local decision for each function, at the start
of expand_simd_clones with similar guarding; punt on functions with the
"noclone" attribute, or !node->definition.
You need to repeat the
  if (node->has_gimple_body_p ())
    node->get_body ();
to get the body before you analyze it.

And please put the new functions for such analysis into omp-simd-clone.cc,
where they belong.

	Jakub
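
The guarding described above can be sketched as a toy model in plain C. This is only an illustration under stated assumptions: it does not use GCC's real data structures, and every name starting with toy_ is hypothetical. The function pointer stands in for targetm.simd_clone.compute_vecsize_and_simdlen (NULL for nvptx, non-NULL for gcn), the accel_compiler flag stands in for #ifdef ACCEL_COMPILER, and option_enabled stands in for the new option, whose name is not settled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the target hook table; NOT GCC's real targetm.  */
struct toy_target
{
  const char *name;
  /* Models targetm.simd_clone.compute_vecsize_and_simdlen:
     NULL means the target provides no simd clone support.  */
  int (*compute_vecsize_and_simdlen) (void);
};

/* Toy stand-in for a cgraph node; only the fields the filter needs.  */
struct toy_node
{
  const char *name;
  bool definition;	/* Models node->definition.  */
  bool noclone;		/* Models the "noclone" attribute.  */
};

/* Hypothetical hook body; the return value is irrelevant here.  */
int
toy_gcn_hook (void)
{
  return 1;
}

const struct toy_target toy_nvptx = { "nvptx", NULL };
const struct toy_target toy_gcn = { "gcn", toy_gcn_hook };

/* Models the pass gate: the pass runs only if the hook is non-NULL.  */
bool
toy_gate (const struct toy_target *t)
{
  assert (t != NULL);
  return t->compute_vecsize_and_simdlen != NULL;
}

/* Models the proposed guard: accelerator compiler only, new option
   enabled, and punt on noclone functions or nodes with no definition.  */
bool
toy_auto_simd_candidate (const struct toy_node *n, bool accel_compiler,
			 bool option_enabled)
{
  assert (n != NULL);
  if (!accel_compiler || !option_enabled)
    return false;
  if (n->noclone || !n->definition)
    return false;
  return true;
}
```

In this model, toy_gate (&toy_nvptx) is false and toy_gate (&toy_gcn) is true, and a node is only considered when it is a definition without "noclone" and both the accelerator and option conditions hold. The real implementation would additionally need to fetch the function body before analyzing it, which the toy leaves out.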