https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
movq/pushq etc. aren't that expensive; if it affects performance it must be
something in the inner loops.  A compiler switch that ignores omp target, teams
and distribute would basically create a new OpenMP version if it ignored
the requirements on those constructs.  You can achieve the same yourself by
emitting those directives through _Pragma in a macro and defining the macro
conditionally based on whether you want offloading or not; then the "you can
ignore all side effects" decision is yours.  For OpenMP 5.0, there is some work
on prescriptive vs. descriptive clauses/constructs, where in your case you
could just describe that the loop can be parallelized, simdized and/or
offloaded and leave it up to the implementation what it does with that.

What we perhaps could do when not offloading is try to simplify omp distribute
(if we know omp_get_num_teams () will always be 1), either just by folding the
library calls to 1 or 0 in that case, or perhaps doing some more.

#pragma omp target teams
        {
                num_teams=omp_get_num_teams();
        }

#pragma omp parallel
        {
                num_threads=omp_get_num_threads();
        }
in your testcase is just wrong: the target would be OK in OpenMP 4.0, but it is
not in 4.5, where num_teams, being a scalar variable, is firstprivate, so you
won't get the value back.
The parallel is racy; to avoid the race you'd need #pragma omp single or
#pragma omp master.

Why are you using separate distribute and parallel for constructs and
prescribing what each of them handles, instead of just using
#pragma omp distribute parallel for
  for (int i = 0; i < N; ++i) D[i] += B[i] * C[i];
?  Do you expect or see any gains from that?
