https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
movq/pushq etc. aren't that expensive; if it affects performance, it must be
something in the inner loops.

A compiler switch that ignores omp target, teams and distribute would
basically create a new OpenMP version if it ignored the requirements on those
constructs.  You can achieve that yourself by emitting those directives
through _Pragma in a macro and defining the macro conditionally based on
whether you want offloading or not; then the "you can ignore all side
effects" decision is yours.

For OpenMP 5.0, there is some work on prescriptive vs. descriptive
clauses/constructs, where in your case you could just describe that the loop
can be parallelized, simdized and/or offloaded, and leave it up to the
implementation what it does with that.

What we perhaps could do, when not offloading, is try to simplify omp
distribute (if we know omp_get_num_teams () will always be 1), either just by
folding the library calls in that case to 1 or 0, or perhaps doing some more.

This in your testcase is just wrong:

#pragma omp target teams
{
  num_teams = omp_get_num_teams ();
}
#pragma omp parallel
{
  num_threads = omp_get_num_threads ();
}

The target would be OK in OpenMP 4.0, but it is not in 4.5: num_teams, being
a scalar variable, is firstprivate, so you won't get the value back.  The
parallel is racy; to avoid the race you'd need #pragma omp single or
#pragma omp master.

Why are you using separate distribute and parallel for constructs and
prescribing what they handle, instead of just using

#pragma omp distribute parallel for
for (int i = 0; i < N; ++i)
  D[i] += B[i] * C[i];

? Do you expect or see any gains from that?
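
A minimal sketch of the _Pragma macro approach described above.  The PRAGMA
and OMP_FOR macro names, the USE_OFFLOAD switch and the vmla function are
illustrative, not an established API; the map clauses apply only in the
offloading branch, since map is not valid on a plain parallel for:

#include <stddef.h>

/* Standard C99 idiom: stringize the tokens so _Pragma accepts them.  */
#define PRAGMA(x) _Pragma (#x)

#ifdef USE_OFFLOAD
/* Offloading build: one combined construct, with explicit mapping.  */
# define OMP_FOR(maps) PRAGMA (omp target teams distribute parallel for maps)
#else
/* Host-only build: the map clauses are simply dropped.  */
# define OMP_FOR(maps) PRAGMA (omp parallel for)
#endif

void
vmla (int n, double *d, const double *b, const double *c)
{
  OMP_FOR (map(tofrom: d[0:n]) map(to: b[0:n], c[0:n]))
  for (int i = 0; i < n; ++i)
    d[i] += b[i] * c[i];
}

Compiling with -DUSE_OFFLOAD selects the target version; without it, the
same source parallelizes on the host, and it is the programmer, not the
compiler, who decides that ignoring the target semantics is safe.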
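
For reference, the descriptive direction mentioned above is what eventually
became the OpenMP 5.0 loop construct.  A hedged sketch (this postdates the
comment, and compiler support varies):

/* Descriptive: the implementation chooses how to distribute,
   parallelize and/or vectorize the iterations.  */
#pragma omp target teams loop map(tofrom: D[0:N]) map(to: B[0:N], C[0:N])
for (int i = 0; i < N; ++i)
  D[i] += B[i] * C[i];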
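
A sketch of the testcase snippet corrected along the lines of the comment,
assuming the intent is to read both values back on the host; the query
function name is illustrative:

#include <omp.h>

int num_teams, num_threads;

void
query (void)
{
  /* map(from:) overrides the OpenMP 4.5 default firstprivate treatment of
     scalars on target, so the value written on the device is copied back.
     Restricting the write to team 0 keeps the teams region race-free.  */
  #pragma omp target teams map(from: num_teams)
  {
    if (omp_get_team_num () == 0)
      num_teams = omp_get_num_teams ();
  }

  /* single ensures exactly one thread writes, avoiding the data race.  */
  #pragma omp parallel
  #pragma omp single
  num_threads = omp_get_num_threads ();
}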