Hi! So I recently had reason to verify how 'static' variables behave in OpenACC compute construct regions as well as OpenACC 'routine'. Two weeks ago I started writing a few testcases -- and today then wondered if maybe there's something in the GCC archives about this. And there is: directly related are <https://gcc.gnu.org/PR84991> "[openacc] Misleading error message for function static var in routine", <https://gcc.gnu.org/PR84992> "[openacc] function static var in parallel", and more generally <https://gcc.gnu.org/PR90779> "Fortran array initialization in offload regions" that is discussed here. (I had taken part in at least some of these discussions, yet didn't directly remember these now... 8-| Too much going on?)
Anyway: On 2019-06-15T10:19:23+0200, Tom de Vries <tdevr...@suse.de> wrote: > On 13-06-19 10:34, Jakub Jelinek wrote: >> The OpenMP specification isn't clear on this, I'll work on getting that >> clarified for 5.1, but the agreement on omp-lang has been that it should >> work the way the patch implements it, static block scope variables inside of >> #pragma omp target or #pragma omp declare target routines are handled as if >> they have #pragma omp declare target to (variable). >> >> Bootstrapped/regtested on x86_64-linux and i686-linux, unfortunately it >> regresses: >> +FAIL: c-c++-common/goacc/routine-5.c (test for errors, line 204) >> >> Thus, I'm not committing it right now and want to ask what should be done >> for OpenACC. > > OpenACC 2.6 - 2.15.1. Routine Directive - Restrictions: > ... > In C and C++, function static variables are not supported in functions > to which a routine directive applies. > ... > [ And text is still the same for 2.7. ] ..., and still in OpenACC 3.1. But yes, that seems somewhat incomplete and/or inconsistent. I've filed <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static' variables" (only visible to members of the GitHub OpenACC organization). I do agree that even if OpenACC ultimately doesn't want to support certain cases of (?) 'static' variables, we still have to (and evidently can) support 'static' for compiler-synthesized variables, per our own desired semantics, which happen to match OpenMP's (as I understand this). >> The patch uses & ORT_TARGET tests, so it affects both OpenMP >> target region, and OpenACC parallel/kernels and both OpenMP and OpenACC >> target routines. Is it ok to do it that way and just adjust the routine-5.c >> test, or shall it test (ctx->region_type & (ORT_TARGET | ORT_ACC)) == >> ORT_TARGET, i.e. only OpenMP and not OpenACC? If so, there is still the >> problem that gimplify_body.c does: >> if (flag_openacc || flag_openmp) >> { >> gcc_assert (gimplify_omp_ctxp == NULL); >> if (lookup_attribute ("omp declare target", DECL_ATTRIBUTES (fndecl))) >> gimplify_omp_ctxp = new_omp_context (ORT_TARGET); >> } >> We'd need different attribute (or additional attribute) for OpenACC routines >> and would need to use new_omp_context (cond ? ORT_TARGET : ORT_ACC_PARALLEL) >> or similar to express OpenACC routines. I'm fine to have this supported for GCC/OpenACC in the way that it currently is, so no need to special-case that. If OpenACC decides otherwise, we'll then adjust. (I have not reviewed the PR90779 code changes; it was sufficient for my case to understand what I called GCC's observed behavior.) I've now pushed "Add 'libgomp.oacc-c-c++-common/static-variable-1.c' [PR84991, PR84992, PR90779]" to master branch in commit ffa0ae6eeef3ad15d3f288283e4c477193052f1a, and releases/gcc-10 branch in commit 60b589b5858fb8ad414583c6b493e0897f1bde5f, see attached. (The PR90779 code changes never got backported to GCC 9 and 8 release branches.) Also I've filed <https://gcc.gnu.org/PR100001> "[GCN offloading] Occasional C++ 'libgomp.oacc-c-c++-common/static-variable-1.c' execution failure". Grüße Thomas >> 2019-06-12 Jakub Jelinek <ja...@redhat.com> >> >> PR middle-end/90779 >> * gimplify.c (gimplify_bind_expr): Add "omp declare target" attributes >> to static block scope variables inside of target region or target >> functions. >> >> * testsuite/libgomp.c/pr90779.c: New test. >> * testsuite/libgomp.fortran/pr90779.f90: New test. >> >> --- gcc/gimplify.c.jj 2019-06-10 19:42:03.868959986 +0200 >> +++ gcc/gimplify.c 2019-06-12 13:00:18.765167777 +0200 >> @@ -1323,17 +1323,37 @@ gimplify_bind_expr (tree *expr_p, gimple >> struct gimplify_omp_ctx *ctx = gimplify_omp_ctxp; >> >> /* Mark variable as local. */ >> - if (ctx && ctx->region_type != ORT_NONE && !DECL_EXTERNAL (t) >> - && (! DECL_SEEN_IN_BIND_EXPR_P (t) >> - || splay_tree_lookup (ctx->variables, >> - (splay_tree_key) t) == NULL)) >> + if (ctx && ctx->region_type != ORT_NONE && !DECL_EXTERNAL (t)) >> { >> - if (ctx->region_type == ORT_SIMD >> - && TREE_ADDRESSABLE (t) >> - && !TREE_STATIC (t)) >> - omp_add_variable (ctx, t, GOVD_PRIVATE | GOVD_SEEN); >> - else >> - omp_add_variable (ctx, t, GOVD_LOCAL | GOVD_SEEN); >> + if (! DECL_SEEN_IN_BIND_EXPR_P (t) >> + || splay_tree_lookup (ctx->variables, >> + (splay_tree_key) t) == NULL) >> + { >> + if (ctx->region_type == ORT_SIMD >> + && TREE_ADDRESSABLE (t) >> + && !TREE_STATIC (t)) >> + omp_add_variable (ctx, t, GOVD_PRIVATE | GOVD_SEEN); >> + else >> + omp_add_variable (ctx, t, GOVD_LOCAL | GOVD_SEEN); >> + } >> + /* Static locals inside of target construct or offloaded >> + routines need to be "omp declare target". */ >> + if (TREE_STATIC (t)) >> + for (; ctx; ctx = ctx->outer_context) >> + if ((ctx->region_type & ORT_TARGET) != 0) >> + { >> + if (!lookup_attribute ("omp declare target", >> + DECL_ATTRIBUTES (t))) >> + { >> + tree id = get_identifier ("omp declare target"); >> + DECL_ATTRIBUTES (t) >> + = tree_cons (id, NULL_TREE, DECL_ATTRIBUTES (t)); >> + varpool_node *node = varpool_node::get (t); >> + if (node) >> + node->offloadable = 1; >> + } >> + break; >> + } >> } >> >> DECL_SEEN_IN_BIND_EXPR_P (t) = 1; >> --- libgomp/testsuite/libgomp.c/pr90779.c.jj 2019-06-12 13:01:57.081667587 >> +0200 >> +++ libgomp/testsuite/libgomp.c/pr90779.c 2019-06-12 12:41:15.637730797 >> +0200 >> @@ -0,0 +1,18 @@ >> +/* PR middle-end/90779 */ >> + >> +extern void abort (void); >> + >> +int >> +main () >> +{ >> + int i, j; >> + for (i = 0; i < 2; ++i) >> + #pragma omp target map(from: j) >> + { >> + static int k = 5; >> + j = ++k; >> + } >> + if (j != 7) >> + abort (); >> + return 0; >> +} >> --- libgomp/testsuite/libgomp.fortran/pr90779.f90.jj 2019-06-12 >> 12:43:17.891825811 +0200 >> +++ libgomp/testsuite/libgomp.fortran/pr90779.f90 2019-06-12 >> 12:43:08.421973375 +0200 >> @@ -0,0 +1,12 @@ >> +! PR middle-end/90779 >> + >> +program pr90779 >> + implicit none >> + integer :: v(4), i >> + >> + !$omp target map(from:v) >> + v(:) = (/ (i, i=1,4) /) >> + !$omp end target >> + >> + if (any (v .ne. (/ (i, i=1,4) /))) stop 1 >> +end program >> >> Jakub >> ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
>From ffa0ae6eeef3ad15d3f288283e4c477193052f1a Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <tho...@codesourcery.com> Date: Fri, 9 Apr 2021 16:03:32 +0200 Subject: [PATCH] Add 'libgomp.oacc-c-c++-common/static-variable-1.c' [PR84991, PR84992, PR90779] libgomp/ PR middle-end/84991 PR middle-end/84992 PR middle-end/90779 * testsuite/libgomp.oacc-c-c++-common/static-variable-1.c: New. --- .../static-variable-1.c | 460 ++++++++++++++++++ 1 file changed, 460 insertions(+) create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c new file mode 100644 index 00000000000..1d415cdcf76 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c @@ -0,0 +1,460 @@ +/* "Function scope" (top-level block scope) 'static' variables + + ... inside OpenACC compute construct regions as well as OpenACC 'routine'. + + This is to document/verify aspects of GCC's observed behavior, not + necessarily as it's (intended to be?) restricted by the OpenACC + specification. See also PR84991, PR84992, PR90779 etc., and + <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static' + variables" (only visible to members of the GitHub OpenACC organization). +*/ + + +#undef NDEBUG +#include <assert.h> +#include <string.h> +#include <openacc.h> +#include <gomp-constants.h> + + +#define IF_DEBUG if (0) + + +/* Without explicit 'num_gangs'. */ + +static void t0_c(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 11; + const int var_init = 16; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + static int var = var_init; + +#pragma acc atomic capture + result = ++var; + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + /* Without an explicit 'num_gangs' clause GCC assigns 'num_gangs(1)' + because it doesn't see any use of gang-level parallelism inside the + region. */ + assert(num_gangs_actual == 1); + assert(result == var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Call a gang-level routine. */ + +static const int t0_r_var_init = 61; + +#pragma acc routine gang +__attribute__((noinline)) +static int t0_r_r(void) +{ + static int var = t0_r_var_init; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t0_r(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 11; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + result = t0_r_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + /* The number of gangs selected by the implemention ought to but must not + be bigger than one. */ + IF_DEBUG + __builtin_printf ("%d: num_gangs_actual: %d\n", i, num_gangs_actual); + assert(num_gangs_actual >= 1); + assert(result == t0_r_var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Explicit 'num_gangs'. */ + +static void t1_c(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 22; + const int num_gangs_request = 444; + const int var_init = 5; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + static int var = var_init; + +#pragma acc atomic capture + result = ++var; + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + if (acc_get_device_type() == acc_device_host) + assert(num_gangs_actual == 1); + else + assert(num_gangs_actual == num_gangs_request); + assert(result == var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Check the same routine called from two compute constructs. */ + +static const int t1_r2_var_init = 166; + +#pragma acc routine gang +__attribute__((noinline)) +static int t1_r2_r(void) +{ + static int var = t1_r2_var_init; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t1_r2(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 71; + /* The checking assumes the same 'num_gangs' for all compute constructs. */ + const int num_gangs_request = 333; + int num_gangs_actual = -1; + if (acc_get_device_type() == acc_device_host) + num_gangs_actual = 1; + else + { + /* We're assuming that the implementation is able to accomodate the + 'num_gangs' requested (which really ought to be true for + 'num_gangs'). */ + num_gangs_actual = num_gangs_request; + } + + for (int i = 0; i < i_limit; ++i) + { + int result_1 = 0; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:result_1) + { + result_1 = t1_r2_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result_1'. */ + } + IF_DEBUG + __builtin_printf ("%d: result_1: %d\n", i, result_1); + assert(result_1 == t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 0))); + + int result_2 = 0; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:result_2) + { + result_2 = t1_r2_r() + t1_r2_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented '2 * num_gangs_actual' times. + However, the order of the two 't1_r2_r' function calls is not + synchronized (between different gang-redundant threads). We thus + cannot verify the actual 'result_2' values in this case. */ + } + IF_DEBUG + __builtin_printf ("%d: result_2: %d\n", i, result_2); + if (num_gangs_actual == 1) + /* Per the rationale above, only in this case we can check the actual + result. */ + assert(result_2 == (t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 1)) + + t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 2)))); + /* But we can generally check low and high limits. */ + { + /* Must be bigger than '2 * result_1'. */ + int c = 2 * result_1; + IF_DEBUG + __builtin_printf (" > %d\n", c); + assert(result_2 > c); + } + { + /* ..., but limited by the base value for next 'i'. */ + int c = 2 * (t1_r2_var_init + num_gangs_actual * (0 + ((i + 1) * 3 + 0))); + IF_DEBUG + __builtin_printf (" < %d\n", c); + assert(result_2 < c); + } + } +} + + +/* Asynchronous execution. */ + +static const int t2_var_init_2 = -55; + +#pragma acc routine gang +__attribute__((noinline)) +static int t2_r(void) +{ + static int var = t2_var_init_2; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t2(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 12; + const int num_gangs_request_1 = 14; + const int var_init_1 = 5; + int results_1[i_limit][num_gangs_request_1]; + memset (results_1, 0, sizeof results_1); + const int num_gangs_request_2 = 5; + int results_2[i_limit][num_gangs_request_2]; + memset (results_2, 0, sizeof results_2); + const int num_gangs_request_3 = 34; + const int var_init_3 = 1250; + int results_3[i_limit][num_gangs_request_3]; + memset (results_3, 0, sizeof results_3); + +#pragma acc data \ + copy(results_1, results_2, results_3) + { + for (int i = 0; i < i_limit; ++i) + { + /* The following 'async' clauses effect asynchronous execution, but + using the same async-argument for each compute construct implies that + the respective compute constructs' execution is synchronized with + itself, meaning that all 'i = 0' execution has finished (on the + device) before 'i = 1' is started (on the device), etc. */ + +#pragma acc parallel \ + present(results_1) \ + num_gangs(num_gangs_request_1) \ + async(1) + { + static int var = var_init_1; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + results_1[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp; + } + +#pragma acc parallel \ + present(results_2) \ + num_gangs(num_gangs_request_2) \ + async(2) + { + results_2[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += t2_r(); + } + +#pragma acc parallel \ + present(results_3) \ + num_gangs(num_gangs_request_3) \ + async(3) + { + static int var = var_init_3; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + results_3[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp; + } + } +#pragma acc wait + } + int num_gangs_actual_1; + int num_gangs_actual_2; + int num_gangs_actual_3; + if (acc_get_device_type() == acc_device_host) + { + num_gangs_actual_1 = 1; + num_gangs_actual_2 = 1; + num_gangs_actual_3 = 1; + } + else + { + /* We're assuming that the implementation is able to accomodate the + 'num_gangs' requested (which really ought to be true for + 'num_gangs'). */ + num_gangs_actual_1 = num_gangs_request_1; + num_gangs_actual_2 = num_gangs_request_2; + num_gangs_actual_3 = num_gangs_request_3; + } + + /* For 'i = 0', 'results_*[i][0..num_gangs_actual_*]' are expected to each + contain one value of '(1 + var_init_*)..(var_init_* + num_gangs_actual_*)', + and so on for increasing 'i'. Their order however is unspecified due to + the gang-redundant execution. (Thus checking that their sums match.) */ + + int result_1 = 0; + int result_2 = 0; + int result_3 = 0; + for (int i = 0; i < i_limit; ++i) + { + int result_1_ = 0; + for (int g = 0; g < num_gangs_actual_1; ++g) + { + IF_DEBUG + __builtin_printf ("results_1[%d][%d]: %d\n", i, g, results_1[i][g]); + result_1_ += results_1[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_1_: %d\n", i, result_1_); + assert (result_1_ == (((var_init_1 + num_gangs_actual_1 * (1 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (1 + i)) / 2) + - ((var_init_1 + num_gangs_actual_1 * (0 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (0 + i)) / 2))); + result_1 += result_1_; + + int result_2_ = 0; + for (int g = 0; g < num_gangs_actual_2; ++g) + { + IF_DEBUG + __builtin_printf ("results_2[%d][%d]: %d\n", i, g, results_2[i][g]); + result_2_ += results_2[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_2_: %d\n", i, result_2_); + assert (result_2_ == (((t2_var_init_2 + num_gangs_actual_2 * (1 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (1 + i)) / 2) + - ((t2_var_init_2 + num_gangs_actual_2 * (0 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (0 + i)) / 2))); + result_2 += result_2_; + + int result_3_ = 0; + for (int g = 0; g < num_gangs_actual_3; ++g) + { + IF_DEBUG + __builtin_printf ("results_3[%d][%d]: %d\n", i, g, results_3[i][g]); + result_3_ += results_3[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_3_: %d\n", i, result_3_); + assert (result_3_ == (((var_init_3 + num_gangs_actual_3 * (1 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (1 + i)) / 2) + - ((var_init_3 + num_gangs_actual_3 * (0 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (0 + i)) / 2))); + result_3 += result_3_; + } + IF_DEBUG + __builtin_printf ("result_1: %d\n", result_1); + assert (result_1 == (((var_init_1 + num_gangs_actual_1 * i_limit) * (1 + var_init_1 + num_gangs_actual_1 * i_limit) / 2) + - (var_init_1 * (var_init_1 + 1) / 2))); + IF_DEBUG + __builtin_printf ("result_2: %d\n", result_2); + assert (result_2 == (((t2_var_init_2 + num_gangs_actual_2 * i_limit) * (1 + t2_var_init_2 + num_gangs_actual_2 * i_limit) / 2) + - (t2_var_init_2 * (t2_var_init_2 + 1) / 2))); + IF_DEBUG + __builtin_printf ("result_3: %d\n", result_3); + assert (result_3 == (((var_init_3 + num_gangs_actual_3 * i_limit) * (1 + var_init_3 + num_gangs_actual_3 * i_limit) / 2) + - (var_init_3 * (var_init_3 + 1) / 2))); +} + + +#pragma acc routine seq +__attribute__((noinline)) +static int pr84991_1_r_s(int n) +{ + static const int test[] = {1,2,3,4}; + return test[n]; +} + +static void pr84991_1(void) +{ + int n[1]; + n[0] = 3; +#pragma acc parallel copy(n) + { + n[0] = pr84991_1_r_s(n[0]); + } + assert(n[0] == 4); +} + + +static void pr84992_1(void) +{ + int n[1]; + n[0] = 3; +#pragma acc parallel copy(n) + { + static const int test[] = {1,2,3,4}; + n[0] = test[n[0]]; + } + assert(n[0] == 4); +} + + +int main(void) +{ + t0_c(); + + t0_r(); + + t1_c(); + + t1_r2(); + + t2(); + + pr84991_1(); + + pr84992_1(); + + return 0; +} -- 2.30.2
>From 60b589b5858fb8ad414583c6b493e0897f1bde5f Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <tho...@codesourcery.com> Date: Fri, 9 Apr 2021 16:03:32 +0200 Subject: [PATCH] Add 'libgomp.oacc-c-c++-common/static-variable-1.c' [PR84991, PR84992, PR90779] libgomp/ PR middle-end/84991 PR middle-end/84992 PR middle-end/90779 * testsuite/libgomp.oacc-c-c++-common/static-variable-1.c: New. (cherry picked from commit ffa0ae6eeef3ad15d3f288283e4c477193052f1a) --- .../static-variable-1.c | 460 ++++++++++++++++++ 1 file changed, 460 insertions(+) create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c new file mode 100644 index 00000000000..1d415cdcf76 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/static-variable-1.c @@ -0,0 +1,460 @@ +/* "Function scope" (top-level block scope) 'static' variables + + ... inside OpenACC compute construct regions as well as OpenACC 'routine'. + + This is to document/verify aspects of GCC's observed behavior, not + necessarily as it's (intended to be?) restricted by the OpenACC + specification. See also PR84991, PR84992, PR90779 etc., and + <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static' + variables" (only visible to members of the GitHub OpenACC organization). +*/ + + +#undef NDEBUG +#include <assert.h> +#include <string.h> +#include <openacc.h> +#include <gomp-constants.h> + + +#define IF_DEBUG if (0) + + +/* Without explicit 'num_gangs'. */ + +static void t0_c(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 11; + const int var_init = 16; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + static int var = var_init; + +#pragma acc atomic capture + result = ++var; + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + /* Without an explicit 'num_gangs' clause GCC assigns 'num_gangs(1)' + because it doesn't see any use of gang-level parallelism inside the + region. */ + assert(num_gangs_actual == 1); + assert(result == var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Call a gang-level routine. */ + +static const int t0_r_var_init = 61; + +#pragma acc routine gang +__attribute__((noinline)) +static int t0_r_r(void) +{ + static int var = t0_r_var_init; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t0_r(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 11; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + result = t0_r_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + /* The number of gangs selected by the implemention ought to but must not + be bigger than one. */ + IF_DEBUG + __builtin_printf ("%d: num_gangs_actual: %d\n", i, num_gangs_actual); + assert(num_gangs_actual >= 1); + assert(result == t0_r_var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Explicit 'num_gangs'. */ + +static void t1_c(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 22; + const int num_gangs_request = 444; + const int var_init = 5; + + for (int i = 0; i < i_limit; ++i) + { + int result = 0; + int num_gangs_actual = -1; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:num_gangs_actual) \ + reduction(max:result) + { + num_gangs_actual = 1 + __builtin_goacc_parlevel_id(GOMP_DIM_GANG); + + static int var = var_init; + +#pragma acc atomic capture + result = ++var; + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result'. */ + } + if (acc_get_device_type() == acc_device_host) + assert(num_gangs_actual == 1); + else + assert(num_gangs_actual == num_gangs_request); + assert(result == var_init + num_gangs_actual * (1 + i)); + } +} + + +/* Check the same routine called from two compute constructs. */ + +static const int t1_r2_var_init = 166; + +#pragma acc routine gang +__attribute__((noinline)) +static int t1_r2_r(void) +{ + static int var = t1_r2_var_init; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t1_r2(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 71; + /* The checking assumes the same 'num_gangs' for all compute constructs. */ + const int num_gangs_request = 333; + int num_gangs_actual = -1; + if (acc_get_device_type() == acc_device_host) + num_gangs_actual = 1; + else + { + /* We're assuming that the implementation is able to accomodate the + 'num_gangs' requested (which really ought to be true for + 'num_gangs'). */ + num_gangs_actual = num_gangs_request; + } + + for (int i = 0; i < i_limit; ++i) + { + int result_1 = 0; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:result_1) + { + result_1 = t1_r2_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented 'num_gangs_actual' times, and + the final value captured as 'result_1'. */ + } + IF_DEBUG + __builtin_printf ("%d: result_1: %d\n", i, result_1); + assert(result_1 == t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 0))); + + int result_2 = 0; +#pragma acc parallel \ + num_gangs(num_gangs_request) \ + reduction(max:result_2) + { + result_2 = t1_r2_r() + t1_r2_r(); + + /* Irrespective of the order in which the gang-redundant threads + execute, 'var' has now been incremented '2 * num_gangs_actual' times. + However, the order of the two 't1_r2_r' function calls is not + synchronized (between different gang-redundant threads). We thus + cannot verify the actual 'result_2' values in this case. */ + } + IF_DEBUG + __builtin_printf ("%d: result_2: %d\n", i, result_2); + if (num_gangs_actual == 1) + /* Per the rationale above, only in this case we can check the actual + result. */ + assert(result_2 == (t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 1)) + + t1_r2_var_init + num_gangs_actual * (1 + (i * 3 + 2)))); + /* But we can generally check low and high limits. */ + { + /* Must be bigger than '2 * result_1'. */ + int c = 2 * result_1; + IF_DEBUG + __builtin_printf (" > %d\n", c); + assert(result_2 > c); + } + { + /* ..., but limited by the base value for next 'i'. */ + int c = 2 * (t1_r2_var_init + num_gangs_actual * (0 + ((i + 1) * 3 + 0))); + IF_DEBUG + __builtin_printf (" < %d\n", c); + assert(result_2 < c); + } + } +} + + +/* Asynchronous execution. */ + +static const int t2_var_init_2 = -55; + +#pragma acc routine gang +__attribute__((noinline)) +static int t2_r(void) +{ + static int var = t2_var_init_2; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + return tmp; +} + +static void t2(void) +{ + IF_DEBUG + __builtin_printf ("%s\n", __FUNCTION__); + + const int i_limit = 12; + const int num_gangs_request_1 = 14; + const int var_init_1 = 5; + int results_1[i_limit][num_gangs_request_1]; + memset (results_1, 0, sizeof results_1); + const int num_gangs_request_2 = 5; + int results_2[i_limit][num_gangs_request_2]; + memset (results_2, 0, sizeof results_2); + const int num_gangs_request_3 = 34; + const int var_init_3 = 1250; + int results_3[i_limit][num_gangs_request_3]; + memset (results_3, 0, sizeof results_3); + +#pragma acc data \ + copy(results_1, results_2, results_3) + { + for (int i = 0; i < i_limit; ++i) + { + /* The following 'async' clauses effect asynchronous execution, but + using the same async-argument for each compute construct implies that + the respective compute constructs' execution is synchronized with + itself, meaning that all 'i = 0' execution has finished (on the + device) before 'i = 1' is started (on the device), etc. */ + +#pragma acc parallel \ + present(results_1) \ + num_gangs(num_gangs_request_1) \ + async(1) + { + static int var = var_init_1; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + results_1[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp; + } + +#pragma acc parallel \ + present(results_2) \ + num_gangs(num_gangs_request_2) \ + async(2) + { + results_2[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += t2_r(); + } + +#pragma acc parallel \ + present(results_3) \ + num_gangs(num_gangs_request_3) \ + async(3) + { + static int var = var_init_3; + + int tmp; +#pragma acc atomic capture + tmp = ++var; + + results_3[i][__builtin_goacc_parlevel_id(GOMP_DIM_GANG)] += tmp; + } + } +#pragma acc wait + } + int num_gangs_actual_1; + int num_gangs_actual_2; + int num_gangs_actual_3; + if (acc_get_device_type() == acc_device_host) + { + num_gangs_actual_1 = 1; + num_gangs_actual_2 = 1; + num_gangs_actual_3 = 1; + } + else + { + /* We're assuming that the implementation is able to accomodate the + 'num_gangs' requested (which really ought to be true for + 'num_gangs'). */ + num_gangs_actual_1 = num_gangs_request_1; + num_gangs_actual_2 = num_gangs_request_2; + num_gangs_actual_3 = num_gangs_request_3; + } + + /* For 'i = 0', 'results_*[i][0..num_gangs_actual_*]' are expected to each + contain one value of '(1 + var_init_*)..(var_init_* + num_gangs_actual_*)', + and so on for increasing 'i'. Their order however is unspecified due to + the gang-redundant execution. (Thus checking that their sums match.) */ + + int result_1 = 0; + int result_2 = 0; + int result_3 = 0; + for (int i = 0; i < i_limit; ++i) + { + int result_1_ = 0; + for (int g = 0; g < num_gangs_actual_1; ++g) + { + IF_DEBUG + __builtin_printf ("results_1[%d][%d]: %d\n", i, g, results_1[i][g]); + result_1_ += results_1[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_1_: %d\n", i, result_1_); + assert (result_1_ == (((var_init_1 + num_gangs_actual_1 * (1 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (1 + i)) / 2) + - ((var_init_1 + num_gangs_actual_1 * (0 + i)) * (1 + var_init_1 + num_gangs_actual_1 * (0 + i)) / 2))); + result_1 += result_1_; + + int result_2_ = 0; + for (int g = 0; g < num_gangs_actual_2; ++g) + { + IF_DEBUG + __builtin_printf ("results_2[%d][%d]: %d\n", i, g, results_2[i][g]); + result_2_ += results_2[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_2_: %d\n", i, result_2_); + assert (result_2_ == (((t2_var_init_2 + num_gangs_actual_2 * (1 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (1 + i)) / 2) + - ((t2_var_init_2 + num_gangs_actual_2 * (0 + i)) * (1 + t2_var_init_2 + num_gangs_actual_2 * (0 + i)) / 2))); + result_2 += result_2_; + + int result_3_ = 0; + for (int g = 0; g < num_gangs_actual_3; ++g) + { + IF_DEBUG + __builtin_printf ("results_3[%d][%d]: %d\n", i, g, results_3[i][g]); + result_3_ += results_3[i][g]; + } + IF_DEBUG + __builtin_printf ("%d result_3_: %d\n", i, result_3_); + assert (result_3_ == (((var_init_3 + num_gangs_actual_3 * (1 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (1 + i)) / 2) + - ((var_init_3 + num_gangs_actual_3 * (0 + i)) * (1 + var_init_3 + num_gangs_actual_3 * (0 + i)) / 2))); + result_3 += result_3_; + } + IF_DEBUG + __builtin_printf ("result_1: %d\n", result_1); + assert (result_1 == (((var_init_1 + num_gangs_actual_1 * i_limit) * (1 + var_init_1 + num_gangs_actual_1 * i_limit) / 2) + - (var_init_1 * (var_init_1 + 1) / 2))); + IF_DEBUG + __builtin_printf ("result_2: %d\n", result_2); + assert (result_2 == (((t2_var_init_2 + num_gangs_actual_2 * i_limit) * (1 + t2_var_init_2 + num_gangs_actual_2 * i_limit) / 2) + - (t2_var_init_2 * (t2_var_init_2 + 1) / 2))); + IF_DEBUG + __builtin_printf ("result_3: %d\n", result_3); + assert (result_3 == (((var_init_3 + num_gangs_actual_3 * i_limit) * (1 + var_init_3 + num_gangs_actual_3 * i_limit) / 2) + - (var_init_3 * (var_init_3 + 1) / 2))); +} + + +#pragma acc routine seq +__attribute__((noinline)) +static int pr84991_1_r_s(int n) +{ + static const int test[] = {1,2,3,4}; + return test[n]; +} + +static void pr84991_1(void) +{ + int n[1]; + n[0] = 3; +#pragma acc parallel copy(n) + { + n[0] = pr84991_1_r_s(n[0]); + } + assert(n[0] == 4); +} + + +static void pr84992_1(void) +{ + int n[1]; + n[0] = 3; +#pragma acc parallel copy(n) + { + static const int test[] = {1,2,3,4}; + n[0] = test[n[0]]; + } + assert(n[0] == 4); +} + + +int main(void) +{ + t0_c(); + + t0_r(); + + t1_c(); + + t1_r2(); + + t2(); + + pr84991_1(); + + pr84992_1(); + + return 0; +} -- 2.30.2