> -----Original Message----- > From: Richard Biener <rguent...@suse.de> > Sent: 17 October 2024 19:18 > To: Prathamesh Kulkarni <prathame...@nvidia.com> > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com> > Subject: RE: [RFC] Enabling SVE with offloading to nvptx > > External email: Use caution opening links or attachments > > > On Thu, 17 Oct 2024, Prathamesh Kulkarni wrote: > > > > -----Original Message----- > > > From: Richard Biener <rguent...@suse.de> > > > Sent: 16 October 2024 13:05 > > > To: Prathamesh Kulkarni <prathame...@nvidia.com> > > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com> > > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx > > > > > > External email: Use caution opening links or attachments > > > > > > > > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote: > > > > > > > Hi, > > > > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results > in > > > > ~60 > > > UNRESOLVED errors with following error message: > > > > > > > > lto1: fatal error: degree of 'poly_int' exceeds > 'NUM_POLY_INT_COEFFS' > > > > compilation terminated. > > > > nvptx mkoffload: fatal error: > > > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc > > > returned 1 exit status compilation terminated. > > > > > > > > This behaviour can be reproduced with the following simple > > > > test-case > > > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2: > > > > > > > > #define N 1000 > > > > int main () > > > > { > > > > int i; > > > > int A[N] = {0}, B[N] = {0}; > > > > > > > > #pragma omp target map(i), map(tofrom: A), map(from: B) > > > > #pragma omp simd > > > > for (i = 0; i < N; i++) > > > > A[i] = A[i] + B[i]; > > > > return A[0]; > > > > } > > > > > > > > omplower pass lowers the above loop to the following: > > > > > > > > D.4576 = .GOMP_USE_SIMT (); > > > > if (D.4576 != 0) goto <D.4577>; else goto > <D.4578>; > > > > <D.4577>: > > > > { > > > > unsigned int D.4586; > > > > unsigned int D.4587; > > > > int D.4588; > > > > void * simduid.5; > > > > void * .omp_simt.6; > > > > int D.4596; > > > > _Bool D.4597; > > > > int D.4598; > > > > unsigned int D.4599; > > > > int D.4600; > > > > int D.4601; > > > > int * D.4602; > > > > int i [value-expr: D.4588]; > > > > int i.0; > > > > > > > > simduid.5 = .GOMP_SIMT_ENTER (simduid.5, > &D.4588); > > > > .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC > (simduid.5); > > > > D.4587 = 0; > > > > i.0 = 0; > > > > #pragma omp simd safelen(32) > > > > _simduid_(simduid.5) > > > _simt_ linear(i.0:1) linear(i:1) > > > > for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1) > > > > ... > > > > } > > > > goto <D.4579>; > > > > <D.4578>: > > > > { > > > > unsigned int D.4603; > > > > unsigned int D.4604; > > > > int D.4605[0:POLY_INT_CST [15, 16]]; > > > > void * simduid.7; > > > > unsigned int D.4612; > > > > int * D.4613; > > > > int D.4614; > > > > int i [value-expr: D.4605[D.4604]]; > > > > int i.0; > > > > > > > > D.4604 = 0; > > > > i.0 = 0; > > > > #pragma omp simd safelen(POLY_INT_CST [16, > 16]) > > > _simduid_(simduid.7) linear(i.0:1) linear(i:1) > > > > ... > > > > } > > > > <D.4579>: > > > > ... > > > > > > > > For offloading to SIMT based device like nvptx, scan_omp_simd > > > > duplicates lowering of simd pragma into if-else where the if- > part > > > > contains simt code-path, and else-part contains simd code-path. > In > > > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the > above > > > case as determined by omp_max_vf, and that becomes length of the > omp > > > simd > > > array: > > > > int D.4605[0:POLY_INT_CST [15, 16]]; > > > > > > > > The issue here is that, the function containing above if-else > > > > condition gets streamed out to LTO bytecode including the simd > > > > code- > > > path and the omp simd array, whose domain is [0:POLY_INT_CST[15, > > > 16]], and thus we get the above error while streaming-in > > > POLY_INT_CST in lto_input_ts_poly_tree_pointers on device side. > > > > > > > > Note that, the simd code-path is essentially dead-code on nvptx, > > > > since > > > > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and > > > > later optimization passes (ccp2) remove the dead-code path and > > > > unused omp > > > simd arrays while compiling to device. So in this case, we aren't > > > really mapping POLY_INT_CST from host to device, but it gets > > > streamed out to device as an artefact of omp simd lowering. > > > > > > > > I suppose a proper fix here would be to (somehow) defer lowering > > > > of omp pragma simd after streaming out to device, so the device > > > > only sees simt code-path, and the host only sees simd code path > ? > > > > Or perhaps > > > clone each function in offload region, one for host and one for > SIMT > > > device, and only stream the device versions to avoid streaming out > > > host- specific IR changes ? > > > > > > There is currently no way to have the host compiler query offload > > > target capabilities so the only true fix is to delay OMP SIMD > > > lowering to the target. > > Um, I thought we could use omp_max_simt_vf from host to query if the > offload target is SIMT ? > > The function essentially iterates thru OFFLOAD_TARGET_NAMES and > returns non-zero for nvptx. > > > > > > Are we dependent on the early optimization pipeline being run on > the > > > host to produce the offload IL? There's some oddball OACC passes > in > > > pass_ipa_oacc. > > > > > > That said, I'd probably try to produce clones with unlowered IL > and > > > skip those clones from all processing from that point and resume > in > > > the offload compiler. > > > > > > > I thought of following approaches as workarounds: > > > > > > I don't think any workaround will fly in the end. Can't you > simply > > > force SVE to be off for offload clones on the host side and force > > > OMP lowering with ADVSIMD only? > > Would it be correct to set: > > sctx.max_vf = constant_lower_bound (omp_max_vf ()) > > > > if function is offloaded and omp_max_vf returns non-constant > poly_int, > > to force max_vf to be VLS, which will avoid VLA vectorization as in > the attached patch ? > > > > Or should we modify autovectorize_vector_modes hook to return VLS > > modes for offloaded functions ? > > Can't you simply put a target(march=armv8.3a) (or even more basic ISA) > on the OMP target clones? Or is the lowering happening before > outlining? AFAIU, scan_omp_target in omp-lower creates a "child function" but doesn't outline it. The actual outlining seems to be happening during omp-expand pass.
> In that case maybe we want to switch the host target into "offload- > lowering-mode" (effectively switching to a basic ISA)? Sorry, I didn't understand -- Do you mean we should enforce basic ISA for all the functions including host-only ones ? Since the issue stems from having host details leaked into IL during omp-lowering (and omp-expand), I was wondering if we should use default versions of the hooks if offloading is enabled in omp-lower and omp-expand, which should avoid host details creeping into offload IL, sth like in the attached patch ? Thanks, Prathamesh > > > Thanks, > > Prathamesh > > > > > > Richard. > > > > > > > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in > > > > lower_rec_simd_input_clauses, if the function is going to be > > > > offloaded and omp_max_vf returns non-constant poly_int. For > above > > > > case, it sets > > > max_vf to 16 instead of 16+16x which seems to resolve the issue, > but > > > it'd use suboptimal max VF for host ? This is done in patch p-283- > 2.txt. > > > > > > > > However, with clean trunk it still seems to use max_vf = 16 > after > > > disabling the above error. > > > > vect dump shows: > > > > > > > > (compute_affine_dependence > > > > ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51]; > > > > ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27; > > > > ) -> dependence analysis failed > > > > foo.c:10:13: note: dependence distance = 0. > > > > foo.c:10:13: note: dependence distance == 0 between > (*_23)[i.0_51] > > > and (*_23)[i.0_51] > > > > foo.c:10:13: missed: bad data dependence. > > > > foo.c:10:13: note: ***** Analysis failed with vector mode > VNx4SI > > > > > > > > This seems to happen because, loop->safelen is set to 16 by > taking > > > > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd: > > > > > > > > if (!poly_int_tree_p (safelen, &val)) > > > > safelen_int = 0; > > > > else > > > > safelen_int = MIN (constant_lower_bound (val), INT_MAX); > > > > > > > > and fails to vectorize with VLA vectors, because max_vf == 16 > and > > > min_vf == 4+4x resulting in bad data dependence due to: > > > > > > > > if (max_vf != MAX_VECTORIZATION_FACTOR > > > > && maybe_lt (max_vf, min_vf)) > > > > return opt_result::failure_at (vect_location, "bad data > > > > dependence.\n"); > > > > > > > > If safelen was (somehow) set to 16+16x, I guess it could have > used > > > VF=4+4x and vectorized with VLA vectors. > > > > but I suppose that's a separate issue ? > > > > > > > > [2] Since the issue seems to be only with streaming out length > of > > > > omp simd array when it's POLY_INT_CST, could we perhaps use a > > > > place holder length during omp lowering and compute the correct > > > > length after streaming out, so POLY_INT_CST doesn't get leaked > > > > into bytecode ? The > > > attached patch p-283-3.txt follows this approach by using bogus > > > length INT_MAX in lower_rec_simd_input_clauses if offloading to > SIMT > > > device and max_vf is non-constant poly_int, and later computing > the > > > correct length in beginning of vect pass by setting it to > omp_max_vf > > > (), but I am not sure if this is entirely correct. > > > > I am assuming that creating omp simd array of bogus length will > > > > not be an issue for nvptx since it will never get referenced and > > > > eventually be removed by remove_unused_locals ? If it'd not be a > > > > good idea to > > > rely on the pass pipeline to eliminate simd code-path and omp simd > > > array while compiling to device, it could be possibly done during > > > omp_lower_device pass itself ? > > > > > > > > [3] While streaming-in POLY_INT_CST, avoid emitting error > > > > immediately if degree of POLY_INT_CST exceeds accel's > > > > NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that may potentially > > > > occur on dead-code path, and instead mark it as error_mark_node. > > > > For the above case, since > > > POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST > with > > > higher degree than accel's NUM_POLY_INT_COEFFS would be > "harmless". > > > And detect invalid POLY_INT_CST's in expand pass (if it survives > > > till this point), and emit above error, but not sure if that'd be > > > the right place ? > > > > This is done in p-283-4.txt. > > > > > > > > All the three patches fix UNRESOLVED tests due to POLY_INT_CST > > > streaming error in libgomp testsuite with -mcpu=generic+sve2. > > > > (Altho it introduces a strange FAIL for data-5.f90, which I am > > > investigating). > > > > I would be grateful for suggestions on how to proceed. > > > > > > > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> > > > > > > > > Thanks, > > > > Prathamesh > > > > > > > > > > -- > > > Richard Biener <rguent...@suse.de> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > > > Nuernberg, Germany; > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > Nuernberg) > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > Nuernberg)
Use default versions of target hooks during omplower and ompexpand when offloading. gcc/ * omp-expand.cc (omp_adjust_chunk_size ()): Call omp_maybe_offloaded and pass it's value to omp_max_vf. * omp-general.cc: Include targhooks.h. (omp_max_vf): New parameter offload. Call default_autovectorize_vector_modes and default_preferred_simd_mode if offload is true. (omp_maybe_offloaded): Export. * omp-general.h (omp_max_vf): New parameter offload with default value false. (omp_maybe_offloaded): Declare. * omp-low.cc: Include targhooks.h. (omp_clause_aligned_alignment): Call default_autovectorize_vector_modes and default_preferred_simd_mode if offload is true. (lower_rec_simd_input_clauses): Call omp_maybe_offloaded_ctx and pass it's value to omp_max_vf. (lower_rec_input_clauses): Call omp_maybe_offloaded_ctx and pass it's value to omp_clause_aligned_alignment. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc index 9ff9553c3ea..37ba36fb9b1 100644 --- a/gcc/omp-expand.cc +++ b/gcc/omp-expand.cc @@ -211,7 +211,7 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule) if (!simd_schedule || integer_zerop (chunk_size)) return chunk_size; - poly_uint64 vf = omp_max_vf (); + poly_uint64 vf = omp_max_vf (omp_maybe_offloaded ()); if (known_eq (vf, 1U)) return chunk_size; diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc index f4c5f577047..7a03fdcaf49 100644 --- a/gcc/omp-general.cc +++ b/gcc/omp-general.cc @@ -44,6 +44,7 @@ along with GCC; see the file COPYING3. If not see #include "streamer-hooks.h" #include "opts.h" #include "tree-pretty-print.h" +#include "targhooks.h" enum omp_requires omp_requires_mask; @@ -989,7 +990,7 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *data) /* Return maximum possible vectorization factor for the target. */ poly_uint64 -omp_max_vf (void) +omp_max_vf (bool offload) { if (!optimize || optimize_debug @@ -999,7 +1000,10 @@ omp_max_vf (void) return 1; auto_vector_modes modes; - targetm.vectorize.autovectorize_vector_modes (&modes, true); + if (offload) + default_autovectorize_vector_modes (&modes, true); + else + targetm.vectorize.autovectorize_vector_modes (&modes, true); if (!modes.is_empty ()) { poly_uint64 vf = 0; @@ -1011,7 +1015,9 @@ omp_max_vf (void) return vf; } - machine_mode vqimode = targetm.vectorize.preferred_simd_mode (QImode); + machine_mode vqimode + = (offload) ? default_preferred_simd_mode (QImode) + : targetm.vectorize.preferred_simd_mode (QImode); if (GET_MODE_CLASS (vqimode) == MODE_VECTOR_INT) return GET_MODE_NUNITS (vqimode); @@ -1107,7 +1113,7 @@ omp_offload_device_kind_arch_isa (const char *props, const char *prop) Return true in declare target functions, or when nested in a target region or when unsure, return false otherwise. */ -static bool +bool omp_maybe_offloaded (void) { if (!ENABLE_OFFLOADING) diff --git a/gcc/omp-general.h b/gcc/omp-general.h index 891f467556e..6cf0829d13b 100644 --- a/gcc/omp-general.h +++ b/gcc/omp-general.h @@ -162,7 +162,7 @@ extern void omp_extract_for_data (gomp_for *for_stmt, struct omp_for_data *fd, struct omp_for_data_loop *loops); extern gimple *omp_build_barrier (tree lhs); extern tree find_combined_omp_for (tree *, int *, void *); -extern poly_uint64 omp_max_vf (void); +extern poly_uint64 omp_max_vf (bool offload=false); extern int omp_max_simt_vf (void); extern const char *omp_context_name_list_prop (tree); extern void omp_construct_traits_to_codes (tree, int, enum tree_code *); @@ -186,6 +186,7 @@ extern tree oacc_get_fn_attrib (tree fn); extern bool offloading_function_p (tree fn); extern int oacc_get_fn_dim_size (tree fn, int axis); extern int oacc_get_ifn_dim_arg (const gimple *stmt); +extern bool omp_maybe_offloaded (void); enum omp_requires { OMP_REQUIRES_ATOMIC_DEFAULT_MEM_ORDER = 0xf, diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index da2051b0279..2c09bcebee1 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -60,6 +60,7 @@ along with GCC; see the file COPYING3. If not see #include "stringpool.h" #include "attribs.h" #include "omp-offload.h" +#include "targhooks.h" /* Lowering of OMP parallel and workshare constructs proceeds in two phases. The first phase scans the function looking for OMP statements @@ -4523,7 +4524,7 @@ omp_reduction_init (tree clause, tree type) OMP_CLAUSE_ALIGNED. */ static tree -omp_clause_aligned_alignment (tree clause) +omp_clause_aligned_alignment (tree clause, bool offload) { if (OMP_CLAUSE_ALIGNED_ALIGNMENT (clause)) return OMP_CLAUSE_ALIGNED_ALIGNMENT (clause); @@ -4532,7 +4533,11 @@ omp_clause_aligned_alignment (tree clause) unsigned int al = 1; opt_scalar_mode mode_iter; auto_vector_modes modes; - targetm.vectorize.autovectorize_vector_modes (&modes, true); + if (offload) + default_autovectorize_vector_modes (&modes, true); + else + targetm.vectorize.autovectorize_vector_modes (&modes, true); + static enum mode_class classes[] = { MODE_INT, MODE_VECTOR_INT, MODE_FLOAT, MODE_VECTOR_FLOAT }; for (int i = 0; i < 4; i += 2) @@ -4540,7 +4545,9 @@ omp_clause_aligned_alignment (tree clause) FOR_EACH_MODE_IN_CLASS (mode_iter, classes[i]) { scalar_mode mode = mode_iter.require (); - machine_mode vmode = targetm.vectorize.preferred_simd_mode (mode); + machine_mode vmode + = (offload) ? default_preferred_simd_mode (mode) + : targetm.vectorize.preferred_simd_mode (mode); if (GET_MODE_CLASS (vmode) != classes[i + 1]) continue; machine_mode alt_vmode; @@ -4588,7 +4595,8 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, { if (known_eq (sctx->max_vf, 0U)) { - sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf (); + bool offload = omp_maybe_offloaded_ctx (ctx); + sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf (offload); if (maybe_gt (sctx->max_vf, 1U)) { tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt), @@ -5106,7 +5114,8 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist, if (new_var == NULL_TREE) new_var = maybe_lookup_decl_in_outer_ctx (var, ctx); x = builtin_decl_explicit (BUILT_IN_ASSUME_ALIGNED); - tree alarg = omp_clause_aligned_alignment (c); + bool offload = omp_maybe_offloaded_ctx (ctx); + tree alarg = omp_clause_aligned_alignment (c, offload); alarg = fold_convert_loc (clause_loc, size_type_node, alarg); x = build_call_expr_loc (clause_loc, x, 2, new_var, alarg); x = fold_convert_loc (clause_loc, TREE_TYPE (new_var), x); @@ -5121,7 +5130,8 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist, t = maybe_lookup_decl_in_outer_ctx (var, ctx); t = build_fold_addr_expr_loc (clause_loc, t); t2 = builtin_decl_explicit (BUILT_IN_ASSUME_ALIGNED); - tree alarg = omp_clause_aligned_alignment (c); + bool offload = omp_maybe_offloaded_ctx (ctx); + tree alarg = omp_clause_aligned_alignment (c, offload); alarg = fold_convert_loc (clause_loc, size_type_node, alarg); t = build_call_expr_loc (clause_loc, t2, 2, t, alarg); t = fold_convert_loc (clause_loc, ptype, t);