> -----Original Message----- > From: Richard Biener <rguent...@suse.de> > Sent: 16 October 2024 13:05 > To: Prathamesh Kulkarni <prathame...@nvidia.com> > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com> > Subject: Re: [RFC] Enabling SVE with offloading to nvptx > > External email: Use caution opening links or attachments > > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote: > > > Hi, > > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results in ~60 > UNRESOLVED errors with following error message: > > > > lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS' > > compilation terminated. > > nvptx mkoffload: fatal error: > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc > returned 1 exit status compilation terminated. > > > > This behaviour can be reproduced with the following simple test-case > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2: > > > > #define N 1000 > > int main () > > { > > int i; > > int A[N] = {0}, B[N] = {0}; > > > > #pragma omp target map(i), map(tofrom: A), map(from: B) > > #pragma omp simd > > for (i = 0; i < N; i++) > > A[i] = A[i] + B[i]; > > return A[0]; > > } > > > > omplower pass lowers the above loop to the following: > > > > D.4576 = .GOMP_USE_SIMT (); > > if (D.4576 != 0) goto <D.4577>; else goto <D.4578>; > > <D.4577>: > > { > > unsigned int D.4586; > > unsigned int D.4587; > > int D.4588; > > void * simduid.5; > > void * .omp_simt.6; > > int D.4596; > > _Bool D.4597; > > int D.4598; > > unsigned int D.4599; > > int D.4600; > > int D.4601; > > int * D.4602; > > int i [value-expr: D.4588]; > > int i.0; > > > > simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588); > > .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5); > > D.4587 = 0; > > i.0 = 0; > > #pragma omp simd safelen(32) _simduid_(simduid.5) > _simt_ linear(i.0:1) linear(i:1) > > for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1) > > ... > > } > > goto <D.4579>; > > <D.4578>: > > { > > unsigned int D.4603; > > unsigned int D.4604; > > int D.4605[0:POLY_INT_CST [15, 16]]; > > void * simduid.7; > > unsigned int D.4612; > > int * D.4613; > > int D.4614; > > int i [value-expr: D.4605[D.4604]]; > > int i.0; > > > > D.4604 = 0; > > i.0 = 0; > > #pragma omp simd safelen(POLY_INT_CST [16, 16]) > _simduid_(simduid.7) linear(i.0:1) linear(i:1) > > ... > > } > > <D.4579>: > > ... > > > > For offloading to SIMT based device like nvptx, scan_omp_simd > > duplicates lowering of simd pragma into if-else where the if-part > > contains simt code-path, and else-part contains simd code-path. In > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the above case > as determined by omp_max_vf, and that becomes length of the omp simd > array: > > int D.4605[0:POLY_INT_CST [15, 16]]; > > > > The issue here is that, the function containing above if-else > > condition gets streamed out to LTO bytecode including the simd code- > path and the omp simd array, whose domain is [0:POLY_INT_CST[15, 16]], > and thus we get the above error while streaming-in POLY_INT_CST in > lto_input_ts_poly_tree_pointers on device side. > > > > Note that, the simd code-path is essentially dead-code on nvptx, since > > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and later > > optimization passes (ccp2) remove the dead-code path and unused omp > simd arrays while compiling to device. So in this case, we aren't really > mapping POLY_INT_CST from host to device, but it gets streamed out to > device as an artefact of omp simd lowering. > > > > I suppose a proper fix here would be to (somehow) defer lowering of > > omp pragma simd after streaming out to device, so the device only sees > > simt code-path, and the host only sees simd code path ? Or perhaps > clone each function in offload region, one for host and one for SIMT > device, and only stream the device versions to avoid streaming out host- > specific IR changes ? > > There is currently no way to have the host compiler query offload target > capabilities so the only true fix is to delay OMP SIMD lowering to the > target. Um, I thought we could use omp_max_simt_vf from host to query if the offload target is SIMT ? The function essentially iterates thru OFFLOAD_TARGET_NAMES and returns non-zero for nvptx. > > Are we dependent on the early optimization pipeline being run on the > host to produce the offload IL? There's some oddball OACC passes in > pass_ipa_oacc. > > That said, I'd probably try to produce clones with unlowered IL and skip > those clones from all processing from that point and resume in the > offload compiler. > > > I thought of following approaches as workarounds: > > I don't think any workaround will fly in the end. Can't you simply > force SVE to be off for offload clones on the host side and force OMP > lowering with ADVSIMD only? Would it be correct to set: sctx.max_vf = constant_lower_bound (omp_max_vf ())
if function is offloaded and omp_max_vf returns non-constant poly_int, to force max_vf to be VLS, which will avoid VLA vectorization as in the attached patch ? Or should we modify autovectorize_vector_modes hook to return VLS modes for offloaded functions ? Thanks, Prathamesh > > Richard. > > > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in > > lower_rec_simd_input_clauses, if the function is going to be offloaded > > and omp_max_vf returns non-constant poly_int. For above case, it sets > max_vf to 16 instead of 16+16x which seems to resolve the issue, but > it'd use suboptimal max VF for host ? This is done in patch p-283-2.txt. > > > > However, with clean trunk it still seems to use max_vf = 16 after > disabling the above error. > > vect dump shows: > > > > (compute_affine_dependence > > ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51]; > > ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27; > > ) -> dependence analysis failed > > foo.c:10:13: note: dependence distance = 0. > > foo.c:10:13: note: dependence distance == 0 between (*_23)[i.0_51] > and (*_23)[i.0_51] > > foo.c:10:13: missed: bad data dependence. > > foo.c:10:13: note: ***** Analysis failed with vector mode VNx4SI > > > > This seems to happen because, loop->safelen is set to 16 by taking > > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd: > > > > if (!poly_int_tree_p (safelen, &val)) > > safelen_int = 0; > > else > > safelen_int = MIN (constant_lower_bound (val), INT_MAX); > > > > and fails to vectorize with VLA vectors, because max_vf == 16 and > min_vf == 4+4x resulting in bad data dependence due to: > > > > if (max_vf != MAX_VECTORIZATION_FACTOR > > && maybe_lt (max_vf, min_vf)) > > return opt_result::failure_at (vect_location, "bad data > > dependence.\n"); > > > > If safelen was (somehow) set to 16+16x, I guess it could have used > VF=4+4x and vectorized with VLA vectors. > > but I suppose that's a separate issue ? > > > > [2] Since the issue seems to be only with streaming out length of omp > > simd array when it's POLY_INT_CST, could we perhaps use a place holder > > length during omp lowering and compute the correct length after > > streaming out, so POLY_INT_CST doesn't get leaked into bytecode ? The > attached patch p-283-3.txt follows this approach by using bogus length > INT_MAX in lower_rec_simd_input_clauses if offloading to SIMT device and > max_vf is non-constant poly_int, and later computing the correct length > in beginning of vect pass by setting it to omp_max_vf (), but I am not > sure if this is entirely correct. > > I am assuming that creating omp simd array of bogus length will not be > > an issue for nvptx since it will never get referenced and eventually > > be removed by remove_unused_locals ? If it'd not be a good idea to > rely on the pass pipeline to eliminate simd code-path and omp simd array > while compiling to device, it could be possibly done during > omp_lower_device pass itself ? > > > > [3] While streaming-in POLY_INT_CST, avoid emitting error immediately > > if degree of POLY_INT_CST exceeds accel's NUM_POLY_INT_COEFFS to > > ignore POLY_INT_CSTs that may potentially occur on dead-code path, and > > instead mark it as error_mark_node. For the above case, since > POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST with > higher degree than accel's NUM_POLY_INT_COEFFS would be "harmless". And > detect invalid POLY_INT_CST's in expand pass (if it survives till this > point), and emit above error, but not sure if that'd be the right place > ? > > This is done in p-283-4.txt. > > > > All the three patches fix UNRESOLVED tests due to POLY_INT_CST > streaming error in libgomp testsuite with -mcpu=generic+sve2. > > (Altho it introduces a strange FAIL for data-5.f90, which I am > investigating). > > I would be grateful for suggestions on how to proceed. > > > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> > > > > Thanks, > > Prathamesh > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > Nuernberg)
Set max_vf to constant lower bound if max_vf is poly_int and offloading to device. gcc/ * omp-low.cc (lower_rec_simd_input_clauses): Set max_vf to constant_lower_bound (omp_max_vf()) if offloading is enabled and max_vf is POLY_INT_CST. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index da2051b0279..780ea396b7f 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -4588,7 +4588,20 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, { if (known_eq (sctx->max_vf, 0U)) { - sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf (); + if (sctx->is_simt) + sctx->max_vf = omp_max_simt_vf (); + else + { + poly_uint64 max_vf = omp_max_vf (); + /* FIXME: If the function is going to be offloaded, + and max_vf is poly_int, use constant_lower_bound as safelen, + to avoid streaming out omp simd arrays having poly_int_cst + size. */ + if (omp_maybe_offloaded_ctx (ctx) + && !max_vf.is_constant ()) + max_vf = constant_lower_bound (max_vf); + sctx->max_vf = max_vf; + } if (maybe_gt (sctx->max_vf, 1U)) { tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),