On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote: > Hi, > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results in ~60 > UNRESOLVED errors with following error message: > > lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS' > compilation terminated. > nvptx mkoffload: fatal error: > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 > exit status > compilation terminated. > > This behaviour can be reproduced with the following simple test-case with > -fopenmp -foffload=nvptx-none -mcpu=generic+sve2: > > #define N 1000 > int main () > { > int i; > int A[N] = {0}, B[N] = {0}; > > #pragma omp target map(i), map(tofrom: A), map(from: B) > #pragma omp simd > for (i = 0; i < N; i++) > A[i] = A[i] + B[i]; > return A[0]; > } > > omplower pass lowers the above loop to the following: > > D.4576 = .GOMP_USE_SIMT (); > if (D.4576 != 0) goto <D.4577>; else goto <D.4578>; > <D.4577>: > { > unsigned int D.4586; > unsigned int D.4587; > int D.4588; > void * simduid.5; > void * .omp_simt.6; > int D.4596; > _Bool D.4597; > int D.4598; > unsigned int D.4599; > int D.4600; > int D.4601; > int * D.4602; > int i [value-expr: D.4588]; > int i.0; > > simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588); > .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5); > D.4587 = 0; > i.0 = 0; > #pragma omp simd safelen(32) _simduid_(simduid.5) _simt_ > linear(i.0:1) linear(i:1) > for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1) > ... > } > goto <D.4579>; > <D.4578>: > { > unsigned int D.4603; > unsigned int D.4604; > int D.4605[0:POLY_INT_CST [15, 16]]; > void * simduid.7; > unsigned int D.4612; > int * D.4613; > int D.4614; > int i [value-expr: D.4605[D.4604]]; > int i.0; > > D.4604 = 0; > i.0 = 0; > #pragma omp simd safelen(POLY_INT_CST [16, 16]) > _simduid_(simduid.7) linear(i.0:1) linear(i:1) > ... > } > <D.4579>: > ... > > For offloading to SIMT based device like nvptx, scan_omp_simd duplicates > lowering of simd pragma into if-else where the if-part contains simt > code-path, > and else-part contains simd code-path. In lower_rec_simd_input_clauses, > max_vf is set to 16+16x for the above case as determined by omp_max_vf, > and that becomes length of the omp simd array: > int D.4605[0:POLY_INT_CST [15, 16]]; > > The issue here is that, the function containing above if-else condition gets > streamed out to LTO bytecode including the simd code-path and the omp simd > array, > whose domain is [0:POLY_INT_CST[15, 16]], and thus we get the above error > while streaming-in POLY_INT_CST in lto_input_ts_poly_tree_pointers on device > side. > > Note that, the simd code-path is essentially dead-code on nvptx, since > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and later > optimization passes (ccp2) > remove the dead-code path and unused omp simd arrays while compiling to > device. So in this case, we aren't really mapping POLY_INT_CST from host to > device, > but it gets streamed out to device as an artefact of omp simd lowering. > > I suppose a proper fix here would be to (somehow) defer lowering of omp > pragma simd after streaming out to device, so the device only sees simt > code-path, > and the host only sees simd code path ? Or perhaps clone each function in > offload region, one for host and one for SIMT device, and only stream the > device versions > to avoid streaming out host-specific IR changes ?
There is currently no way to have the host compiler query offload target capabilities so the only true fix is to delay OMP SIMD lowering to the target. Are we dependent on the early optimization pipeline being run on the host to produce the offload IL? There's some oddball OACC passes in pass_ipa_oacc. That said, I'd probably try to produce clones with unlowered IL and skip those clones from all processing from that point and resume in the offload compiler. > I thought of following approaches as workarounds: I don't think any workaround will fly in the end. Can't you simply force SVE to be off for offload clones on the host side and force OMP lowering with ADVSIMD only? Richard. > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in > lower_rec_simd_input_clauses, if the function is going to be offloaded > and omp_max_vf returns non-constant poly_int. For above case, it sets max_vf > to 16 instead of 16+16x which seems to resolve the issue, > but it'd use suboptimal max VF for host ? This is done in patch p-283-2.txt. > > However, with clean trunk it still seems to use max_vf = 16 after disabling > the above error. > vect dump shows: > > (compute_affine_dependence > ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51]; > ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27; > ) -> dependence analysis failed > foo.c:10:13: note: dependence distance = 0. > foo.c:10:13: note: dependence distance == 0 between (*_23)[i.0_51] and > (*_23)[i.0_51] > foo.c:10:13: missed: bad data dependence. > foo.c:10:13: note: ***** Analysis failed with vector mode VNx4SI > > This seems to happen because, loop->safelen is set to 16 by taking > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd: > > if (!poly_int_tree_p (safelen, &val)) > safelen_int = 0; > else > safelen_int = MIN (constant_lower_bound (val), INT_MAX); > > and fails to vectorize with VLA vectors, because max_vf == 16 and min_vf == > 4+4x resulting in bad data dependence due to: > > if (max_vf != MAX_VECTORIZATION_FACTOR > && maybe_lt (max_vf, min_vf)) > return opt_result::failure_at (vect_location, "bad data dependence.\n"); > > If safelen was (somehow) set to 16+16x, I guess it could have used VF=4+4x > and vectorized with VLA vectors. > but I suppose that's a separate issue ? > > [2] Since the issue seems to be only with streaming out length of omp simd > array when it's POLY_INT_CST, could we perhaps use a place holder length > during omp lowering and compute the correct length after streaming out, so > POLY_INT_CST doesn't get leaked into bytecode ? The attached patch > p-283-3.txt follows this approach by using bogus length INT_MAX in > lower_rec_simd_input_clauses if offloading to SIMT device and max_vf is > non-constant > poly_int, and later computing the correct length in beginning of vect pass by > setting it to omp_max_vf (), but I am not sure if this is entirely correct. > I am assuming that creating omp simd array of bogus length will not be an > issue for nvptx since it will never get referenced and eventually be removed > by remove_unused_locals ? If it'd not be a good idea to rely on the pass > pipeline to eliminate simd code-path and omp simd array while compiling to > device, > it could be possibly done during omp_lower_device pass itself ? > > [3] While streaming-in POLY_INT_CST, avoid emitting error immediately if > degree of POLY_INT_CST exceeds accel's NUM_POLY_INT_COEFFS to ignore > POLY_INT_CSTs that > may potentially occur on dead-code path, and instead mark it as > error_mark_node. For the above case, since POLY_INT_CST appears on dead-code > path, streaming > POLY_INT_CST with higher degree than accel's NUM_POLY_INT_COEFFS would be > "harmless". And detect invalid POLY_INT_CST's in expand pass (if it survives > till this > point), and emit above error, but not sure if that'd be the right place ? > This is done in p-283-4.txt. > > All the three patches fix UNRESOLVED tests due to POLY_INT_CST streaming > error in libgomp testsuite with -mcpu=generic+sve2. > (Altho it introduces a strange FAIL for data-5.f90, which I am investigating). > I would be grateful for suggestions on how to proceed. > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> > > Thanks, > Prathamesh > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)