On Mon, Oct 28, 2024 at 1:52 PM Prathamesh Kulkarni via Gcc <gcc@gcc.gnu.org> wrote: > > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: 21 October 2024 12:45 > > To: Prathamesh Kulkarni <prathame...@nvidia.com> > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>; Jakub > > Jelinek <ja...@redhat.com> > > Subject: RE: [RFC] Enabling SVE with offloading to nvptx > > > > External email: Use caution opening links or attachments > > > > > > On Fri, 18 Oct 2024, Prathamesh Kulkarni wrote: > > > > > > > > > > > > -----Original Message----- > > > > From: Richard Biener <rguent...@suse.de> > > > > Sent: 17 October 2024 19:18 > > > > To: Prathamesh Kulkarni <prathame...@nvidia.com> > > > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com> > > > > Subject: RE: [RFC] Enabling SVE with offloading to nvptx > > > > > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > On Thu, 17 Oct 2024, Prathamesh Kulkarni wrote: > > > > > > > > > > -----Original Message----- > > > > > > From: Richard Biener <rguent...@suse.de> > > > > > > Sent: 16 October 2024 13:05 > > > > > > To: Prathamesh Kulkarni <prathame...@nvidia.com> > > > > > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com> > > > > > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx > > > > > > > > > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > > > > > > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote: > > > > > > > > > > > > > Hi, > > > > > > > Testing libgomp with SVE enabled (-mcpu=generic+sve2), > > results > > > > in > > > > > > > ~60 > > > > > > UNRESOLVED errors with following error message: > > > > > > > > > > > > > > lto1: fatal error: degree of 'poly_int' exceeds > > > > 'NUM_POLY_INT_COEFFS' > > > > > > > compilation terminated. > > > > > > > nvptx mkoffload: fatal error: > > > > > > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx- > > none-g > > > > > > > cc > > > > > > returned 1 exit status compilation terminated. > > > > > > > > > > > > > > This behaviour can be reproduced with the following simple > > > > > > > test-case > > > > > > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2: > > > > > > > > > > > > > > #define N 1000 > > > > > > > int main () > > > > > > > { > > > > > > > int i; > > > > > > > int A[N] = {0}, B[N] = {0}; > > > > > > > > > > > > > > #pragma omp target map(i), map(tofrom: A), map(from: B) > > > > > > > #pragma omp simd > > > > > > > for (i = 0; i < N; i++) > > > > > > > A[i] = A[i] + B[i]; > > > > > > > return A[0]; > > > > > > > } > > > > > > > > > > > > > > omplower pass lowers the above loop to the following: > > > > > > > > > > > > > > D.4576 = .GOMP_USE_SIMT (); > > > > > > > if (D.4576 != 0) goto <D.4577>; else goto > > > > <D.4578>; > > > > > > > <D.4577>: > > > > > > > { > > > > > > > unsigned int D.4586; > > > > > > > unsigned int D.4587; > > > > > > > int D.4588; > > > > > > > void * simduid.5; > > > > > > > void * .omp_simt.6; > > > > > > > int D.4596; > > > > > > > _Bool D.4597; > > > > > > > int D.4598; > > > > > > > unsigned int D.4599; > > > > > > > int D.4600; > > > > > > > int D.4601; > > > > > > > int * D.4602; > > > > > > > int i [value-expr: D.4588]; > > > > > > > int i.0; > > > > > > > > > > > > > > simduid.5 = .GOMP_SIMT_ENTER (simduid.5, > > > > &D.4588); > > > > > > > .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC > > > > (simduid.5); > > > > > > > D.4587 = 0; > > > > > > > i.0 = 0; > > > > > > > #pragma omp simd safelen(32) > > > > > > > _simduid_(simduid.5) > > > > > > _simt_ linear(i.0:1) linear(i:1) > > > > > > > for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1) > > > > > > > ... > > > > > > > } > > > > > > > goto <D.4579>; > > > > > > > <D.4578>: > > > > > > > { > > > > > > > unsigned int D.4603; > > > > > > > unsigned int D.4604; > > > > > > > int D.4605[0:POLY_INT_CST [15, 16]]; > > > > > > > void * simduid.7; > > > > > > > unsigned int D.4612; > > > > > > > int * D.4613; > > > > > > > int D.4614; > > > > > > > int i [value-expr: D.4605[D.4604]]; > > > > > > > int i.0; > > > > > > > > > > > > > > D.4604 = 0; > > > > > > > i.0 = 0; > > > > > > > #pragma omp simd safelen(POLY_INT_CST [16, > > > > 16]) > > > > > > _simduid_(simduid.7) linear(i.0:1) linear(i:1) > > > > > > > ... > > > > > > > } > > > > > > > <D.4579>: > > > > > > > ... > > > > > > > > > > > > > > For offloading to SIMT based device like nvptx, > > scan_omp_simd > > > > > > > duplicates lowering of simd pragma into if-else where the > > if- > > > > part > > > > > > > contains simt code-path, and else-part contains simd code- > > path. > > > > In > > > > > > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the > > > > above > > > > > > case as determined by omp_max_vf, and that becomes length of > > the > > > > omp > > > > > > simd > > > > > > array: > > > > > > > int D.4605[0:POLY_INT_CST [15, 16]]; > > > > > > > > > > > > > > The issue here is that, the function containing above if- > > else > > > > > > > condition gets streamed out to LTO bytecode including the > > simd > > > > > > > code- > > > > > > path and the omp simd array, whose domain is > > [0:POLY_INT_CST[15, > > > > > > 16]], and thus we get the above error while streaming-in > > > > > > POLY_INT_CST in lto_input_ts_poly_tree_pointers on device > > side. > > > > > > > > > > > > > > Note that, the simd code-path is essentially dead-code on > > > > > > > nvptx, since > > > > > > > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, > > > > > > > and later optimization passes (ccp2) remove the dead-code > > path > > > > > > > and unused omp > > > > > > simd arrays while compiling to device. So in this case, we > > > > > > aren't really mapping POLY_INT_CST from host to device, but it > > > > > > gets streamed out to device as an artefact of omp simd > > lowering. > > > > > > > > > > > > > > I suppose a proper fix here would be to (somehow) defer > > > > > > > lowering of omp pragma simd after streaming out to device, > > so > > > > > > > the device only sees simt code-path, and the host only sees > > > > > > > simd code path > > > > ? > > > > > > > Or perhaps > > > > > > clone each function in offload region, one for host and one > > for > > > > SIMT > > > > > > device, and only stream the device versions to avoid streaming > > > > > > out > > > > > > host- specific IR changes ? > > > > > > > > > > > > There is currently no way to have the host compiler query > > > > > > offload target capabilities so the only true fix is to delay > > OMP > > > > > > SIMD lowering to the target. > > > > > Um, I thought we could use omp_max_simt_vf from host to query if > > > > > the > > > > offload target is SIMT ? > > > > > The function essentially iterates thru OFFLOAD_TARGET_NAMES and > > > > returns non-zero for nvptx. > > > > > > > > > > > > Are we dependent on the early optimization pipeline being run > > on > > > > the > > > > > > host to produce the offload IL? There's some oddball OACC > > > > > > passes > > > > in > > > > > > pass_ipa_oacc. > > > > > > > > > > > > That said, I'd probably try to produce clones with unlowered > > IL > > > > and > > > > > > skip those clones from all processing from that point and > > resume > > > > in > > > > > > the offload compiler. > > > > > > > > > > > > > I thought of following approaches as workarounds: > > > > > > > > > > > > I don't think any workaround will fly in the end. Can't you > > > > simply > > > > > > force SVE to be off for offload clones on the host side and > > > > > > force OMP lowering with ADVSIMD only? > > > > > Would it be correct to set: > > > > > sctx.max_vf = constant_lower_bound (omp_max_vf ()) > > > > > > > > > > if function is offloaded and omp_max_vf returns non-constant > > > > poly_int, > > > > > to force max_vf to be VLS, which will avoid VLA vectorization as > > > > > in > > > > the attached patch ? > > > > > > > > > > Or should we modify autovectorize_vector_modes hook to return > > VLS > > > > > modes for offloaded functions ? > > > > > > > > Can't you simply put a target(march=armv8.3a) (or even more basic > > > > ISA) on the OMP target clones? Or is the lowering happening > > before > > > > outlining? > > > AFAIU, scan_omp_target in omp-lower creates a "child function" but > > doesn't outline it. > > > The actual outlining seems to be happening during omp-expand pass. > > > > > > > In that case maybe we want to switch the host target into > > "offload- > > > > lowering-mode" (effectively switching to a basic ISA)? > > > Sorry, I didn't understand -- Do you mean we should enforce basic > > ISA for all the functions including host-only ones ? > > > > No, not for host-only. > > > > > Since the issue stems from having host details leaked into IL during > > > omp-lowering (and omp-expand), I was wondering if we should use > > > default versions of the hooks if offloading is enabled in omp-lower > > and omp-expand, which should avoid host details creeping into offload > > IL, sth like in the attached patch ? > > > > Ick. No. I think we should avoid lowering parts applied when the > > pieces are not outlined yet and ensure those offload functions have a > > "basic ISA" (which the target needs to define) - the target already > > has control over the function decl built (IIRC) so it can attach > > required target attributes (and we can ensure those are dropped on the > > offload compiler side again). > Hi Richard, > Thanks for the suggestions! > > From what I understand, omp expansion proceeds in depth-first order, by > expanding innermost regions first > and then progressively outlining them. I tried disabling SIMD lowering during > omp-lowering but that triggered > several errors during omp-expand, and didn't attempt to pursue it further. > > I briefly experimented with adding a new target hook offload_mode, that will > set host ISA to "offload mode", > for current_function_decl (function containing offload region) when starting > to lower offload region, and reset the host ISA on > current_function_decl after leaving it (essentially enabling/disabling > TARGET_FL_SVE in isa_flags in AArch64 backend), but > not sure if that was entirely correct, and dropped it. > > > > The alternative would be to go with something along your patch in this > > mail but do > > > > poly_uint64 > > -omp_max_vf (void) > > +omp_max_vf (bool offload) > > { > > ... > > if (offload) > > get_me_magic_value_for_configured_offload_device () > > > > but iff we're doing the lowering before outlining then if we have > > configured two different offload devices we'll have to come up with a > > max_vf that's suitable for all offload devices? I think this shows a > > least this part of the lowering is done in the wrong place? > > > > Maybe we can have the max_vf "symbolic"? Like .IFN_MAX_VF () and > > lower that on the offload side only? > AFAIU, there are three things that depend on max_vf: > (1) Setting loop->safelen > (2) Setting length of omp simd arrays > (3) Computing chunk_size for schedule clause (with simd modifier) > > We can't use result of an internal function for loop->safelen since it's a > compile-time artefact.
True. > And for array length, (at-least) sra pass seems to assume that > TYPE_MIN/TYPE_MAX are INTEGER_CST. > From prepare_iteration_over_array_elts: > > tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (type)); > gcc_assert (TREE_CODE (minidx) == INTEGER_CST); > tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (type)); > /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1. */ > if (!maxidx) > return false; > gcc_assert (TREE_CODE (maxidx) == INTEGER_CST); The issue might be that local vars cannot be "VLA", instead they would be lowered to be allocated by alloca(). That means the assertion is technically correct. > The attached patch: > (a) Uses a placeholder value (INT_MAX) for max_vf which gets assigned to > loop->safelen and length of omp simd array if offloading > is enabled. I think it's better to identify the loop that needs "max_vf lowering" with a new flag and set a conservative value to max_vf as it could be interpreted and used for invalid optimization otherwise. For the SIMD array size I have no good suggestions - the uses are very constrained though, so I suspect any magic value that's not 0 or 1 might work. > (b) For computing chunk_size, using INT_MAX resulted in chunk_size being > constant propagated (and harder to recover later), > so I added a new internal function .GOMP_SIMD_MAX_VF, whose result is > assigned to chunk_size. I guess that's OK. > (c) Adds a new pass pass_omp_adjust_max_vf, just before omp_device_lower, to > adjust above 3 artefacts to correctly adjust max_vf, > and fold away .GOMP_SIMD_MAX_VF (or should I piggy back it on some other > pass?) It does look related enough to pass_omp_device_lower, no? I hope Jakub can chime in a bit. > For the following contrived test: > > #include <stdlib.h> > > #define N 1000 > int A[N]; > int B[N]; > > int main() > { > int i; > int sum = 0; > > #pragma omp target map(sum), map(A), map(B), map (i) > #pragma omp teams distribute parallel for simd reduction(+:sum) > schedule(simd:static, 5) > for (i = 0; i < N; i++) > sum += A[i] * B[i]; > return sum; > } > > With patch, omp expand dump shows length of omp simd arrays set to INT_MAX > with offloading > enabled: > int D.5382[2147483647]; > int D.5378[2147483647]; > > and following computation for chunk_size: > D.5353 = .GOMP_SIMD_MAX_VF (); > D.5354 = D.5353 + 4; > D.5355 = .GOMP_SIMD_MAX_VF (); > D.5356 = -D.5355; > D.5357 = D.5354 & D.5356; > D.5358 = D.5348 - D.5347; > > And after omp_adjust_max_vf pass, the dump shows correct max_vf assigned to > length of > omp simd array and chunk_size on host side: > > int D.5382[0:POLY_INT_CST [15, 16]]; > int D.5378[0:POLY_INT_CST [15, 16]]; > ... > _38 = POLY_INT_CST [16, 16]; > _39 = _38 + 4; > _40 = POLY_INT_CST [16, 16]; > _41 = -_40; > _42 = _39 & _41; > _43 = _35 - _34; > > and would make similar adjustments for SIMD based devices. > For SIMT devices, the patch explicitly sets max_vf to 1, to fold > .GOMP_SIMD_MAX_VF and shrink omp simd array > (altho I guess it doesn't really matter since the simd code-path would be > dead-code?) > > I had a couple of questions: > > (1) With patch, I am seeing a lot of errors -- "multiple dump files found" > For eg: > libgomp.c++/../libgomp.c-c++-common/target-is-initial-host-2.c: multiple dump > files found > UNRESOLVED: libgomp.c++/../libgomp.c-c++-common/target-is-initial-host-2.c > scan-nvptx-none-offload-tree-dump-times optimized "omp_is_initial_device" 1 > > The error seems to come from scandump.exp:glob-dump-file: > > if { $num_files > 1 } { > verbose -log "$testcase: multiple dump files found" > } > > This seems to happen because the compiler is passed: -fdump-tree-optimized > -foffload-options=-fdump-tree-optimized, which results > in two optimized dump files. I am not sure tho, why the patch specifically > triggers this ? I think the general scan-tree-dump* do not work here, IIRC there was work to add offload and lto dump scan variants. > (2) To check if we're in offload region in omp_adjust_chunk_size, the patch > defines a new function enclosing_target_region_p, > and keeps walking region->outer till it reaches GIMPLE_OMP_TARGET region (or > NULL). While it seems to work, I was wondering if there was a better > way to do this ? Another option would be to keep a static variable > target_nesting_level, which is incremented/decremented > before/after each call to omp_expand_target, similar to one in omp-lower.cc ? > I tried using omp_maybe_offloaded from omp_adjust_chunk_size, but that > didn't seem to work. > > Does the patch look in the right direction ? I think yes, but I lack the overall OMP lowering picture here (and still think we should eventually delay offload target lowering to the offload compile and only outline regions on the host) Richard. > Thanks, > Prathamesh > > > > Richard. > > > > > > > Thanks, > > > Prathamesh > > > > > > > > > Thanks, > > > > > Prathamesh > > > > > > > > > > > > Richard. > > > > > > > > > > > > > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) > > in > > > > > > > lower_rec_simd_input_clauses, if the function is going to be > > > > > > > offloaded and omp_max_vf returns non-constant poly_int. For > > > > above > > > > > > > case, it sets > > > > > > max_vf to 16 instead of 16+16x which seems to resolve the > > issue, > > > > but > > > > > > it'd use suboptimal max VF for host ? This is done in patch > > > > > > p-283- > > > > 2.txt. > > > > > > > > > > > > > > However, with clean trunk it still seems to use max_vf = 16 > > > > after > > > > > > disabling the above error. > > > > > > > vect dump shows: > > > > > > > > > > > > > > (compute_affine_dependence > > > > > > > ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51]; > > > > > > > ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27; > > > > > > > ) -> dependence analysis failed > > > > > > > foo.c:10:13: note: dependence distance = 0. > > > > > > > foo.c:10:13: note: dependence distance == 0 between > > > > (*_23)[i.0_51] > > > > > > and (*_23)[i.0_51] > > > > > > > foo.c:10:13: missed: bad data dependence. > > > > > > > foo.c:10:13: note: ***** Analysis failed with vector mode > > > > VNx4SI > > > > > > > > > > > > > > This seems to happen because, loop->safelen is set to 16 by > > > > taking > > > > > > > MIN(constant_lower_bound(16+16x), INT_MAX) in > > expand_omp_simd: > > > > > > > > > > > > > > if (!poly_int_tree_p (safelen, &val)) > > > > > > > safelen_int = 0; > > > > > > > else > > > > > > > safelen_int = MIN (constant_lower_bound (val), > > > > > > > INT_MAX); > > > > > > > > > > > > > > and fails to vectorize with VLA vectors, because max_vf == > > 16 > > > > and > > > > > > min_vf == 4+4x resulting in bad data dependence due to: > > > > > > > > > > > > > > if (max_vf != MAX_VECTORIZATION_FACTOR > > > > > > > && maybe_lt (max_vf, min_vf)) > > > > > > > return opt_result::failure_at (vect_location, "bad data > > > > > > > dependence.\n"); > > > > > > > > > > > > > > If safelen was (somehow) set to 16+16x, I guess it could > > have > > > > used > > > > > > VF=4+4x and vectorized with VLA vectors. > > > > > > > but I suppose that's a separate issue ? > > > > > > > > > > > > > > [2] Since the issue seems to be only with streaming out > > length > > > > of > > > > > > > omp simd array when it's POLY_INT_CST, could we perhaps use > > a > > > > > > > place holder length during omp lowering and compute the > > > > > > > correct length after streaming out, so POLY_INT_CST doesn't > > > > > > > get leaked into bytecode ? The > > > > > > attached patch p-283-3.txt follows this approach by using > > bogus > > > > > > length INT_MAX in lower_rec_simd_input_clauses if offloading > > to > > > > SIMT > > > > > > device and max_vf is non-constant poly_int, and later > > computing > > > > the > > > > > > correct length in beginning of vect pass by setting it to > > > > omp_max_vf > > > > > > (), but I am not sure if this is entirely correct. > > > > > > > I am assuming that creating omp simd array of bogus length > > > > > > > will not be an issue for nvptx since it will never get > > > > > > > referenced and eventually be removed by remove_unused_locals > > ? > > > > > > > If it'd not be a good idea to > > > > > > rely on the pass pipeline to eliminate simd code-path and omp > > > > > > simd array while compiling to device, it could be possibly > > done > > > > > > during omp_lower_device pass itself ? > > > > > > > > > > > > > > [3] While streaming-in POLY_INT_CST, avoid emitting error > > > > > > > immediately if degree of POLY_INT_CST exceeds accel's > > > > > > > NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that may > > > > > > > potentially occur on dead-code path, and instead mark it as > > error_mark_node. > > > > > > > For the above case, since > > > > > > POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST > > > > with > > > > > > higher degree than accel's NUM_POLY_INT_COEFFS would be > > > > "harmless". > > > > > > And detect invalid POLY_INT_CST's in expand pass (if it > > survives > > > > > > till this point), and emit above error, but not sure if that'd > > > > > > be the right place ? > > > > > > > This is done in p-283-4.txt. > > > > > > > > > > > > > > All the three patches fix UNRESOLVED tests due to > > POLY_INT_CST > > > > > > streaming error in libgomp testsuite with -mcpu=generic+sve2. > > > > > > > (Altho it introduces a strange FAIL for data-5.f90, which I > > am > > > > > > investigating). > > > > > > > I would be grateful for suggestions on how to proceed. > > > > > > > > > > > > > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> > > > > > > > > > > > > > > Thanks, > > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > > -- > > > > > > Richard Biener <rguent...@suse.de> SUSE Software Solutions > > > > > > Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, > > AG > > > > > > Nuernberg) > > > > > > > > > > > > > -- > > > > Richard Biener <rguent...@suse.de> > > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 > > > > Nuernberg, Germany; > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > > Nuernberg) > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > Nuernberg)