RE: [RFC] Enabling SVE with offloading to nvptx

Prathamesh Kulkarni via Gcc Thu, 17 Oct 2024 06:22:34 -0700

> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: 16 October 2024 13:05
> To: Prathamesh Kulkarni <[email protected]>
> Cc: [email protected]; Thomas Schwinge <[email protected]>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> 
> External email: Use caution opening links or attachments
> 
> 
> On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote:
> 
> > Hi,
> > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results in ~60
> UNRESOLVED errors with following error message:
> >
> > lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS'
> > compilation terminated.
> > nvptx mkoffload: fatal error:
> > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> returned 1 exit status compilation terminated.
> >
> > This behaviour can be reproduced with the following simple test-case
> with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2:
> >
> > #define N 1000
> > int main ()
> > {
> >   int i;
> >   int A[N] = {0}, B[N] = {0};
> >
> >   #pragma omp target map(i), map(tofrom: A), map(from: B)
> >   #pragma omp simd
> >   for (i = 0; i < N; i++)
> >     A[i] = A[i] + B[i];
> >   return A[0];
> > }
> >
> > omplower pass lowers the above loop to the following:
> >
> >                 D.4576 = .GOMP_USE_SIMT ();
> >                 if (D.4576 != 0) goto <D.4577>; else goto <D.4578>;
> >                 <D.4577>:
> >                 {
> >                   unsigned int D.4586;
> >                   unsigned int D.4587;
> >                   int D.4588;
> >                   void * simduid.5;
> >                   void * .omp_simt.6;
> >                   int D.4596;
> >                   _Bool D.4597;
> >                   int D.4598;
> >                   unsigned int D.4599;
> >                   int D.4600;
> >                   int D.4601;
> >                   int * D.4602;
> >                   int i [value-expr: D.4588];
> >                   int i.0;
> >
> >                   simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588);
> >                   .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5);
> >                   D.4587 = 0;
> >                   i.0 = 0;
> >                   #pragma omp simd safelen(32) _simduid_(simduid.5)
> _simt_ linear(i.0:1) linear(i:1)
> >                   for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1)
> >                   ...
> >                 }
> >                 goto <D.4579>;
> >                 <D.4578>:
> >                 {
> >                   unsigned int D.4603;
> >                   unsigned int D.4604;
> >                   int D.4605[0:POLY_INT_CST [15, 16]];
> >                   void * simduid.7;
> >                   unsigned int D.4612;
> >                   int * D.4613;
> >                   int D.4614;
> >                   int i [value-expr: D.4605[D.4604]];
> >                   int i.0;
> >
> >                   D.4604 = 0;
> >                   i.0 = 0;
> >                   #pragma omp simd safelen(POLY_INT_CST [16, 16])
> _simduid_(simduid.7) linear(i.0:1) linear(i:1)
> >                   ...
> >                  }
> >                  <D.4579>:
> >                  ...
> >
> > For offloading to SIMT based device like nvptx, scan_omp_simd
> > duplicates lowering of simd pragma into if-else where the if-part
> > contains simt code-path, and else-part contains simd code-path. In
> lower_rec_simd_input_clauses, max_vf is set to 16+16x for the above case
> as determined by omp_max_vf, and that becomes length of the omp simd
> array:
> > int D.4605[0:POLY_INT_CST [15, 16]];
> >
> > The issue here is that, the function containing above if-else
> > condition gets streamed out to LTO bytecode including the simd code-
> path and the omp simd array, whose domain is [0:POLY_INT_CST[15, 16]],
> and thus we get the above error while streaming-in POLY_INT_CST in
> lto_input_ts_poly_tree_pointers on device side.
> >
> > Note that, the simd code-path is essentially dead-code on nvptx, since
> > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and later
> > optimization passes (ccp2) remove the dead-code path and unused omp
> simd arrays while compiling to device. So in this case, we aren't really
> mapping POLY_INT_CST from host to device, but it gets streamed out to
> device as an artefact of omp simd lowering.
> >
> > I suppose a proper fix here would be to (somehow) defer lowering of
> > omp pragma simd after streaming out to device, so the device only sees
> > simt code-path, and the host only sees simd code path ? Or perhaps
> clone each function in offload region, one for host and one for SIMT
> device, and only stream the device versions to avoid streaming out host-
> specific IR changes ?
> 
> There is currently no way to have the host compiler query offload target
> capabilities so the only true fix is to delay OMP SIMD lowering to the
> target.
Um, I thought we could use omp_max_simt_vf from host to query if the offload 
target is SIMT ?
The function essentially iterates thru OFFLOAD_TARGET_NAMES and returns 
non-zero for nvptx.
> 
> Are we dependent on the early optimization pipeline being run on the
> host to produce the offload IL?  There's some oddball OACC passes in
> pass_ipa_oacc.
> 
> That said, I'd probably try to produce clones with unlowered IL and skip
> those clones from all processing from that point and resume in the
> offload compiler.
> 
> > I thought of following approaches as workarounds:
> 
> I don't think any workaround will fly in the end.  Can't you simply
> force SVE to be off for offload clones on the host side and force OMP
> lowering with ADVSIMD only?
Would it be correct to set:
sctx.max_vf = constant_lower_bound (omp_max_vf ())


if function is offloaded and omp_max_vf returns non-constant poly_int,
to force max_vf to be VLS, which will avoid VLA vectorization as in the 
attached patch ?

Or should we modify autovectorize_vector_modes hook to return VLS modes for 
offloaded functions ?

Thanks,
Prathamesh
> 
> Richard.
> 
> > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in
> > lower_rec_simd_input_clauses, if the function is going to be offloaded
> > and omp_max_vf returns non-constant poly_int. For above case, it sets
> max_vf to 16 instead of 16+16x which seems to resolve the issue, but
> it'd use suboptimal max VF for host ? This is done in patch p-283-2.txt.
> >
> > However, with clean trunk it still seems to use max_vf = 16 after
> disabling the above error.
> > vect dump shows:
> >
> > (compute_affine_dependence
> >   ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51];
> >   ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27;
> > ) -> dependence analysis failed
> > foo.c:10:13: note:   dependence distance  = 0.
> > foo.c:10:13: note:   dependence distance == 0 between (*_23)[i.0_51]
> and (*_23)[i.0_51]
> > foo.c:10:13: missed:  bad data dependence.
> > foo.c:10:13: note:  ***** Analysis failed with vector mode VNx4SI
> >
> > This seems to happen because, loop->safelen is set to 16 by taking
> > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd:
> >
> >       if (!poly_int_tree_p (safelen, &val))
> >         safelen_int = 0;
> >       else
> >         safelen_int = MIN (constant_lower_bound (val), INT_MAX);
> >
> > and fails to vectorize with VLA vectors, because max_vf == 16 and
> min_vf == 4+4x resulting in bad data dependence due to:
> >
> >   if (max_vf != MAX_VECTORIZATION_FACTOR
> >       && maybe_lt (max_vf, min_vf))
> >     return opt_result::failure_at (vect_location, "bad data
> > dependence.\n");
> >
> > If safelen was (somehow) set to 16+16x, I guess it could have used
> VF=4+4x and vectorized with VLA vectors.
> > but I suppose that's a separate issue ?
> >
> > [2] Since the issue seems to be only with streaming out length of omp
> > simd array when it's POLY_INT_CST, could we perhaps use a place holder
> > length during omp lowering and compute the correct length after
> > streaming out, so POLY_INT_CST doesn't get leaked into bytecode ? The
> attached patch p-283-3.txt follows this approach by using bogus length
> INT_MAX in lower_rec_simd_input_clauses if offloading to SIMT device and
> max_vf is non-constant poly_int, and later computing the correct length
> in beginning of vect pass by setting it to omp_max_vf (), but I am not
> sure if this is entirely correct.
> > I am assuming that creating omp simd array of bogus length will not be
> > an issue for nvptx since it will never get referenced and eventually
> > be removed by remove_unused_locals ? If it'd not be a good idea to
> rely on the pass pipeline to eliminate simd code-path and omp simd array
> while compiling to device, it could be possibly done during
> omp_lower_device pass itself ?
> >
> > [3] While streaming-in POLY_INT_CST, avoid emitting error immediately
> > if degree of POLY_INT_CST exceeds accel's NUM_POLY_INT_COEFFS to
> > ignore POLY_INT_CSTs that may potentially occur on dead-code path, and
> > instead mark it as error_mark_node. For the above case, since
> POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST with
> higher degree than accel's NUM_POLY_INT_COEFFS would be "harmless". And
> detect invalid POLY_INT_CST's in expand pass (if it survives till this
> point), and emit above error, but not sure if that'd be the right place
> ?
> > This is done in p-283-4.txt.
> >
> > All the three patches fix UNRESOLVED tests due to POLY_INT_CST
> streaming error in libgomp testsuite with -mcpu=generic+sve2.
> > (Altho it introduces a strange FAIL for data-5.f90, which I am
> investigating).
> > I would be grateful for suggestions on how to proceed.
> >
> > Signed-off-by: Prathamesh Kulkarni <[email protected]>
> >
> > Thanks,
> > Prathamesh
> >
> 
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)

Set max_vf to constant lower bound if max_vf is poly_int and offloading to 
device. 

gcc/
        * omp-low.cc (lower_rec_simd_input_clauses): Set max_vf to
        constant_lower_bound (omp_max_vf()) if offloading is enabled and
        max_vf is POLY_INT_CST.

Signed-off-by: Prathamesh Kulkarni <[email protected]>

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index da2051b0279..780ea396b7f 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4588,7 +4588,20 @@ lower_rec_simd_input_clauses (tree new_var, omp_context 
*ctx,
 {
   if (known_eq (sctx->max_vf, 0U))
     {
-      sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf ();
+      if (sctx->is_simt)
+       sctx->max_vf = omp_max_simt_vf ();
+      else
+       {
+         poly_uint64 max_vf = omp_max_vf ();
+         /* FIXME: If the function is going to be offloaded,
+            and max_vf is poly_int, use constant_lower_bound as safelen,
+            to avoid streaming out omp simd arrays having poly_int_cst
+            size.  */
+         if (omp_maybe_offloaded_ctx (ctx)
+             && !max_vf.is_constant ())
+           max_vf = constant_lower_bound (max_vf);
+         sctx->max_vf = max_vf;
+       }
       if (maybe_gt (sctx->max_vf, 1U))
        {
          tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),

RE: [RFC] Enabling SVE with offloading to nvptx

Reply via email to