On Fri, 18 Oct 2024, Prathamesh Kulkarni wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: 17 October 2024 19:18
> > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > Subject: RE: [RFC] Enabling SVE with offloading to nvptx
> > 
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Thu, 17 Oct 2024, Prathamesh Kulkarni wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: 16 October 2024 13:05
> > > > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > > > Cc: gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
> > > > Subject: Re: [RFC] Enabling SVE with offloading to nvptx
> > > >
> > > > External email: Use caution opening links or attachments
> > > >
> > > >
> > > > On Tue, 15 Oct 2024, Prathamesh Kulkarni wrote:
> > > >
> > > > > Hi,
> > > > > Testing libgomp with SVE enabled (-mcpu=generic+sve2), results
> > in
> > > > > ~60
> > > > UNRESOLVED errors with following error message:
> > > > >
> > > > > lto1: fatal error: degree of 'poly_int' exceeds
> > 'NUM_POLY_INT_COEFFS'
> > > > > compilation terminated.
> > > > > nvptx mkoffload: fatal error:
> > > > > ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> > > > returned 1 exit status compilation terminated.
> > > > >
> > > > > This behaviour can be reproduced with the following simple
> > > > > test-case
> > > > with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2:
> > > > >
> > > > > #define N 1000
> > > > > int main ()
> > > > > {
> > > > >   int i;
> > > > >   int A[N] = {0}, B[N] = {0};
> > > > >
> > > > >   #pragma omp target map(i), map(tofrom: A), map(from: B)
> > > > >   #pragma omp simd
> > > > >   for (i = 0; i < N; i++)
> > > > >     A[i] = A[i] + B[i];
> > > > >   return A[0];
> > > > > }
> > > > >
> > > > > omplower pass lowers the above loop to the following:
> > > > >
> > > > >                 D.4576 = .GOMP_USE_SIMT ();
> > > > >                 if (D.4576 != 0) goto <D.4577>; else goto
> > <D.4578>;
> > > > >                 <D.4577>:
> > > > >                 {
> > > > >                   unsigned int D.4586;
> > > > >                   unsigned int D.4587;
> > > > >                   int D.4588;
> > > > >                   void * simduid.5;
> > > > >                   void * .omp_simt.6;
> > > > >                   int D.4596;
> > > > >                   _Bool D.4597;
> > > > >                   int D.4598;
> > > > >                   unsigned int D.4599;
> > > > >                   int D.4600;
> > > > >                   int D.4601;
> > > > >                   int * D.4602;
> > > > >                   int i [value-expr: D.4588];
> > > > >                   int i.0;
> > > > >
> > > > >                   simduid.5 = .GOMP_SIMT_ENTER (simduid.5,
> > &D.4588);
> > > > >                   .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC
> > (simduid.5);
> > > > >                   D.4587 = 0;
> > > > >                   i.0 = 0;
> > > > >                   #pragma omp simd safelen(32)
> > > > > _simduid_(simduid.5)
> > > > _simt_ linear(i.0:1) linear(i:1)
> > > > >                   for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1)
> > > > >                   ...
> > > > >                 }
> > > > >                 goto <D.4579>;
> > > > >                 <D.4578>:
> > > > >                 {
> > > > >                   unsigned int D.4603;
> > > > >                   unsigned int D.4604;
> > > > >                   int D.4605[0:POLY_INT_CST [15, 16]];
> > > > >                   void * simduid.7;
> > > > >                   unsigned int D.4612;
> > > > >                   int * D.4613;
> > > > >                   int D.4614;
> > > > >                   int i [value-expr: D.4605[D.4604]];
> > > > >                   int i.0;
> > > > >
> > > > >                   D.4604 = 0;
> > > > >                   i.0 = 0;
> > > > >                   #pragma omp simd safelen(POLY_INT_CST [16,
> > 16])
> > > > _simduid_(simduid.7) linear(i.0:1) linear(i:1)
> > > > >                   ...
> > > > >                  }
> > > > >                  <D.4579>:
> > > > >                  ...
> > > > >
> > > > > For offloading to SIMT based device like nvptx, scan_omp_simd
> > > > > duplicates lowering of simd pragma into if-else where the if-
> > part
> > > > > contains simt code-path, and else-part contains simd code-path.
> > In
> > > > lower_rec_simd_input_clauses, max_vf is set to 16+16x for the
> > above
> > > > case as determined by omp_max_vf, and that becomes length of the
> > omp
> > > > simd
> > > > array:
> > > > > int D.4605[0:POLY_INT_CST [15, 16]];
> > > > >
> > > > > The issue here is that, the function containing above if-else
> > > > > condition gets streamed out to LTO bytecode including the simd
> > > > > code-
> > > > path and the omp simd array, whose domain is [0:POLY_INT_CST[15,
> > > > 16]], and thus we get the above error while streaming-in
> > > > POLY_INT_CST in lto_input_ts_poly_tree_pointers on device side.
> > > > >
> > > > > Note that, the simd code-path is essentially dead-code on nvptx,
> > > > > since
> > > > > .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and
> > > > > later optimization passes (ccp2) remove the dead-code path and
> > > > > unused omp
> > > > simd arrays while compiling to device. So in this case, we aren't
> > > > really mapping POLY_INT_CST from host to device, but it gets
> > > > streamed out to device as an artefact of omp simd lowering.
> > > > >
> > > > > I suppose a proper fix here would be to (somehow) defer lowering
> > > > > of omp pragma simd after streaming out to device, so the device
> > > > > only sees simt code-path, and the host only sees simd code path
> > ?
> > > > > Or perhaps
> > > > clone each function in offload region, one for host and one for
> > SIMT
> > > > device, and only stream the device versions to avoid streaming out
> > > > host- specific IR changes ?
> > > >
> > > > There is currently no way to have the host compiler query offload
> > > > target capabilities so the only true fix is to delay OMP SIMD
> > > > lowering to the target.
> > > Um, I thought we could use omp_max_simt_vf from host to query if the
> > offload target is SIMT ?
> > > The function essentially iterates thru OFFLOAD_TARGET_NAMES and
> > returns non-zero for nvptx.
> > > >
> > > > Are we dependent on the early optimization pipeline being run on
> > the
> > > > host to produce the offload IL?  There's some oddball OACC passes
> > in
> > > > pass_ipa_oacc.
> > > >
> > > > That said, I'd probably try to produce clones with unlowered IL
> > and
> > > > skip those clones from all processing from that point and resume
> > in
> > > > the offload compiler.
> > > >
> > > > > I thought of following approaches as workarounds:
> > > >
> > > > I don't think any workaround will fly in the end.  Can't you
> > simply
> > > > force SVE to be off for offload clones on the host side and force
> > > > OMP lowering with ADVSIMD only?
> > > Would it be correct to set:
> > > sctx.max_vf = constant_lower_bound (omp_max_vf ())
> > >
> > > if function is offloaded and omp_max_vf returns non-constant
> > poly_int,
> > > to force max_vf to be VLS, which will avoid VLA vectorization as in
> > the attached patch ?
> > >
> > > Or should we modify autovectorize_vector_modes hook to return VLS
> > > modes for offloaded functions ?
> > 
> > Can't you simply put a target(march=armv8.3a) (or even more basic ISA)
> > on the OMP target clones?  Or is the lowering happening before
> > outlining?
> AFAIU, scan_omp_target in omp-lower creates a "child function" but doesn't 
> outline it.
> The actual outlining seems to be happening during omp-expand pass.
> 
> > In that case maybe we want to switch the host target into "offload-
> > lowering-mode" (effectively switching to a basic ISA)?
> Sorry, I didn't understand -- Do you mean we should enforce basic ISA for all 
> the functions including host-only ones ?

No, not for host-only.

> Since the issue stems from having host details leaked into IL during 
> omp-lowering (and omp-expand), I was wondering if
> we should use default versions of the hooks if offloading is enabled in 
> omp-lower and omp-expand, which should avoid host details
> creeping into offload IL, sth like in the attached patch ?

Ick.  No.  I think we should avoid lowering parts applied when the
pieces are not outlined yet and ensure those offload functions have
a "basic ISA" (which the target needs to define) - the target already
has control over the function decl built (IIRC) so it can attach
required target attributes (and we can ensure those are dropped on
the offload compiler side again).

The alternative would be to go with something along your patch in
this mail but do

 poly_uint64
-omp_max_vf (void)
+omp_max_vf (bool offload)
 {   
...
  if (offload)
    get_me_magic_value_for_configured_offload_device ()

but iff we're doing the lowering before outlining then if we have
configured two different offload devices we'll have to come up
with a max_vf that's suitable for all offload devices?  I think
this shows a least this part of the lowering is done in the wrong
place?

Maybe we can have the max_vf "symbolic"?  Like .IFN_MAX_VF ()
and lower that on the offload side only?

Richard.


> Thanks,
> Prathamesh
> > 
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Richard.
> > > >
> > > > > [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in
> > > > > lower_rec_simd_input_clauses, if the function is going to be
> > > > > offloaded and omp_max_vf returns non-constant poly_int. For
> > above
> > > > > case, it sets
> > > > max_vf to 16 instead of 16+16x which seems to resolve the issue,
> > but
> > > > it'd use suboptimal max VF for host ? This is done in patch p-283-
> > 2.txt.
> > > > >
> > > > > However, with clean trunk it still seems to use max_vf = 16
> > after
> > > > disabling the above error.
> > > > > vect dump shows:
> > > > >
> > > > > (compute_affine_dependence
> > > > >   ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51];
> > > > >   ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27;
> > > > > ) -> dependence analysis failed
> > > > > foo.c:10:13: note:   dependence distance  = 0.
> > > > > foo.c:10:13: note:   dependence distance == 0 between
> > (*_23)[i.0_51]
> > > > and (*_23)[i.0_51]
> > > > > foo.c:10:13: missed:  bad data dependence.
> > > > > foo.c:10:13: note:  ***** Analysis failed with vector mode
> > VNx4SI
> > > > >
> > > > > This seems to happen because, loop->safelen is set to 16 by
> > taking
> > > > > MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd:
> > > > >
> > > > >       if (!poly_int_tree_p (safelen, &val))
> > > > >         safelen_int = 0;
> > > > >       else
> > > > >         safelen_int = MIN (constant_lower_bound (val), INT_MAX);
> > > > >
> > > > > and fails to vectorize with VLA vectors, because max_vf == 16
> > and
> > > > min_vf == 4+4x resulting in bad data dependence due to:
> > > > >
> > > > >   if (max_vf != MAX_VECTORIZATION_FACTOR
> > > > >       && maybe_lt (max_vf, min_vf))
> > > > >     return opt_result::failure_at (vect_location, "bad data
> > > > > dependence.\n");
> > > > >
> > > > > If safelen was (somehow) set to 16+16x, I guess it could have
> > used
> > > > VF=4+4x and vectorized with VLA vectors.
> > > > > but I suppose that's a separate issue ?
> > > > >
> > > > > [2] Since the issue seems to be only with streaming out length
> > of
> > > > > omp simd array when it's POLY_INT_CST, could we perhaps use a
> > > > > place holder length during omp lowering and compute the correct
> > > > > length after streaming out, so POLY_INT_CST doesn't get leaked
> > > > > into bytecode ? The
> > > > attached patch p-283-3.txt follows this approach by using bogus
> > > > length INT_MAX in lower_rec_simd_input_clauses if offloading to
> > SIMT
> > > > device and max_vf is non-constant poly_int, and later computing
> > the
> > > > correct length in beginning of vect pass by setting it to
> > omp_max_vf
> > > > (), but I am not sure if this is entirely correct.
> > > > > I am assuming that creating omp simd array of bogus length will
> > > > > not be an issue for nvptx since it will never get referenced and
> > > > > eventually be removed by remove_unused_locals ? If it'd not be a
> > > > > good idea to
> > > > rely on the pass pipeline to eliminate simd code-path and omp simd
> > > > array while compiling to device, it could be possibly done during
> > > > omp_lower_device pass itself ?
> > > > >
> > > > > [3] While streaming-in POLY_INT_CST, avoid emitting error
> > > > > immediately if degree of POLY_INT_CST exceeds accel's
> > > > > NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that may potentially
> > > > > occur on dead-code path, and instead mark it as error_mark_node.
> > > > > For the above case, since
> > > > POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST
> > with
> > > > higher degree than accel's NUM_POLY_INT_COEFFS would be
> > "harmless".
> > > > And detect invalid POLY_INT_CST's in expand pass (if it survives
> > > > till this point), and emit above error, but not sure if that'd be
> > > > the right place ?
> > > > > This is done in p-283-4.txt.
> > > > >
> > > > > All the three patches fix UNRESOLVED tests due to POLY_INT_CST
> > > > streaming error in libgomp testsuite with -mcpu=generic+sve2.
> > > > > (Altho it introduces a strange FAIL for data-5.f90, which I am
> > > > investigating).
> > > > > I would be grateful for suggestions on how to proceed.
> > > > >
> > > > > Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > > > Nuernberg)
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to