RE: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Prathamesh Kulkarni via Gcc Mon, 29 Jul 2024 03:54:29 -0700


> -----Original Message-----
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Friday, July 26, 2024 6:51 PM
> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> Cc: gcc@gcc.gnu.org
> Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> from AArch64
> 
> External email: Use caution opening links or attachments
> 
> 
> On Thu, Jul 25, 2024 at 3:36 PM Prathamesh Kulkarni via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > Hi,
> > I am working on enabling offloading to nvptx from AAarch64 host. As
> > mentioned on wiki
> > (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27),
> > I ran make check-target-libgomp on AAarch64 host (and no GPU) with
> following results:
> >
> >                 === libgomp Summary ===
> >
> > # of expected passes            14568
> > # of unexpected failures        1023
> > # of expected failures          309
> > # of untested testcases         54
> > # of unresolved testcases       992
> > # of unsupported tests          644
> >
> > It seems majority of the tests fail due to the following 4 issues:
> >
> > * Compiling a minimal test-case:
> >
> > int main()
> > {
> >   int x;
> >   #pragma omp target map (to: x)
> >   {
> >     x = 0;
> >   }
> >   return x;
> > }
> >
> > Compiling with -fopenmp -foffload=nvptx-none results in following
> issues:
> >
> > (1) Differing values of NUM_POLY_INT_COEFFS between host and
> accelerator, which results in following ICE:
> >
> > 0x1a6e0a7 pp_quoted_string
> >         ../../gcc/gcc/pretty-print.cc:2277
> >  0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
> >         ../../gcc/gcc/pretty-print.cc:1634
> >  0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
> >         ../../gcc/gcc/diagnostic.cc:1612
> >  0x1a4a727 diagnostic_impl
> >         ../../gcc/gcc/diagnostic.cc:1775  0x1a4e20b
> > fatal_error(unsigned int, char const*, ...)
> >         ../../gcc/gcc/diagnostic.cc:2218  0xb3088f
> > lto_input_mode_table(lto_file_decl_data*)
> >          ../../gcc/gcc/lto-streamer-in.cc:2121
> >  0x6f5cdf lto_file_finalize
> >         ../../gcc/gcc/lto/lto-common.cc:2285
> >  0x6f5cdf lto_create_files_from_ids
> >         ../../gcc/gcc/lto/lto-common.cc:2309
> >  0x6f5cdf lto_file_read
> >         ../../gcc/gcc/lto/lto-common.cc:2364
> >  0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
> >         ../../gcc/gcc/lto/lto-common.cc:2812
> >  0x6cfb93 lto_main()
> >         ../../gcc/gcc/lto/lto.cc:658
> >
> > This is already tracked in https://gcc.gnu.org/PR96265 (and related
> > PR's)
> >
> > Streaming out mode_table:
> > mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2,
> size
> > = 8, prec = 64
> >
> > Streaming in mode_table (in lto_input_mode_table):
> > mclass = 2, size = 4, prec = 0
> > (and then calculates the correct mode value by iterating over all
> > modes of mclass starting from narrowest mode)
> >
> > The issue is that the value for prec is not getting streamed-in
> > correctly for SImode as seen above. While streaming out from AArch64
> host, it is 32, but while streaming in for nvptx, it is 0. This
> happens because of differing values of NUM_POLY_INT_COEFFS between
> AArch64 and nvptx backend.
> >
> > Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values
> > for mode, precision would be <4, 0> and <32, 0> respectively
> > (streamed-out in bp_pack_poly_value). Both zeros come from coeffs[1]
> > of size and prec. While streaming in however, NUM_POLY_INT_COEFFS is
> 1 for nvptx, and thus it incorrectly treats <4, 0> as size and
> precision respectively, which is why precision gets streamed in as 0,
> and thus it encounters the above ICE.
> >
> > Supporting non VLA code with offloading:
> >
> > In the general case, it's hard to support offloading for arbitrary
> poly_ints when NUM_POLY_INT_COEFFS differs for host and accelerator.
> > For example, it's not possible to represent a degree-2 poly_int like
> 4 + 4x (as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1.
> >
> > However, IIUC, we can support offloading for restricted set of
> > poly_ints whose degree <= accel's NUM_POLY_INT_COEFFS, since they
> can
> > be represented on accelerator ? For a hypothetical example, if host
> NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I
> suppose we could represent a degree 2 poly_int on accelerator, but not
> a degree 3 poly_int like 3+4x+5x^2 ?
> >
> > Based on that, I have come up with following approach in attached
> "quick-and-dirty" patch (p-163-2.diff):
> > Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during
> lto1, compare it with accelerator's NUM_POLY_INT_COEFFS as follows:
> >
> > Stream in host_num_poly_int_coeffs;
> > if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) //
> NUM_POLY_INT_COEFFS represents accelerator's value here.
> > {
> >     /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS
> words
> > from bitstream.  */ } else if (host_num_poly_int_coeffs <
> > NUM_POLY_INT_COEFFS) {
> >     /* Unpack host_num_poly_int_coeffs words and zero out remaining
> > higher coeffs (similar to zero-extension).  */ } else {
> >     /* Unpack host_num_poly_int_coeffs words and ensure that degree
> of
> > streamed-out poly_int <= NUM_POLY_INT_COEFFS.  */ }
> >
> > For example, with host NUM_POLY_INT_COEFFS == 2 and accel
> > NUM_POLY_INT_COEFFS == 1, this will allow streaming of "degree-1"
> poly_ints like 4+0x (which will degenerate to constant 4), but give an
> error for streaming degree-2 poly_int like 4+4x.
> >
> > Following this approach, I am assuming we can support AArch64/nvptx
> > offloading for non VLA code, since poly_ints used for representing
> > various artefacts like mode_size, mode_precision, vector length etc.
> will be degree-1 poly_int for scalar variables and fixed-length
> vectors (and thus degenerate to constants). With the patch applied, it
> proceeds forward from this point, but fails at a later stage (see
> below).
> >
> > Does this approach look reasonable for supporting offloading for non
> VLA code ?
> > Are there any cases I may have overlooked, where offloading will
> still fail for non-VLA code due to differing NUM_POLY_INT_COEFFS issue
> ?
> 
> I think I'd change how we stream POLY_INTs and make that "independent"
> of NUM_POLY_INT_COEFFS
> in that I'd stream the effective number of coeffs required - as you
> suggest "trailing" zero coeffs do not need to be represented.  We
> always stream coeff zero.
> 
> On read in when that N is bigger than NUM_POLY_INT_COEFFS we have to
> error and we zero-fill extra elements (using the type of coeff zero).
> 
> > (2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if
> > -foffload-abi=lp64/ilp32 After applying workaround for the above
> assertion failure, it hits the following error:
> > gcc: error: unrecognized command-line option '-m64'
> > nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit
> > status compilation terminated.
> >
> > This happens because nvptx/mkoffload.cc:compile_native passes
> > -m64/-m32 to host compiler depending on whether offload_abi is
> OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend doesn't
> recognize these options.
> >
> > I suppose a simple solution to check if host_compiler supports a
> > particular command-line option, would be to create a dummy C file
> and
> > check if the command "host_compiler <opt> dummy_file.c" returns zero
> exit status. Alternative could be to check exit status for
> "host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is
> fixed, but I am not sure if either is an ideal solution.
> 
> I think an explicit configure when building the offload compiler would
> be cleaner, like simply adding an additional
> AC_SUBST(host_no_multilibs) var computed from the --enable-as-
> accelerator-for target triplet?
> 
> > With workarounds for these 2 issues, the minimal test builds and
> runs successfully.
> >
> > (3) Assertion error in lto_read_decls during lto1:
> > There are several failures (~350+) in the testsuite caused due to
> the
> > following assert in lto_read_decls:
> > gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
> >
> > AFAIU, this seems to happen because of presence of LTO_null tag. The
> > following workaround avoids hitting the assert, but am not sure if
> it's the right fix:
> >
> >         t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
> > +       if (t == NULL_TREE)
> > +         continue;
> >         gcc_assert (data_in->reader_cache->nodes.length () == from +
> > 1);
> >
> > (FWIW, this was reproducible with the above minimal test, but has
> > seemingly gone away for it after updating the sources recently, but
> > still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.)
> 
> It looks like a discrepancy between host and target that wasn't
> properly analyzed to me.  The assert makes sure we read the expected
> number of tree nodes - what's the tag when you hit this?
> Please open a bug.
Sure, will do shortly.


I checked with for-9.c, and it shows the following state on hitting the assert:
reader_cache->nodes.length () = 2384, from+1 = 2385, tag: LTO_null

From lto_read_decls:
      unsigned from = data_in->reader_cache->nodes.length ();
      /* Read and uniquify SCCs as in the input stream.  */
      enum LTO_tags tag = streamer_read_record_start (&ib_main);
      if (tag == LTO_tree_scc || tag == LTO_trees)

In this case, tag is LTO_null, and it skips to the else part:
      t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
      gcc_assert (data_in->reader_cache->nodes.length () == from + 1); 

and lto_input_tree_1 simply returns NULL_TREE if tag is LTO_null without any 
additional streaming-in.
So IIUC, data_in->reader_cache->nodes.length () won't change since there is no 
tree node streamed in lto_input_tree_1, and
thus reader_cache->length() remains equal to from (rather than from + 1), which 
triggers the assert ?
> 
> >
> > (4) AAarch64 uses OImode for 256-bit size array, which is not
> supported on nvptx:
> > This causes ~18 tests to fail.
> >
> > Can be reproduced with following simple test:
> > int main()
> > {
> >   long c[4];
> >   #pragma omp target map(c)
> >     c[0] = 0;
> >   return 0;
> > }
> >
> > Compiling with -O2 -fopenmp -foffload=nvptx-none results in:
> > lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported
> > (mode 'OI') compilation terminated.
> > nvptx mkoffload: fatal error:
> > ../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> returned 1 exit status compilation terminated.
> >
> > This happens with AArch64 host because, it uses OImode (256-bit
> > integer mode) for ARRAY_TYPE (long c[4] fits 256-bits), which isn't
> > supported on nvptx. This decision is made during layout_type for
> 'c', which calls mode_for_array, and mode_for_array uses target hooks
> array_mode and array_mode_supported_p to determine target-specific
> modes to use for ARRAY_TYPE.
> > For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE.
> >
> > I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls
> > back to using BLKmode for ARRAY_TYPE if offloading is enabled, and
> > avoids streaming-out target-specific int modes in
> > lto_write_mode_table. I used default_scalar_mode_supported_p check
> to test if the int_mode is "generic", but not sure if that's entirely
> correct. The test compiles and runs OK with patch applied. I suppose a
> more general solution would be to somehow "intersect" available
> AArch64 modes with nvptx modes, and use those for offloading ?
> 
> Hmm, I think we shouldn't stream modes for aggregates but instead let
> stor-layout re-assign them.  They are not usually expected to match.
> Not sure if we should do that generally or only for offload streaming
> though (it does have an overhead to recompute the mode).  We could
> stream VOIDmode here as indicator it needs recomputation (splitting
> out sub-workers from layout_type, there's one already for record
> type).
Thanks, this indeed sounds better. I will try to work on patch following this 
approach.

Thanks,
Prathamesh
> 
> > With local workarounds for the above 4 issues, running make check-
> target-libgomp shows following results:
> >
> >                 === libgomp Summary ===
> >
> > # of expected passes            16604
> > # of unexpected failures        10
> > # of expected failures          309
> > # of untested testcases         54
> > # of unresolved testcases       3
> > # of unsupported tests          643
> >
> > The remaining issues are:
> >
> > (5) "error: alias definitions not supported in this configuration"
> > Fails for pr96390.c, and pr96390.C. This seems to be related to
> > https://gcc.gnu.org/PR97102
> >
> > (6) Execution Failures:
> > - libgomp/pr104783.c
> > - libgomp/pr104783-2.c
> > I haven't investigated these yet.
> >
> > (7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-
> kernels-1.c and following excess errors:
> > acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop
> > parallelism
> > acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop
> > parallelism
> > acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop
> > parallelism
> >
> > So far, I have only been testing make check-target-libgomp. Should I
> be testing any additional parts of the testsuite for offloading
> changes ?
> >
> > My initial goals are:
> > (a) To get AArch64/nvptx offloading to work for above minimal test.
> > (b) Testsuite results for libgomp on par with x86_64 for non VLA
> code (as far as correctness is concerned).
> > (c) After (a) and (b) are in place, try to enable support for
> offloading with VLA/SVE.
> >
> > I am planning to address these issues and will post patches for the
> same shortly. I will be grateful for any feedback or suggestions on
> how to proceed forward.
> >
> > Thanks,
> > Prathamesh

RE: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Reply via email to