Re: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Richard Biener via Gcc Mon, 29 Jul 2024 06:49:51 -0700

On Mon, Jul 29, 2024 at 1:35 PM Prathamesh Kulkarni
<prathame...@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Friday, July 26, 2024 6:51 PM
> > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > Cc: gcc@gcc.gnu.org
> > Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> > from AArch64
> >
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, Jul 25, 2024 at 3:36 PM Prathamesh Kulkarni via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > > I am working on enabling offloading to nvptx from AAarch64 host. As
> > > mentioned on wiki
> > > (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27),
> > > I ran make check-target-libgomp on AAarch64 host (and no GPU) with
> > following results:
> > >
> > >                 === libgomp Summary ===
> > >
> > > # of expected passes            14568
> > > # of unexpected failures        1023
> > > # of expected failures          309
> > > # of untested testcases         54
> > > # of unresolved testcases       992
> > > # of unsupported tests          644
> > >
> > > It seems majority of the tests fail due to the following 4 issues:
> > >
> > > * Compiling a minimal test-case:
> > >
> > > int main()
> > > {
> > >   int x;
> > >   #pragma omp target map (to: x)
> > >   {
> > >     x = 0;
> > >   }
> > >   return x;
> > > }
> > >
> > > Compiling with -fopenmp -foffload=nvptx-none results in following
> > issues:
> > >
> > > (1) Differing values of NUM_POLY_INT_COEFFS between host and
> > accelerator, which results in following ICE:
> > >
> > > 0x1a6e0a7 pp_quoted_string
> > >         ../../gcc/gcc/pretty-print.cc:2277
> > >  0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
> > >         ../../gcc/gcc/pretty-print.cc:1634
> > >  0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
> > >         ../../gcc/gcc/diagnostic.cc:1612
> > >  0x1a4a727 diagnostic_impl
> > >         ../../gcc/gcc/diagnostic.cc:1775  0x1a4e20b
> > > fatal_error(unsigned int, char const*, ...)
> > >         ../../gcc/gcc/diagnostic.cc:2218  0xb3088f
> > > lto_input_mode_table(lto_file_decl_data*)
> > >          ../../gcc/gcc/lto-streamer-in.cc:2121
> > >  0x6f5cdf lto_file_finalize
> > >         ../../gcc/gcc/lto/lto-common.cc:2285
> > >  0x6f5cdf lto_create_files_from_ids
> > >         ../../gcc/gcc/lto/lto-common.cc:2309
> > >  0x6f5cdf lto_file_read
> > >         ../../gcc/gcc/lto/lto-common.cc:2364
> > >  0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
> > >         ../../gcc/gcc/lto/lto-common.cc:2812
> > >  0x6cfb93 lto_main()
> > >         ../../gcc/gcc/lto/lto.cc:658
> > >
> > > This is already tracked in https://gcc.gnu.org/PR96265 (and related
> > > PR's)
> > >
> > > Streaming out mode_table:
> > > mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2,
> > size
> > > = 8, prec = 64
> > >
> > > Streaming in mode_table (in lto_input_mode_table):
> > > mclass = 2, size = 4, prec = 0
> > > (and then calculates the correct mode value by iterating over all
> > > modes of mclass starting from narrowest mode)
> > >
> > > The issue is that the value for prec is not getting streamed-in
> > > correctly for SImode as seen above. While streaming out from AArch64
> > host, it is 32, but while streaming in for nvptx, it is 0. This
> > happens because of differing values of NUM_POLY_INT_COEFFS between
> > AArch64 and nvptx backend.
> > >
> > > Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values
> > > for mode, precision would be <4, 0> and <32, 0> respectively
> > > (streamed-out in bp_pack_poly_value). Both zeros come from coeffs[1]
> > > of size and prec. While streaming in however, NUM_POLY_INT_COEFFS is
> > 1 for nvptx, and thus it incorrectly treats <4, 0> as size and
> > precision respectively, which is why precision gets streamed in as 0,
> > and thus it encounters the above ICE.
> > >
> > > Supporting non VLA code with offloading:
> > >
> > > In the general case, it's hard to support offloading for arbitrary
> > poly_ints when NUM_POLY_INT_COEFFS differs for host and accelerator.
> > > For example, it's not possible to represent a degree-2 poly_int like
> > 4 + 4x (as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1.
> > >
> > > However, IIUC, we can support offloading for restricted set of
> > > poly_ints whose degree <= accel's NUM_POLY_INT_COEFFS, since they
> > can
> > > be represented on accelerator ? For a hypothetical example, if host
> > NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I
> > suppose we could represent a degree 2 poly_int on accelerator, but not
> > a degree 3 poly_int like 3+4x+5x^2 ?
> > >
> > > Based on that, I have come up with following approach in attached
> > "quick-and-dirty" patch (p-163-2.diff):
> > > Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during
> > lto1, compare it with accelerator's NUM_POLY_INT_COEFFS as follows:
> > >
> > > Stream in host_num_poly_int_coeffs;
> > > if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) //
> > NUM_POLY_INT_COEFFS represents accelerator's value here.
> > > {
> > >     /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS
> > words
> > > from bitstream.  */ } else if (host_num_poly_int_coeffs <
> > > NUM_POLY_INT_COEFFS) {
> > >     /* Unpack host_num_poly_int_coeffs words and zero out remaining
> > > higher coeffs (similar to zero-extension).  */ } else {
> > >     /* Unpack host_num_poly_int_coeffs words and ensure that degree
> > of
> > > streamed-out poly_int <= NUM_POLY_INT_COEFFS.  */ }
> > >
> > > For example, with host NUM_POLY_INT_COEFFS == 2 and accel
> > > NUM_POLY_INT_COEFFS == 1, this will allow streaming of "degree-1"
> > poly_ints like 4+0x (which will degenerate to constant 4), but give an
> > error for streaming degree-2 poly_int like 4+4x.
> > >
> > > Following this approach, I am assuming we can support AArch64/nvptx
> > > offloading for non VLA code, since poly_ints used for representing
> > > various artefacts like mode_size, mode_precision, vector length etc.
> > will be degree-1 poly_int for scalar variables and fixed-length
> > vectors (and thus degenerate to constants). With the patch applied, it
> > proceeds forward from this point, but fails at a later stage (see
> > below).
> > >
> > > Does this approach look reasonable for supporting offloading for non
> > VLA code ?
> > > Are there any cases I may have overlooked, where offloading will
> > still fail for non-VLA code due to differing NUM_POLY_INT_COEFFS issue
> > ?
> >
> > I think I'd change how we stream POLY_INTs and make that "independent"
> > of NUM_POLY_INT_COEFFS
> > in that I'd stream the effective number of coeffs required - as you
> > suggest "trailing" zero coeffs do not need to be represented.  We
> > always stream coeff zero.
> >
> > On read in when that N is bigger than NUM_POLY_INT_COEFFS we have to
> > error and we zero-fill extra elements (using the type of coeff zero).
> >
> > > (2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if
> > > -foffload-abi=lp64/ilp32 After applying workaround for the above
> > assertion failure, it hits the following error:
> > > gcc: error: unrecognized command-line option '-m64'
> > > nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit
> > > status compilation terminated.
> > >
> > > This happens because nvptx/mkoffload.cc:compile_native passes
> > > -m64/-m32 to host compiler depending on whether offload_abi is
> > OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend doesn't
> > recognize these options.
> > >
> > > I suppose a simple solution to check if host_compiler supports a
> > > particular command-line option, would be to create a dummy C file
> > and
> > > check if the command "host_compiler <opt> dummy_file.c" returns zero
> > exit status. Alternative could be to check exit status for
> > "host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is
> > fixed, but I am not sure if either is an ideal solution.
> >
> > I think an explicit configure when building the offload compiler would
> > be cleaner, like simply adding an additional
> > AC_SUBST(host_no_multilibs) var computed from the --enable-as-
> > accelerator-for target triplet?
> >
> > > With workarounds for these 2 issues, the minimal test builds and
> > runs successfully.
> > >
> > > (3) Assertion error in lto_read_decls during lto1:
> > > There are several failures (~350+) in the testsuite caused due to
> > the
> > > following assert in lto_read_decls:
> > > gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
> > >
> > > AFAIU, this seems to happen because of presence of LTO_null tag. The
> > > following workaround avoids hitting the assert, but am not sure if
> > it's the right fix:
> > >
> > >         t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
> > > +       if (t == NULL_TREE)
> > > +         continue;
> > >         gcc_assert (data_in->reader_cache->nodes.length () == from +
> > > 1);
> > >
> > > (FWIW, this was reproducible with the above minimal test, but has
> > > seemingly gone away for it after updating the sources recently, but
> > > still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.)
> >
> > It looks like a discrepancy between host and target that wasn't
> > properly analyzed to me.  The assert makes sure we read the expected
> > number of tree nodes - what's the tag when you hit this?
> > Please open a bug.
> Sure, will do shortly.
>
> I checked with for-9.c, and it shows the following state on hitting the 
> assert:
> reader_cache->nodes.length () = 2384, from+1 = 2385, tag: LTO_null
>
> From lto_read_decls:
>       unsigned from = data_in->reader_cache->nodes.length ();
>       /* Read and uniquify SCCs as in the input stream.  */
>       enum LTO_tags tag = streamer_read_record_start (&ib_main);
>       if (tag == LTO_tree_scc || tag == LTO_trees)
>
> In this case, tag is LTO_null, and it skips to the else part:
>       t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
>       gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
>
> and lto_input_tree_1 simply returns NULL_TREE if tag is LTO_null without any 
> additional streaming-in.
> So IIUC, data_in->reader_cache->nodes.length () won't change since there is 
> no tree node streamed in lto_input_tree_1, and
> thus reader_cache->length() remains equal to from (rather than from + 1), 
> which triggers the assert ?


Interesting.  The thing is that we don't expect LTO_null when
streaming in global decls
and types - we wouldn't have streamed a NULL decl or type.  So can you
debug this
on the writer side and see where we stream such?

Richard.

> >
> > >
> > > (4) AAarch64 uses OImode for 256-bit size array, which is not
> > supported on nvptx:
> > > This causes ~18 tests to fail.
> > >
> > > Can be reproduced with following simple test:
> > > int main()
> > > {
> > >   long c[4];
> > >   #pragma omp target map(c)
> > >     c[0] = 0;
> > >   return 0;
> > > }
> > >
> > > Compiling with -O2 -fopenmp -foffload=nvptx-none results in:
> > > lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported
> > > (mode 'OI') compilation terminated.
> > > nvptx mkoffload: fatal error:
> > > ../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc
> > returned 1 exit status compilation terminated.
> > >
> > > This happens with AArch64 host because, it uses OImode (256-bit
> > > integer mode) for ARRAY_TYPE (long c[4] fits 256-bits), which isn't
> > > supported on nvptx. This decision is made during layout_type for
> > 'c', which calls mode_for_array, and mode_for_array uses target hooks
> > array_mode and array_mode_supported_p to determine target-specific
> > modes to use for ARRAY_TYPE.
> > > For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE.
> > >
> > > I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls
> > > back to using BLKmode for ARRAY_TYPE if offloading is enabled, and
> > > avoids streaming-out target-specific int modes in
> > > lto_write_mode_table. I used default_scalar_mode_supported_p check
> > to test if the int_mode is "generic", but not sure if that's entirely
> > correct. The test compiles and runs OK with patch applied. I suppose a
> > more general solution would be to somehow "intersect" available
> > AArch64 modes with nvptx modes, and use those for offloading ?
> >
> > Hmm, I think we shouldn't stream modes for aggregates but instead let
> > stor-layout re-assign them.  They are not usually expected to match.
> > Not sure if we should do that generally or only for offload streaming
> > though (it does have an overhead to recompute the mode).  We could
> > stream VOIDmode here as indicator it needs recomputation (splitting
> > out sub-workers from layout_type, there's one already for record
> > type).
> Thanks, this indeed sounds better. I will try to work on patch following this 
> approach.
>
> Thanks,
> Prathamesh
> >
> > > With local workarounds for the above 4 issues, running make check-
> > target-libgomp shows following results:
> > >
> > >                 === libgomp Summary ===
> > >
> > > # of expected passes            16604
> > > # of unexpected failures        10
> > > # of expected failures          309
> > > # of untested testcases         54
> > > # of unresolved testcases       3
> > > # of unsupported tests          643
> > >
> > > The remaining issues are:
> > >
> > > (5) "error: alias definitions not supported in this configuration"
> > > Fails for pr96390.c, and pr96390.C. This seems to be related to
> > > https://gcc.gnu.org/PR97102
> > >
> > > (6) Execution Failures:
> > > - libgomp/pr104783.c
> > > - libgomp/pr104783-2.c
> > > I haven't investigated these yet.
> > >
> > > (7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-
> > kernels-1.c and following excess errors:
> > > acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop
> > > parallelism
> > > acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop
> > > parallelism
> > > acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop
> > > parallelism
> > >
> > > So far, I have only been testing make check-target-libgomp. Should I
> > be testing any additional parts of the testsuite for offloading
> > changes ?
> > >
> > > My initial goals are:
> > > (a) To get AArch64/nvptx offloading to work for above minimal test.
> > > (b) Testsuite results for libgomp on par with x86_64 for non VLA
> > code (as far as correctness is concerned).
> > > (c) After (a) and (b) are in place, try to enable support for
> > offloading with VLA/SVE.
> > >
> > > I am planning to address these issues and will post patches for the
> > same shortly. I will be grateful for any feedback or suggestions on
> > how to proceed forward.
> > >
> > > Thanks,
> > > Prathamesh

Re: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Reply via email to