On Mon, Jul 29, 2024 at 1:35 PM Prathamesh Kulkarni <prathame...@nvidia.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: Friday, July 26, 2024 6:51 PM
> > To: Prathamesh Kulkarni <prathame...@nvidia.com>
> > Cc: gcc@gcc.gnu.org
> > Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> > from AArch64
> >
> > External email: Use caution opening links or attachments
> >
> > On Thu, Jul 25, 2024 at 3:36 PM Prathamesh Kulkarni via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > > I am working on enabling offloading to nvptx from an AArch64 host. As
> > > mentioned on the wiki
> > > (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27),
> > > I ran make check-target-libgomp on an AArch64 host (with no GPU), with
> > > the following results:
> > >
> > >                 === libgomp Summary ===
> > >
> > > # of expected passes            14568
> > > # of unexpected failures        1023
> > > # of expected failures          309
> > > # of untested testcases         54
> > > # of unresolved testcases       992
> > > # of unsupported tests          644
> > >
> > > It seems the majority of the tests fail due to the following 4 issues:
> > >
> > > * Compiling a minimal test-case:
> > >
> > > int main()
> > > {
> > >   int x;
> > >   #pragma omp target map (to: x)
> > >   {
> > >     x = 0;
> > >   }
> > >   return x;
> > > }
> > >
> > > Compiling with -fopenmp -foffload=nvptx-none results in the following
> > > issues:
> > >
> > > (1) Differing values of NUM_POLY_INT_COEFFS between host and
> > > accelerator, which results in the following ICE:
> > >
> > > 0x1a6e0a7 pp_quoted_string
> > >     ../../gcc/gcc/pretty-print.cc:2277
> > > 0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
> > >     ../../gcc/gcc/pretty-print.cc:1634
> > > 0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
> > >     ../../gcc/gcc/diagnostic.cc:1612
> > > 0x1a4a727 diagnostic_impl
> > >     ../../gcc/gcc/diagnostic.cc:1775
> > > 0x1a4e20b fatal_error(unsigned int, char const*, ...)
> > >     ../../gcc/gcc/diagnostic.cc:2218
> > > 0xb3088f lto_input_mode_table(lto_file_decl_data*)
> > >     ../../gcc/gcc/lto-streamer-in.cc:2121
> > > 0x6f5cdf lto_file_finalize
> > >     ../../gcc/gcc/lto/lto-common.cc:2285
> > > 0x6f5cdf lto_create_files_from_ids
> > >     ../../gcc/gcc/lto/lto-common.cc:2309
> > > 0x6f5cdf lto_file_read
> > >     ../../gcc/gcc/lto/lto-common.cc:2364
> > > 0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
> > >     ../../gcc/gcc/lto/lto-common.cc:2812
> > > 0x6cfb93 lto_main()
> > >     ../../gcc/gcc/lto/lto.cc:658
> > >
> > > This is already tracked in https://gcc.gnu.org/PR96265 (and related PRs).
> > >
> > > Streaming out mode_table:
> > > mode = SI, mclass = 2, size = 4, prec = 32
> > > mode = DI, mclass = 2, size = 8, prec = 64
> > >
> > > Streaming in mode_table (in lto_input_mode_table):
> > > mclass = 2, size = 4, prec = 0
> > > (it then calculates the correct mode value by iterating over all modes
> > > of mclass, starting from the narrowest mode)
> > >
> > > The issue is that the value of prec is not streamed in correctly for
> > > SImode, as seen above. While streaming out from the AArch64 host it is
> > > 32, but while streaming in for nvptx it is 0. This happens because of
> > > the differing values of NUM_POLY_INT_COEFFS between the AArch64 and
> > > nvptx backends.
> > >
> > > Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values
> > > for the mode's size and precision are <4, 0> and <32, 0> respectively
> > > (streamed out in bp_pack_poly_value). Both zeros come from coeffs[1]
> > > of size and prec.
> > > While streaming in, however, NUM_POLY_INT_COEFFS is 1 for nvptx, so it
> > > incorrectly treats <4, 0> as the size and precision respectively, which
> > > is why the precision gets streamed in as 0 and we run into the above ICE.
> > >
> > > Supporting non-VLA code with offloading:
> > >
> > > In the general case, it's hard to support offloading for arbitrary
> > > poly_ints when NUM_POLY_INT_COEFFS differs between host and accelerator.
> > > For example, it's not possible to represent a degree-2 poly_int like
> > > 4 + 4x (as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1.
> > >
> > > However, IIUC, we can support offloading for the restricted set of
> > > poly_ints whose degree <= accel's NUM_POLY_INT_COEFFS, since they can
> > > be represented on the accelerator? For a hypothetical example, if host
> > > NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I
> > > suppose we could represent a degree-2 poly_int on the accelerator, but
> > > not a degree-3 poly_int like 3+4x+5x^2?
> > >
> > > Based on that, I have come up with the following approach in the
> > > attached "quick-and-dirty" patch (p-163-2.diff):
> > > Stream out the host NUM_POLY_INT_COEFFS, and while streaming in during
> > > lto1, compare it with the accelerator's NUM_POLY_INT_COEFFS as follows:
> > >
> > > Stream in host_num_poly_int_coeffs;
> > > if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) // NUM_POLY_INT_COEFFS represents accelerator's value here.
> > >   {
> > >     /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS words
> > >        from bitstream.  */
> > >   }
> > > else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS)
> > >   {
> > >     /* Unpack host_num_poly_int_coeffs words and zero out remaining
> > >        higher coeffs (similar to zero-extension).  */
> > >   }
> > > else
> > >   {
> > >     /* Unpack host_num_poly_int_coeffs words and ensure that degree of
> > >        streamed-out poly_int <= NUM_POLY_INT_COEFFS.  */
> > >   }
> > >
> > > For example, with host NUM_POLY_INT_COEFFS == 2 and accel
> > > NUM_POLY_INT_COEFFS == 1, this will allow streaming of "degree-1"
> > > poly_ints like 4+0x (which will degenerate to constant 4), but give an
> > > error for streaming a degree-2 poly_int like 4+4x.
> > >
> > > Following this approach, I am assuming we can support AArch64/nvptx
> > > offloading for non-VLA code, since the poly_ints used for representing
> > > various artefacts like mode_size, mode_precision, vector length etc.
> > > will be degree-1 poly_ints for scalar variables and fixed-length
> > > vectors (and thus degenerate to constants). With the patch applied, it
> > > proceeds forward from this point, but fails at a later stage (see below).
> > >
> > > Does this approach look reasonable for supporting offloading for
> > > non-VLA code?
> > > Are there any cases I may have overlooked, where offloading will still
> > > fail for non-VLA code due to the differing NUM_POLY_INT_COEFFS issue?
> >
> > I think I'd change how we stream POLY_INTs and make that "independent"
> > of NUM_POLY_INT_COEFFS in that I'd stream the effective number of coeffs
> > required - as you suggest, "trailing" zero coeffs do not need to be
> > represented.  We always stream coeff zero.
> >
> > On read-in, when that N is bigger than NUM_POLY_INT_COEFFS we have to
> > error, and we zero-fill extra elements (using the type of coeff zero).
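
To make that concrete, here is a small, self-contained sketch of the scheme
(a toy model only: write_poly/read_poly and the std::vector "stream" are
stand-ins made up for illustration, not the actual bp_pack_poly_value /
bp_unpack_poly_value bitpack routines):

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

/* Stand-in for the bit-packed LTO stream.  */
typedef std::vector<uint64_t> stream;

/* Writer: always stream coeff 0, drop trailing zero coeffs and emit the
   effective coefficient count up front.  */
template<unsigned N>
void
write_poly (stream &s, const uint64_t (&coeffs)[N])
{
  unsigned count = 1;
  for (unsigned i = 1; i < N; i++)
    if (coeffs[i] != 0)
      count = i + 1;
  s.push_back (count);
  for (unsigned i = 0; i < count; i++)
    s.push_back (coeffs[i]);
}

/* Reader: error if the streamed degree cannot be represented with this
   target's N coefficients, otherwise zero-fill the missing higher ones.  */
template<unsigned N>
void
read_poly (const stream &s, size_t &pos, uint64_t (&coeffs)[N])
{
  unsigned count = (unsigned) s[pos++];
  if (count > N)
    {
      fprintf (stderr, "poly_int with %u coefficients not representable\n", count);
      exit (1);
    }
  for (unsigned i = 0; i < N; i++)
    coeffs[i] = i < count ? s[pos++] : 0;
}

int
main ()
{
  /* Host (NUM_POLY_INT_COEFFS == 2) streams SImode's size and precision.  */
  uint64_t size[2] = { 4, 0 }, prec[2] = { 32, 0 };
  stream s;
  write_poly (s, size);
  write_poly (s, prec);

  /* Accelerator (NUM_POLY_INT_COEFFS == 1) reads them back unambiguously.  */
  size_t pos = 0;
  uint64_t asize[1], aprec[1];
  read_poly (s, pos, asize);
  read_poly (s, pos, aprec);
  printf ("size = %u, prec = %u\n", (unsigned) asize[0], (unsigned) aprec[0]);
  return 0;
}

With something like this the host and accelerator NUM_POLY_INT_COEFFS values
no longer have to agree; only poly_ints whose degree the accelerator cannot
represent are rejected at stream-in time.
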
> > > (2) nvptx mkoffload.cc passes -m64/-m32 to the host compiler if
> > > -foffload-abi=lp64/ilp32. After applying a workaround for the above
> > > assertion failure, it hits the following error:
> > >
> > > gcc: error: unrecognized command-line option '-m64'
> > > nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit status
> > > compilation terminated.
> > >
> > > This happens because nvptx/mkoffload.cc:compile_native passes
> > > -m64/-m32 to the host compiler depending on whether offload_abi is
> > > OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and the aarch64 backend doesn't
> > > recognize these options.
> > >
> > > I suppose a simple solution to check whether host_compiler supports a
> > > particular command-line option would be to create a dummy C file and
> > > check if the command "host_compiler <opt> dummy_file.c" returns a zero
> > > exit status. An alternative could be to check the exit status of
> > > "host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is
> > > fixed, but I am not sure if either is an ideal solution.
> >
> > I think an explicit configure check when building the offload compiler
> > would be cleaner, like simply adding an additional
> > AC_SUBST(host_no_multilibs) var computed from the
> > --enable-as-accelerator-for target triplet?
> >
> > > With workarounds for these 2 issues, the minimal test builds and runs
> > > successfully.
> > >
> > > (3) Assertion error in lto_read_decls during lto1:
> > > There are several failures (~350+) in the testsuite caused by the
> > > following assert in lto_read_decls:
> > > gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
> > >
> > > AFAIU, this seems to happen because of the presence of an LTO_null tag.
> > > The following workaround avoids hitting the assert, but I am not sure
> > > if it's the right fix:
> > >
> > >           t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
> > > +         if (t == NULL_TREE)
> > > +           continue;
> > >           gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
> > >
> > > (FWIW, this was reproducible with the above minimal test, but has
> > > seemingly gone away for it after updating the sources recently; it
> > > still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.)
> >
> > To me it looks like a discrepancy between host and target that wasn't
> > properly analyzed.  The assert makes sure we read the expected number of
> > tree nodes - what's the tag when you hit this?
> > Please open a bug.
> Sure, will do shortly.
>
> I checked with for-9.c, and it shows the following state on hitting the
> assert:
> reader_cache->nodes.length () = 2384, from+1 = 2385, tag: LTO_null
>
> From lto_read_decls:
>       unsigned from = data_in->reader_cache->nodes.length ();
>       /* Read and uniquify SCCs as in the input stream.  */
>       enum LTO_tags tag = streamer_read_record_start (&ib_main);
>       if (tag == LTO_tree_scc || tag == LTO_trees)
>
> In this case, tag is LTO_null, and it skips to the else part:
>           t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
>           gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
>
> and lto_input_tree_1 simply returns NULL_TREE if tag is LTO_null, without
> any additional streaming-in.
> So IIUC, data_in->reader_cache->nodes.length () won't change since no tree
> node is streamed in by lto_input_tree_1, and thus reader_cache->length ()
> remains equal to from (rather than from + 1), which triggers the assert?
Interesting.  The thing is that we don't expect LTO_null when streaming in
global decls and types - we wouldn't have streamed a NULL decl or type.  So
can you debug this on the writer side and see where we stream such?

Richard.

> > > (4) AArch64 uses OImode for a 256-bit array, which is not supported
> > > on nvptx:
> > > This causes ~18 tests to fail.
> > >
> > > It can be reproduced with the following simple test:
> > > int main()
> > > {
> > >   long c[4];
> > >   #pragma omp target map(c)
> > >     c[0] = 0;
> > >   return 0;
> > > }
> > >
> > > Compiling with -O2 -fopenmp -foffload=nvptx-none results in:
> > > lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported (mode 'OI')
> > > compilation terminated.
> > > nvptx mkoffload: fatal error:
> > > ../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
> > > compilation terminated.
> > >
> > > This happens with an AArch64 host because it uses OImode (256-bit
> > > integer mode) for the ARRAY_TYPE (long c[4] fits in 256 bits), which
> > > isn't supported on nvptx. This decision is made during layout_type for
> > > 'c', which calls mode_for_array, and mode_for_array uses the target
> > > hooks array_mode and array_mode_supported_p to determine
> > > target-specific modes to use for ARRAY_TYPE.
> > > For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE.
> > >
> > > I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls
> > > back to using BLKmode for ARRAY_TYPE if offloading is enabled, and
> > > avoids streaming out target-specific int modes in lto_write_mode_table.
> > > I used the default_scalar_mode_supported_p check to test whether the
> > > int_mode is "generic", but I am not sure if that's entirely correct.
> > > The test compiles and runs OK with the patch applied. I suppose a more
> > > general solution would be to somehow "intersect" the available AArch64
> > > modes with the nvptx modes, and use those for offloading?
> >
> > Hmm, I think we shouldn't stream modes for aggregates but instead let
> > stor-layout re-assign them.  They are not usually expected to match.
> > Not sure if we should do that generally or only for offload streaming
> > though (it does have an overhead to recompute the mode).  We could
> > stream VOIDmode here as an indicator that it needs recomputation
> > (splitting out sub-workers from layout_type; there's one already for
> > record types).
> Thanks, this indeed sounds better. I will try to work on a patch following
> this approach.
>
> Thanks,
> Prathamesh

A rough standalone sketch of that recompute-on-read idea is at the bottom of
this mail, FWIW.

> > > With local workarounds for the above 4 issues, running make
> > > check-target-libgomp shows the following results:
> > >
> > >                 === libgomp Summary ===
> > >
> > > # of expected passes            16604
> > > # of unexpected failures        10
> > > # of expected failures          309
> > > # of untested testcases         54
> > > # of unresolved testcases       3
> > > # of unsupported tests          643
> > >
> > > The remaining issues are:
> > >
> > > (5) "error: alias definitions not supported in this configuration"
> > > Fails for pr96390.c and pr96390.C. This seems to be related to
> > > https://gcc.gnu.org/PR97102
> > >
> > > (6) Execution failures:
> > > - libgomp/pr104783.c
> > > - libgomp/pr104783-2.c
> > > I haven't investigated these yet.
> > > (7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-kernels-1.c,
> > > along with the following excess errors:
> > > acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop parallelism
> > > acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop parallelism
> > > acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop parallelism
> > >
> > > So far, I have only been testing make check-target-libgomp. Should I
> > > be testing any additional parts of the testsuite for offloading
> > > changes?
> > >
> > > My initial goals are:
> > > (a) To get AArch64/nvptx offloading to work for the above minimal test.
> > > (b) Testsuite results for libgomp on par with x86_64 for non-VLA code
> > > (as far as correctness is concerned).
> > > (c) After (a) and (b) are in place, try to enable support for
> > > offloading with VLA/SVE.
> > >
> > > I am planning to address these issues and will post patches for them
> > > shortly. I would be grateful for any feedback or suggestions on how to
> > > proceed.
> > >
> > > Thanks,
> > > Prathamesh
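
P.S. As mentioned above under (4), here is a rough standalone sketch of the
"stream VOIDmode for aggregate types and let the reader recompute the mode"
idea. It is only a toy model of the shape of the change: toy_type,
recompute_aggregate_mode and the hard-coded mode table are made up for
illustration, and the real recomputation would go through stor-layout
(layout_type / mode_for_array) and the target hooks.

#include <cstdio>

/* Toy machine modes; OImode exists on the AArch64 host but not on nvptx.  */
enum machine_mode { VOIDmode, QImode, HImode, SImode, DImode, TImode, OImode, BLKmode };

struct toy_type
{
  bool aggregate;          /* ARRAY_TYPE, RECORD_TYPE, ...  */
  unsigned size;           /* size in bytes */
  machine_mode host_mode;  /* mode the host's layout_type picked */
};

/* Writer side: aggregates get VOIDmode in the stream, as a marker that the
   reader must redo mode selection itself.  */
static machine_mode
stream_out_mode (const toy_type &t)
{
  return t.aggregate ? VOIDmode : t.host_mode;
}

/* Reader side: stand-in for re-running (a split-out worker of) layout_type /
   mode_for_array with the accelerator's own limits; here the "limit" is just
   the widest integer mode the accelerator handles.  */
static machine_mode
recompute_aggregate_mode (unsigned size, unsigned widest_bytes)
{
  static const struct { machine_mode mode; unsigned bytes; } modes[] = {
    { QImode, 1 }, { HImode, 2 }, { SImode, 4 }, { DImode, 8 },
    { TImode, 16 }, { OImode, 32 },
  };
  for (const auto &m : modes)
    if (m.bytes == size && m.bytes <= widest_bytes)
      return m.mode;
  return BLKmode;
}

int
main ()
{
  /* long c[4] on the AArch64 host: 32 bytes, laid out with OImode.  */
  toy_type c = { true, 32, OImode };

  machine_mode streamed = stream_out_mode (c);   /* VOIDmode marker */
  machine_mode accel_mode = streamed == VOIDmode
    ? recompute_aggregate_mode (c.size, 16)      /* toy assumption for the accel limit */
    : streamed;

  printf ("accel mode: %s\n", accel_mode == BLKmode ? "BLKmode" : "an integer mode");
  return 0;
}

The point is just that the reader never trusts a host-assigned aggregate
mode: it either recomputes one that its own backend supports or falls back
to BLKmode.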