On Thu, Jul 25, 2024 at 3:36 PM Prathamesh Kulkarni via Gcc <gcc@gcc.gnu.org> wrote: > > Hi, > I am working on enabling offloading to nvptx from AAarch64 host. As mentioned > on wiki (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27), > I ran make check-target-libgomp on AAarch64 host (and no GPU) with following > results: > > === libgomp Summary === > > # of expected passes 14568 > # of unexpected failures 1023 > # of expected failures 309 > # of untested testcases 54 > # of unresolved testcases 992 > # of unsupported tests 644 > > It seems majority of the tests fail due to the following 4 issues: > > * Compiling a minimal test-case: > > int main() > { > int x; > #pragma omp target map (to: x) > { > x = 0; > } > return x; > } > > Compiling with -fopenmp -foffload=nvptx-none results in following issues: > > (1) Differing values of NUM_POLY_INT_COEFFS between host and accelerator, > which results in following ICE: > > 0x1a6e0a7 pp_quoted_string > ../../gcc/gcc/pretty-print.cc:2277 > 0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*) > ../../gcc/gcc/pretty-print.cc:1634 > 0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*) > ../../gcc/gcc/diagnostic.cc:1612 > 0x1a4a727 diagnostic_impl > ../../gcc/gcc/diagnostic.cc:1775 > 0x1a4e20b fatal_error(unsigned int, char const*, ...) > ../../gcc/gcc/diagnostic.cc:2218 > 0xb3088f lto_input_mode_table(lto_file_decl_data*) > ../../gcc/gcc/lto-streamer-in.cc:2121 > 0x6f5cdf lto_file_finalize > ../../gcc/gcc/lto/lto-common.cc:2285 > 0x6f5cdf lto_create_files_from_ids > ../../gcc/gcc/lto/lto-common.cc:2309 > 0x6f5cdf lto_file_read > ../../gcc/gcc/lto/lto-common.cc:2364 > 0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**) > ../../gcc/gcc/lto/lto-common.cc:2812 > 0x6cfb93 lto_main() > ../../gcc/gcc/lto/lto.cc:658 > > This is already tracked in https://gcc.gnu.org/PR96265 (and related PR's) > > Streaming out mode_table: > mode = SI, mclass = 2, size = 4, prec = 32 > mode = DI, mclass = 2, size = 8, prec = 64 > > Streaming in mode_table (in lto_input_mode_table): > mclass = 2, size = 4, prec = 0 > (and then calculates the correct mode value by iterating over all modes of > mclass starting from narrowest mode) > > The issue is that the value for prec is not getting streamed-in correctly for > SImode as seen above. While streaming out from AArch64 host, > it is 32, but while streaming in for nvptx, it is 0. This happens because of > differing values of NUM_POLY_INT_COEFFS between AArch64 and nvptx backend. > > Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values for mode, > precision would be <4, 0> and <32, 0> > respectively (streamed-out in bp_pack_poly_value). Both zeros come from > coeffs[1] of size and prec. While streaming in however, > NUM_POLY_INT_COEFFS is 1 for nvptx, and thus it incorrectly treats <4, 0> as > size and precision respectively, which is why precision > gets streamed in as 0, and thus it encounters the above ICE. > > Supporting non VLA code with offloading: > > In the general case, it's hard to support offloading for arbitrary poly_ints > when NUM_POLY_INT_COEFFS differs for host and accelerator. > For example, it's not possible to represent a degree-2 poly_int like 4 + 4x > (as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1. > > However, IIUC, we can support offloading for restricted set of poly_ints > whose degree <= accel's NUM_POLY_INT_COEFFS, since they can be > represented on accelerator ? For a hypothetical example, if host > NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I suppose > we could represent a degree 2 poly_int on accelerator, but not a degree 3 > poly_int like 3+4x+5x^2 ? > > Based on that, I have come up with following approach in attached > "quick-and-dirty" patch (p-163-2.diff): > Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during lto1, > compare it with accelerator's NUM_POLY_INT_COEFFS as follows: > > Stream in host_num_poly_int_coeffs; > if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) // NUM_POLY_INT_COEFFS > represents accelerator's value here. > { > /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS words from > bitstream. */ > } > else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS) > { > /* Unpack host_num_poly_int_coeffs words and zero out remaining higher > coeffs (similar to zero-extension). */ > } > else > { > /* Unpack host_num_poly_int_coeffs words and ensure that degree of > streamed-out poly_int <= NUM_POLY_INT_COEFFS. */ > } > > For example, with host NUM_POLY_INT_COEFFS == 2 and accel NUM_POLY_INT_COEFFS > == 1, this will allow streaming of "degree-1" poly_ints > like 4+0x (which will degenerate to constant 4), but give an error for > streaming degree-2 poly_int like 4+4x. > > Following this approach, I am assuming we can support AArch64/nvptx > offloading for non VLA code, since poly_ints used for representing various > artefacts like mode_size, mode_precision, vector length etc. will be degree-1 > poly_int for scalar variables and fixed-length vectors > (and thus degenerate to constants). With the patch applied, it proceeds > forward from this point, but fails at a later stage (see below). > > Does this approach look reasonable for supporting offloading for non VLA code > ? > Are there any cases I may have overlooked, where offloading will still fail > for non-VLA code due to differing NUM_POLY_INT_COEFFS issue ?
I think I'd change how we stream POLY_INTs and make that "independent" of NUM_POLY_INT_COEFFS in that I'd stream the effective number of coeffs required - as you suggest "trailing" zero coeffs do not need to be represented. We always stream coeff zero. On read in when that N is bigger than NUM_POLY_INT_COEFFS we have to error and we zero-fill extra elements (using the type of coeff zero). > (2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if > -foffload-abi=lp64/ilp32 > After applying workaround for the above assertion failure, it hits the > following error: > gcc: error: unrecognized command-line option '-m64' > nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit status > compilation terminated. > > This happens because nvptx/mkoffload.cc:compile_native passes -m64/-m32 to > host compiler depending on whether > offload_abi is OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend > doesn't recognize these options. > > I suppose a simple solution to check if host_compiler supports a particular > command-line option, would be to create a dummy C file and > check if the command "host_compiler <opt> dummy_file.c" returns zero exit > status. Alternative could be to check exit status for > "host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is fixed, > but I am not sure if either is an ideal solution. I think an explicit configure when building the offload compiler would be cleaner, like simply adding an additional AC_SUBST(host_no_multilibs) var computed from the --enable-as-accelerator-for target triplet? > With workarounds for these 2 issues, the minimal test builds and runs > successfully. > > (3) Assertion error in lto_read_decls during lto1: > There are several failures (~350+) in the testsuite caused due to the > following > assert in lto_read_decls: > gcc_assert (data_in->reader_cache->nodes.length () == from + 1); > > AFAIU, this seems to happen because of presence of LTO_null tag. The > following workaround avoids hitting the assert, > but am not sure if it's the right fix: > > t = lto_input_tree_1 (&ib_main, data_in, tag, 0); > + if (t == NULL_TREE) > + continue; > gcc_assert (data_in->reader_cache->nodes.length () == from + 1); > > (FWIW, this was reproducible with the above minimal test, but has seemingly > gone away for it after updating the sources recently, > but still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.) It looks like a discrepancy between host and target that wasn't properly analyzed to me. The assert makes sure we read the expected number of tree nodes - what's the tag when you hit this? Please open a bug. > > (4) AAarch64 uses OImode for 256-bit size array, which is not supported on > nvptx: > This causes ~18 tests to fail. > > Can be reproduced with following simple test: > int main() > { > long c[4]; > #pragma omp target map(c) > c[0] = 0; > return 0; > } > > Compiling with -O2 -fopenmp -foffload=nvptx-none results in: > lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported (mode > 'OI') > compilation terminated. > nvptx mkoffload: fatal error: > ../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit > status > compilation terminated. > > This happens with AArch64 host because, it uses OImode (256-bit integer mode) > for ARRAY_TYPE (long c[4] fits 256-bits), which isn't > supported on nvptx. This decision is made during layout_type for 'c', which > calls mode_for_array, and mode_for_array uses target hooks > array_mode and array_mode_supported_p to determine target-specific modes to > use for ARRAY_TYPE. > For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE. > > I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls back to > using BLKmode for ARRAY_TYPE if offloading is enabled, > and avoids streaming-out target-specific int modes in lto_write_mode_table. I > used default_scalar_mode_supported_p check to test if > the int_mode is "generic", but not sure if that's entirely correct. The test > compiles and runs OK with patch applied. I suppose a more general > solution would be to somehow "intersect" available AArch64 modes with nvptx > modes, and use those for offloading ? Hmm, I think we shouldn't stream modes for aggregates but instead let stor-layout re-assign them. They are not usually expected to match. Not sure if we should do that generally or only for offload streaming though (it does have an overhead to recompute the mode). We could stream VOIDmode here as indicator it needs recomputation (splitting out sub-workers from layout_type, there's one already for record type). > With local workarounds for the above 4 issues, running make > check-target-libgomp shows following results: > > === libgomp Summary === > > # of expected passes 16604 > # of unexpected failures 10 > # of expected failures 309 > # of untested testcases 54 > # of unresolved testcases 3 > # of unsupported tests 643 > > The remaining issues are: > > (5) "error: alias definitions not supported in this configuration" > Fails for pr96390.c, and pr96390.C. This seems to be related to > https://gcc.gnu.org/PR97102 > > (6) Execution Failures: > - libgomp/pr104783.c > - libgomp/pr104783-2.c > I haven't investigated these yet. > > (7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-kernels-1.c > and following excess errors: > acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop parallelism > acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop parallelism > acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop parallelism > > So far, I have only been testing make check-target-libgomp. Should I be > testing any additional parts of the testsuite for offloading changes ? > > My initial goals are: > (a) To get AArch64/nvptx offloading to work for above minimal test. > (b) Testsuite results for libgomp on par with x86_64 for non VLA code (as far > as correctness is concerned). > (c) After (a) and (b) are in place, try to enable support for offloading with > VLA/SVE. > > I am planning to address these issues and will post patches for the same > shortly. I will be grateful for any feedback or suggestions on how to proceed > forward. > > Thanks, > Prathamesh