> -----Original Message----- > From: Prathamesh Kulkarni <prathame...@nvidia.com> > Sent: Tuesday, July 30, 2024 4:44 PM > To: Jakub Jelinek <ja...@redhat.com>; Richard Biener > <rguent...@suse.de> > Cc: Richard Sandiford <richard.sandif...@arm.com>; gcc- > patc...@gcc.gnu.org > Subject: RE: Support streaming of poly_int for offloading when it's > degree <= accel's NUM_POLY_INT_COEFFS > > External email: Use caution opening links or attachments > > > > -----Original Message----- > > From: Jakub Jelinek <ja...@redhat.com> > > Sent: Tuesday, July 30, 2024 3:16 PM > > To: Richard Biener <rguent...@suse.de> > > Cc: Richard Sandiford <richard.sandif...@arm.com>; Prathamesh > Kulkarni > > <prathame...@nvidia.com>; gcc-patches@gcc.gnu.org > > Subject: Re: Support streaming of poly_int for offloading when it's > > degree <= accel's NUM_POLY_INT_COEFFS > > > > External email: Use caution opening links or attachments > > > > > > On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote: > > > Only "relevant" stuff should be streamed - the offload code and > all > > > trees refered to. > > > > Yeah. > > > > > > > I think all current issues are because of poly-* leaking in > for > > > > > cases where a non-poly would have worked fine, but I have not > > had > > > > > a look myself. > > > > > > > > One of the cases that Prathamesh mentions is streaming the mode > > sizes. > > > > Are those modes "offload target modes" or "host modes"? It > seems > > > > like it shouldn't be an error for the host to have VLA modes per > > se. > > > > It's just that those modes can't be used in the host/offload > > interface. > > > > > > There's a requirement that a mode mapping exists from the host to > > > target enum machine_mode. I don't remember exactly how we compute > > > that mapping and whether streaming of some data (and thus poly- > int) > > > are part of this. > > > > During streaming out, the code records what machine modes are being > > streamed (in streamer_mode_table). > > For those modes (and their inner modes) then lto_write_mode_table > > should stream a table with mode details like class, bits, size, > inner > > mode, nunits, real mode format if any, etc. > > That table is then streamed in in the offloading compiler and it > > attempts to find corresponding modes (and emits fatal_error if there > > is no such mode; consider say x86_64 long double with XFmode being > > used in offloading code which doesn't have XFmode support). > > Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int > > rather than int, this has been changed to use bp_pack_poly_value; > but > > that relies on the same number of coefficients for poly_int, which > is > > not the case when e.g. offloading aarch64 to gcn or nvptx. > Indeed, for the minimal test: > int main() > { > int x; > #pragma omp target map (to: x) > { > x = 0; > } > return x; > } > > Streaming out mode_table from AArch64 shows: > mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2, size > = 8, prec = 64 > > While streaming-in for nvptx shows: > mclass = 2, size = 4, prec = 0 > > The discrepancy happens because of differing value of > NUM_POLY_INT_COEFFS between AArch64 and nvptx. > From AArch64 it streams out size and prec as <4, 0> and <32, 0> > respectively, where 0 comes from coeffs[1]. > While streaming-in from nvptx, since NUM_POLY_INT_COEFFS is 1, it > incorrectly reads size as 4, and prec as 0. > > > > From what I can see, this mode table handling are the only uses of > > bp_pack_poly_value. So the options are either to stream at the > start > > of the mode table the NUM_POLY_INT_COEFFS value and in > > bp_unpack_poly_value pass to it what we've read and fill in any > > remaining coeffs with zeros, or in each bp_pack_poly_value stream > the > > number of coefficients and then stream that back in and fill in > > remaining ones (and diagnose if it would try to read non-zero > > coefficient which isn't stored). > This is the approach taken in proposed patch (stream-out degree of > poly_int followed by coeffs). > > > I think streaming NUM_POLY_INT_COEFFS once would be more compact (at > > least for non-aarch64/riscv targets). > I will try implementing this, thanks. Hi, The attached patch streams-out NUM_POLY_INT_COEFFS only once at beginning of mode_table, which should make LTO bytecode more compact for non VLA hosts. And changes streaming-in of poly_int as follows:
if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS) { for (i = 0; i < host_num_poly_int_coeffs; i++) poly_int.coeffs[i] = stream_in coeff; /* Set remaining coeffs to zero (like zero-extension). */ for (; i < NUM_POLY_INT_COEFFS; i++) poly_int.coeffs[i] = 0; } else { for (i = 0; i < NUM_POLY_INT_COEFFS; i++) poly_int.coeffs[i] = stream_in coeff; /* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS. */ for (; i < host_num_poly_int_coeffs; i++) { val = stream_in coeff; if (val != 0) error (); } } There are a couple of issues in the patch: (1) The patch streams out NUM_POLY_INT_COEFFS at beginning of mode_table, which should work for bp_unpack_poly_value, (since AFAIK, it's only called by lto_input_mode_table). However, I am not sure if we will always call lto_input_mode_table before streaming in poly_int64 / poly_uint64 ? Or should we stream out host NUM_POLY_INT_COEFFS at a different place in LTO bytecode ? (2) The patch defines POLY_INT_READ_COMMON macro for factoring out common code to read poly_int, however, I am not sure how to define a callback for different streaming functions like streamer_read_[u]hwi, bp_unpack value since they have different signatures. The patch uses an (ugly) kludge streamer_read_coeff, which is essentially a call to streaming-in function. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> Thanks, Prathamesh > > Thanks, > Prathamesh > > > > Jakub
Partially support streaming of poly_int for offloading. The patch streams out host NUM_POLY_INT_COEFFS, and changes streaming in as follows: if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS) { for (i = 0; i < host_num_poly_int_coeffs; i++) poly_int.coeffs[i] = stream_in coeff; for (; i < NUM_POLY_INT_COEFFS; i++) poly_int.coeffs[i] = 0; } else { for (i = 0; i < NUM_POLY_INT_COEFFS; i++) poly_int.coeffs[i] = stream_in coeff; /* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS. */ for (; i < host_num_poly_int_coeffs; i++) { val = stream_in coeff; if (val != 0) error (); } } gcc/ChangeLog: PR ipa/96265 PR ipa/111937 * data-streamer-in.cc (streamer_read_poly_uint64): Remove code for streaming, and call POLY_INT_READ_COMMON instead. (streamer_read_poly_int64): Likewise. * data-streamer.cc (host_num_poly_int_coeffs): New variable. * data-streamer.h (host_num_poly_int_coeffs): Declare. (POLY_INT_READ_COMMON): New macro. (bp_unpack_poly_value): Remove code for streaming and call POLY_INT_READ_COMMON instead. * lto-streamer-in.cc (lto_input_mode_table): Stream-in host NUM_POLY_INT_COEFFS into host_num_poly_int_coeffs. * lto-streamer-out.cc (lto_write_mode_table): Stream out NUM_POLY_INT_COEFFS. * poly-int.h (MAX_NUM_POLY_INT_COEFFS_BITS): New macro. * tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Adjust streaming-in of poly_int. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> diff --git a/gcc/data-streamer-in.cc b/gcc/data-streamer-in.cc index 7dce2928ef0..e18c6462316 100644 --- a/gcc/data-streamer-in.cc +++ b/gcc/data-streamer-in.cc @@ -183,9 +183,7 @@ poly_uint64 streamer_read_poly_uint64 (class lto_input_block *ib) { poly_uint64 res; - for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i) - res.coeffs[i] = streamer_read_uhwi (ib); - return res; + POLY_INT_READ_COMMON(res, streamer_read_uhwi (ib)) } /* Read a poly_int64 from IB. */ @@ -194,9 +192,7 @@ poly_int64 streamer_read_poly_int64 (class lto_input_block *ib) { poly_int64 res; - for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i) - res.coeffs[i] = streamer_read_hwi (ib); - return res; + POLY_INT_READ_COMMON(res, streamer_read_hwi (ib)) } /* Read gcov_type value from IB. */ diff --git a/gcc/data-streamer.cc b/gcc/data-streamer.cc index 346b294c72a..d2e9634d62f 100644 --- a/gcc/data-streamer.cc +++ b/gcc/data-streamer.cc @@ -28,6 +28,12 @@ along with GCC; see the file COPYING3. If not see #include "cgraph.h" #include "data-streamer.h" +/* While streaming-out, host NUM_POLY_INT_COEFFS is stored at beginning + of mode_table. While streaming-in, the value is read in + host_num_poly_int_coeffs. */ + +unsigned host_num_poly_int_coeffs; + /* Pack WORK into BP in a variant of uleb format. */ void diff --git a/gcc/data-streamer.h b/gcc/data-streamer.h index 6a2596134ce..3b26075c79f 100644 --- a/gcc/data-streamer.h +++ b/gcc/data-streamer.h @@ -50,6 +50,7 @@ void bp_pack_real_value (struct bitpack_d *, const REAL_VALUE_TYPE *); void bp_unpack_real_value (struct bitpack_d *, REAL_VALUE_TYPE *); unsigned HOST_WIDE_INT bp_unpack_var_len_unsigned (struct bitpack_d *); HOST_WIDE_INT bp_unpack_var_len_int (struct bitpack_d *); +extern unsigned host_num_poly_int_coeffs; /* In data-streamer-out.cc */ void streamer_write_zero (struct output_block *); @@ -194,15 +195,51 @@ bp_unpack_value (struct bitpack_d *bp, unsigned nbits) return val & mask; } +/* Common code for reading poly_int. + FIXME: streamer_read_coeff is an (ugly) kludge, it relies on the caller + passing a "function call" like bp_unpack_value (bp, nbits) or + streamer_read_uhwi (ib) which will read the next coeff from respective stream. + I am not sure if we could use a callback because streaming functions + streamer_read_[u]hwi, bp_unpack_value have different signatures. */ + +#define POLY_INT_READ_COMMON(x, streamer_read_coeff) \ +{ \ + unsigned i; \ + \ + if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS) \ + { \ + for (i = 0; i < host_num_poly_int_coeffs; i++) \ + x.coeffs[i] = streamer_read_coeff; \ + for (; i < NUM_POLY_INT_COEFFS; i++) \ + x.coeffs[i] = 0; \ + } \ + else \ + { \ + for (i = 0; i < NUM_POLY_INT_COEFFS; i++) \ + x.coeffs[i] = streamer_read_coeff; \ + \ + /* Ensure remaining coeffs are zero. */ \ + for (; i < host_num_poly_int_coeffs; i++) \ + { \ + __typeof(x.coeffs[0]) val = streamer_read_coeff; \ + if (val != 0) \ + fatal_error (input_location, \ + "Degree of %<poly_int%> exceeds " \ + "%<NUM_POLY_INT_COEFFS%> (%d)", \ + NUM_POLY_INT_COEFFS); \ + } \ + } \ + \ + return x; \ +} + /* Unpacks a polynomial value from the bit-packing context BP in which each coefficient has NBITS bits. */ inline poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t> bp_unpack_poly_value (struct bitpack_d *bp, unsigned nbits) { poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t> x; - for (int i = 0; i < NUM_POLY_INT_COEFFS; ++i) - x.coeffs[i] = bp_unpack_value (bp, nbits); - return x; + POLY_INT_READ_COMMON(x, bp_unpack_value (bp, nbits)) } diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc index 2e592be8082..3e2c786fc36 100644 --- a/gcc/lto-streamer-in.cc +++ b/gcc/lto-streamer-in.cc @@ -2013,6 +2013,9 @@ lto_input_mode_table (struct lto_file_decl_data *file_data) header->string_size, vNULL); bitpack_d bp = streamer_read_bitpack (&ib); + host_num_poly_int_coeffs + = bp_unpack_value (&bp, MAX_NUM_POLY_INT_COEFFS_BITS); + unsigned mode_bits = bp_unpack_value (&bp, 5); unsigned char *table = ggc_cleared_vec_alloc<unsigned char> (1 << mode_bits); diff --git a/gcc/lto-streamer-out.cc b/gcc/lto-streamer-out.cc index c329ac8af95..091e4126965 100644 --- a/gcc/lto-streamer-out.cc +++ b/gcc/lto-streamer-out.cc @@ -3192,6 +3192,8 @@ lto_write_mode_table (void) ob = create_output_block (LTO_section_mode_table); bitpack_d bp = bitpack_create (ob->main_stream); + bp_pack_value (&bp, NUM_POLY_INT_COEFFS, MAX_NUM_POLY_INT_COEFFS_BITS); + /* Ensure that for GET_MODE_INNER (m) != m we have also the inner mode marked. */ for (int i = 0; i < (int) MAX_MACHINE_MODE; i++) diff --git a/gcc/poly-int.h b/gcc/poly-int.h index e3f8d4df716..8d3e6098f0b 100644 --- a/gcc/poly-int.h +++ b/gcc/poly-int.h @@ -354,6 +354,10 @@ struct poly_result<T1, T2, 2> ? (void) ((RES).coeffs[I] = VALUE) \ : (void) ((RES).coeffs[I].~C (), new (&(RES).coeffs[I]) C (VALUE))) +/* Number of bits needed to represent maximum value of + NUM_POLY_INT_COEFFS defined by any target. */ +#define MAX_NUM_POLY_INT_COEFFS_BITS (2) + /* poly_int_full and poly_int_hungry are used internally within poly_int for delegated initializers. poly_int_full indicates that a parameter pack has enough elements to initialize every coefficient. poly_int_hungry diff --git a/gcc/tree-streamer-in.cc b/gcc/tree-streamer-in.cc index c248a74f7a1..c41803aa21e 100644 --- a/gcc/tree-streamer-in.cc +++ b/gcc/tree-streamer-in.cc @@ -671,8 +671,29 @@ static void lto_input_ts_poly_tree_pointers (class lto_input_block *ib, class data_in *data_in, tree expr) { - for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i) - POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in); + unsigned i; + if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS) + { + for (i = 0; i < host_num_poly_int_coeffs; i++) + POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in); + + tree coeff_type = TREE_TYPE (POLY_INT_CST_COEFF (expr, 0)); + for (; i < NUM_POLY_INT_COEFFS; i++) + POLY_INT_CST_COEFF (expr, i) = build_zero_cst (coeff_type); + } + else + { + for (i = 0; i < NUM_POLY_INT_COEFFS; i++) + POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in); + for (; i < host_num_poly_int_coeffs; i++) + { + tree val = stream_read_tree_ref (ib, data_in); + if (!integer_zerop (val)) + fatal_error (input_location, + "Degree of %<poly_int%> exceeds " + "%<NUM_POLY_INT_COEFFS%>"); + } + } }