RE: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

Prathamesh Kulkarni Wed, 31 Jul 2024 07:58:53 -0700


> -----Original Message-----
> From: Prathamesh Kulkarni <prathame...@nvidia.com>
> Sent: Tuesday, July 30, 2024 4:44 PM
> To: Jakub Jelinek <ja...@redhat.com>; Richard Biener
> <rguent...@suse.de>
> Cc: Richard Sandiford <richard.sandif...@arm.com>; gcc-
> patc...@gcc.gnu.org
> Subject: RE: Support streaming of poly_int for offloading when it's
> degree <= accel's NUM_POLY_INT_COEFFS
> 
> External email: Use caution opening links or attachments
> 
> 
> > -----Original Message-----
> > From: Jakub Jelinek <ja...@redhat.com>
> > Sent: Tuesday, July 30, 2024 3:16 PM
> > To: Richard Biener <rguent...@suse.de>
> > Cc: Richard Sandiford <richard.sandif...@arm.com>; Prathamesh
> Kulkarni
> > <prathame...@nvidia.com>; gcc-patches@gcc.gnu.org
> > Subject: Re: Support streaming of poly_int for offloading when it's
> > degree <= accel's NUM_POLY_INT_COEFFS
> >
> > External email: Use caution opening links or attachments
> >
> >
> > On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote:
> > > Only "relevant" stuff should be streamed - the offload code and
> all
> > > trees refered to.
> >
> > Yeah.
> >
> > > > > I think all current issues are because of poly-* leaking in
> for
> > > > > cases where a non-poly would have worked fine, but I have not
> > had
> > > > > a look myself.
> > > >
> > > > One of the cases that Prathamesh mentions is streaming the mode
> > sizes.
> > > > Are those modes "offload target modes" or "host modes"?  It
> seems
> > > > like it shouldn't be an error for the host to have VLA modes per
> > se.
> > > > It's just that those modes can't be used in the host/offload
> > interface.
> > >
> > > There's a requirement that a mode mapping exists from the host to
> > > target enum machine_mode.  I don't remember exactly how we compute
> > > that mapping and whether streaming of some data (and thus poly-
> int)
> > > are part of this.
> >
> > During streaming out, the code records what machine modes are being
> > streamed (in streamer_mode_table).
> > For those modes (and their inner modes) then lto_write_mode_table
> > should stream a table with mode details like class, bits, size,
> inner
> > mode, nunits, real mode format if any, etc.
> > That table is then streamed in in the offloading compiler and it
> > attempts to find corresponding modes (and emits fatal_error if there
> > is no such mode; consider say x86_64 long double with XFmode being
> > used in offloading code which doesn't have XFmode support).
> > Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int
> > rather than int, this has been changed to use bp_pack_poly_value;
> but
> > that relies on the same number of coefficients for poly_int, which
> is
> > not the case when e.g. offloading aarch64 to gcn or nvptx.
> Indeed, for the minimal test:
> int main()
> {
>   int x;
>   #pragma omp target map (to: x)
>   {
>     x = 0;
>   }
>   return x;
> }
> 
> Streaming out mode_table from AArch64 shows:
> mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2, size
> = 8, prec = 64
> 
> While streaming-in for nvptx shows:
> mclass = 2, size = 4, prec = 0
> 
> The discrepancy happens because of differing value of
> NUM_POLY_INT_COEFFS between AArch64 and nvptx.
> From AArch64 it streams out size and prec as <4, 0> and <32, 0>
> respectively, where 0 comes from coeffs[1].
> While streaming-in from nvptx, since NUM_POLY_INT_COEFFS is 1, it
> incorrectly reads size as 4, and prec as 0.
> >
> > From what I can see, this mode table handling are the only uses of
> > bp_pack_poly_value.  So the options are either to stream at the
> start
> > of the mode table the NUM_POLY_INT_COEFFS value and in
> > bp_unpack_poly_value pass to it what we've read and fill in any
> > remaining coeffs with zeros, or in each bp_pack_poly_value stream
> the
> > number of coefficients and then stream that back in and fill in
> > remaining ones (and diagnose if it would try to read non-zero
> > coefficient which isn't stored).
> This is the approach taken in proposed patch (stream-out degree of
> poly_int followed by coeffs).
> 
> > I think streaming NUM_POLY_INT_COEFFS once would be more compact (at
> > least for non-aarch64/riscv targets).
> I will try implementing this, thanks.
Hi,
The attached patch streams-out NUM_POLY_INT_COEFFS only once at beginning of 
mode_table, which should make LTO bytecode more compact
for non VLA hosts. And changes streaming-in of poly_int as follows:


if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
{
  for (i = 0; i < host_num_poly_int_coeffs; i++)
    poly_int.coeffs[i] = stream_in coeff;

  /* Set remaining coeffs to zero (like zero-extension).  */
  for (; i < NUM_POLY_INT_COEFFS; i++)
    poly_int.coeffs[i] = 0;
}
else
{
  for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
    poly_int.coeffs[i] = stream_in coeff;

  /* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS.  */
  for (; i < host_num_poly_int_coeffs; i++)
    {
      val = stream_in coeff;
      if (val != 0)
        error ();
    }
}

There are a couple of issues in the patch:
(1) The patch streams out NUM_POLY_INT_COEFFS at beginning of mode_table, which 
should work for bp_unpack_poly_value,
(since AFAIK, it's only called by lto_input_mode_table). However, I am not sure 
if we will always call lto_input_mode_table
before streaming in poly_int64 / poly_uint64 ? Or should we stream out host 
NUM_POLY_INT_COEFFS at a different place in LTO bytecode ?

(2) The patch defines POLY_INT_READ_COMMON macro for factoring out common code 
to read poly_int, however, I am not sure
how to define a callback for different streaming functions like 
streamer_read_[u]hwi, bp_unpack value since they have different
signatures. The patch uses an (ugly) kludge streamer_read_coeff, which is 
essentially a call to streaming-in function.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

Thanks,
Prathamesh
> 
> Thanks,
> Prathamesh
> >
> >         Jakub

Partially support streaming of poly_int for offloading.

The patch streams out host NUM_POLY_INT_COEFFS, and changes
streaming in as follows:

if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
{
  for (i = 0; i < host_num_poly_int_coeffs; i++)
    poly_int.coeffs[i] = stream_in coeff;
  for (; i < NUM_POLY_INT_COEFFS; i++)
    poly_int.coeffs[i] = 0;
}
else
{
  for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
    poly_int.coeffs[i] = stream_in coeff;

  /* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS.  */ 
  for (; i < host_num_poly_int_coeffs; i++)
    {
      val = stream_in coeff;
      if (val != 0)
        error ();
    }
}

gcc/ChangeLog:
        PR ipa/96265
        PR ipa/111937
        * data-streamer-in.cc (streamer_read_poly_uint64): Remove code for
        streaming, and call POLY_INT_READ_COMMON instead.
        (streamer_read_poly_int64): Likewise.
        * data-streamer.cc (host_num_poly_int_coeffs): New variable.
        * data-streamer.h (host_num_poly_int_coeffs): Declare.
        (POLY_INT_READ_COMMON): New macro.
        (bp_unpack_poly_value): Remove code for streaming and call
        POLY_INT_READ_COMMON instead.
        * lto-streamer-in.cc (lto_input_mode_table): Stream-in host
        NUM_POLY_INT_COEFFS into host_num_poly_int_coeffs.
        * lto-streamer-out.cc (lto_write_mode_table): Stream out
        NUM_POLY_INT_COEFFS.
        * poly-int.h (MAX_NUM_POLY_INT_COEFFS_BITS): New macro.
        * tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Adjust
        streaming-in of poly_int.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/data-streamer-in.cc b/gcc/data-streamer-in.cc
index 7dce2928ef0..e18c6462316 100644
--- a/gcc/data-streamer-in.cc
+++ b/gcc/data-streamer-in.cc
@@ -183,9 +183,7 @@ poly_uint64
 streamer_read_poly_uint64 (class lto_input_block *ib)
 {
   poly_uint64 res;
-  for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
-    res.coeffs[i] = streamer_read_uhwi (ib);
-  return res;
+  POLY_INT_READ_COMMON(res, streamer_read_uhwi (ib))
 }
 
 /* Read a poly_int64 from IB.  */
@@ -194,9 +192,7 @@ poly_int64
 streamer_read_poly_int64 (class lto_input_block *ib)
 {
   poly_int64 res;
-  for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
-    res.coeffs[i] = streamer_read_hwi (ib);
-  return res;
+  POLY_INT_READ_COMMON(res, streamer_read_hwi (ib))
 }
 
 /* Read gcov_type value from IB.  */
diff --git a/gcc/data-streamer.cc b/gcc/data-streamer.cc
index 346b294c72a..d2e9634d62f 100644
--- a/gcc/data-streamer.cc
+++ b/gcc/data-streamer.cc
@@ -28,6 +28,12 @@ along with GCC; see the file COPYING3.  If not see
 #include "cgraph.h"
 #include "data-streamer.h"
 
+/* While streaming-out, host NUM_POLY_INT_COEFFS is stored at beginning
+   of mode_table. While streaming-in, the value is read in
+   host_num_poly_int_coeffs.  */
+
+unsigned host_num_poly_int_coeffs;
+
 /* Pack WORK into BP in a variant of uleb format.  */
 
 void
diff --git a/gcc/data-streamer.h b/gcc/data-streamer.h
index 6a2596134ce..3b26075c79f 100644
--- a/gcc/data-streamer.h
+++ b/gcc/data-streamer.h
@@ -50,6 +50,7 @@ void bp_pack_real_value (struct bitpack_d *, const 
REAL_VALUE_TYPE *);
 void bp_unpack_real_value (struct bitpack_d *, REAL_VALUE_TYPE *);
 unsigned HOST_WIDE_INT bp_unpack_var_len_unsigned (struct bitpack_d *);
 HOST_WIDE_INT bp_unpack_var_len_int (struct bitpack_d *);
+extern unsigned host_num_poly_int_coeffs;
 
 /* In data-streamer-out.cc  */
 void streamer_write_zero (struct output_block *);
@@ -194,15 +195,51 @@ bp_unpack_value (struct bitpack_d *bp, unsigned nbits)
   return val & mask;
 }
 
+/* Common code for reading poly_int.
+   FIXME: streamer_read_coeff is an (ugly) kludge, it relies on the caller
+   passing a "function call" like bp_unpack_value (bp, nbits) or
+   streamer_read_uhwi (ib) which will read the next coeff from respective 
stream.
+   I am not sure if we could use a callback because streaming functions
+   streamer_read_[u]hwi, bp_unpack_value have different signatures.  */
+
+#define POLY_INT_READ_COMMON(x, streamer_read_coeff)                   \
+{                                                                      \
+  unsigned i;                                                          \
+                                                                       \
+  if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)                 \
+    {                                                                  \
+      for (i = 0; i < host_num_poly_int_coeffs; i++)                   \
+       x.coeffs[i] = streamer_read_coeff;                              \
+      for (; i < NUM_POLY_INT_COEFFS; i++)                             \
+       x.coeffs[i] = 0;                                                \
+    }                                                                  \
+  else                                                                 \
+    {                                                                  \
+      for (i = 0; i < NUM_POLY_INT_COEFFS; i++)                                
\
+       x.coeffs[i] = streamer_read_coeff;                              \
+                                                                       \
+      /* Ensure remaining coeffs are zero.  */                         \
+      for (; i < host_num_poly_int_coeffs; i++)                                
\
+       {                                                               \
+         __typeof(x.coeffs[0]) val = streamer_read_coeff;              \
+         if (val != 0)                                                 \
+           fatal_error (input_location,                                \
+                        "Degree of %<poly_int%> exceeds "              \
+                        "%<NUM_POLY_INT_COEFFS%> (%d)",                \
+                        NUM_POLY_INT_COEFFS);                          \
+       }                                                               \
+    }                                                                  \
+                                                                       \
+  return x;                                                            \
+}
+
 /* Unpacks a polynomial value from the bit-packing context BP in which each
    coefficient has NBITS bits.  */
 inline poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t>
 bp_unpack_poly_value (struct bitpack_d *bp, unsigned nbits)
 {
   poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t> x;
-  for (int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
-    x.coeffs[i] = bp_unpack_value (bp, nbits);
-  return x;
+  POLY_INT_READ_COMMON(x, bp_unpack_value (bp, nbits))
 }
 
 
diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 2e592be8082..3e2c786fc36 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -2013,6 +2013,9 @@ lto_input_mode_table (struct lto_file_decl_data 
*file_data)
                                header->string_size, vNULL);
   bitpack_d bp = streamer_read_bitpack (&ib);
 
+  host_num_poly_int_coeffs
+    = bp_unpack_value (&bp, MAX_NUM_POLY_INT_COEFFS_BITS);
+
   unsigned mode_bits = bp_unpack_value (&bp, 5);
   unsigned char *table = ggc_cleared_vec_alloc<unsigned char> (1 << mode_bits);
 
diff --git a/gcc/lto-streamer-out.cc b/gcc/lto-streamer-out.cc
index c329ac8af95..091e4126965 100644
--- a/gcc/lto-streamer-out.cc
+++ b/gcc/lto-streamer-out.cc
@@ -3192,6 +3192,8 @@ lto_write_mode_table (void)
   ob = create_output_block (LTO_section_mode_table);
   bitpack_d bp = bitpack_create (ob->main_stream);
 
+  bp_pack_value (&bp, NUM_POLY_INT_COEFFS, MAX_NUM_POLY_INT_COEFFS_BITS);
+
   /* Ensure that for GET_MODE_INNER (m) != m we have
      also the inner mode marked.  */
   for (int i = 0; i < (int) MAX_MACHINE_MODE; i++)
diff --git a/gcc/poly-int.h b/gcc/poly-int.h
index e3f8d4df716..8d3e6098f0b 100644
--- a/gcc/poly-int.h
+++ b/gcc/poly-int.h
@@ -354,6 +354,10 @@ struct poly_result<T1, T2, 2>
    ? (void) ((RES).coeffs[I] = VALUE) \
    : (void) ((RES).coeffs[I].~C (), new (&(RES).coeffs[I]) C (VALUE)))
 
+/* Number of bits needed to represent maximum value of
+   NUM_POLY_INT_COEFFS defined by any target.  */
+#define MAX_NUM_POLY_INT_COEFFS_BITS   (2)
+
 /* poly_int_full and poly_int_hungry are used internally within poly_int
    for delegated initializers.  poly_int_full indicates that a parameter
    pack has enough elements to initialize every coefficient.  poly_int_hungry
diff --git a/gcc/tree-streamer-in.cc b/gcc/tree-streamer-in.cc
index c248a74f7a1..c41803aa21e 100644
--- a/gcc/tree-streamer-in.cc
+++ b/gcc/tree-streamer-in.cc
@@ -671,8 +671,29 @@ static void
 lto_input_ts_poly_tree_pointers (class lto_input_block *ib,
                                 class data_in *data_in, tree expr)
 {
-  for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
-    POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+  unsigned i;
+  if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
+    {
+      for (i = 0; i < host_num_poly_int_coeffs; i++)
+       POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+
+      tree coeff_type = TREE_TYPE (POLY_INT_CST_COEFF (expr, 0));
+      for (; i < NUM_POLY_INT_COEFFS; i++)
+       POLY_INT_CST_COEFF (expr, i) = build_zero_cst (coeff_type);
+    }
+  else
+    {
+      for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
+       POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+      for (; i < host_num_poly_int_coeffs; i++)
+       {
+         tree val = stream_read_tree_ref (ib, data_in);
+         if (!integer_zerop (val))
+           fatal_error (input_location,
+                        "Degree of %<poly_int%> exceeds "
+                        "%<NUM_POLY_INT_COEFFS%>");
+       }
+    }
 }

RE: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

Reply via email to