Hi Richard,
Thanks for the detailed comments; I'll be working on addressing them.
Below are answers to some of your questions:
On 29/09/16 11:45, Richard Biener wrote:
+
+ /* If we're inserting a non-bytesized width or not at a byte boundary
+ use an intermediate wide_int to perform the bit-insertion correctly. */
+ if (sub_byte_op_p
+ || (TREE_CODE (expr) == INTEGER_CST
+ && mode_for_size (bitlen, MODE_INT, 0) == BLKmode))
I wonder when we have BLKmode here ...
+ {
+ unsigned int byte_size = last_byte - first_byte + 1;
+
+ /* The functions native_encode_expr/native_interpret_expr use the
+ TYPE_MODE of the type to determine the number of bytes to write/read,
+ so if we want to process a number of bytes that does not have a
+ TYPE_MODE of equal size we need to use a type that has a valid mode
+ for it. */
+
+ machine_mode mode
+ = smallest_mode_for_size (byte_size * BITS_PER_UNIT, MODE_INT);
+ tree dest_int_type
+ = build_nonstandard_integer_type (GET_MODE_BITSIZE (mode), UNSIGNED);
+ byte_size = GET_MODE_SIZE (mode);
... how we ever get non-BLKmode here.
smallest_mode_for_size is guaranteed to never return BLKmode.
It returns a mode that contains at least the requested number of bits.
mode_for_size returns BLKmode if no mode fits the exact number of bits.
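To illustrate the difference for anyone following along, here is a standalone
toy model of that contract (just an illustration with made-up analogue
functions, not the GCC code itself):

    /* Toy model only: an exact-size lookup can fail (the BLKmode case),
       while the at-least-size lookup always rounds up to the next
       available integer mode.  */
    #include <cstdio>

    static const int int_mode_bits[] = { 8, 16, 32, 64, 128 };

    /* Analogue of mode_for_size: exact match or -1 ("BLKmode").  */
    static int exact_mode (int bits)
    {
      for (int b : int_mode_bits)
        if (b == bits)
          return b;
      return -1;
    }

    /* Analogue of smallest_mode_for_size: smallest mode of >= BITS bits.  */
    static int at_least_mode (int bits)
    {
      for (int b : int_mode_bits)
        if (b >= bits)
          return b;
      return -1;
    }

    int main ()
    {
      /* 24 bits: exact lookup fails (-1), at-least lookup gives 32.  */
      printf ("%d %d\n", exact_mode (24), at_least_mode (24));
      /* 32 bits: both give 32.  */
      printf ("%d %d\n", exact_mode (32), at_least_mode (32));
    }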
+ /* The region from the byte array that we're inserting into. */
+ tree ptr_wide_int
+ = native_interpret_expr (dest_int_type, ptr + first_byte,
+ total_bytes);
+
+ gcc_assert (ptr_wide_int);
+ wide_int dest_wide_int
+ = wi::to_wide (ptr_wide_int, TYPE_PRECISION (dest_int_type));
+ wide_int expr_wide_int
+ = wi::to_wide (tmp_int, byte_size * BITS_PER_UNIT);
+ if (BYTES_BIG_ENDIAN)
+ {
+ unsigned int insert_pos
+ = byte_size * BITS_PER_UNIT - bitlen - (bitpos % BITS_PER_UNIT);
+ dest_wide_int
+ = wi::insert (dest_wide_int, expr_wide_int, insert_pos, bitlen);
+ }
+ else
+ dest_wide_int = wi::insert (dest_wide_int, expr_wide_int,
+ bitpos % BITS_PER_UNIT, bitlen);
+
+ tree res = wide_int_to_tree (dest_int_type, dest_wide_int);
+ native_encode_expr (res, ptr + first_byte, total_bytes, 0);
+
OTOH this whole dance looks as complicated and way more expensive than
using native_encode_expr into a temporary buffer and then a
manually implemented "bit-merging" of it at ptr + first_byte + bitpos.
AFAICS that operation is even endianness agnostic.
I did try implementing that, but it kept blowing up in big-endian.
I found it to be very fiddly. I can try again, but we'll see how it goes...
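For reference, the kind of helper I mean looks roughly like this (a minimal
untested sketch; the name merge_bits is made up, and it assumes bits are
numbered from the least significant bit of each buffer byte, which is exactly
the assumption that doesn't line up with the byte order native_encode_expr
produces on big-endian):

    /* Merge BITLEN bits taken from the start of SRC into DST at bit
       position BITPOS.  Sketch only: bit 0 is taken to be the LSB of
       byte 0, which is the part that needs care on big-endian.  */
    static void
    merge_bits (unsigned char *dst, const unsigned char *src,
                unsigned int bitpos, unsigned int bitlen)
    {
      for (unsigned int i = 0; i < bitlen; i++)
        {
          unsigned int bit = (src[i / 8] >> (i % 8)) & 1;
          unsigned int d = bitpos + i;
          dst[d / 8] = (dst[d / 8] & ~(1u << (d % 8))) | (bit << (d % 8));
        }
    }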
<snip>
+
+bool
+pass_store_merging::terminate_and_process_all_chains (basic_block bb)
+{
+ hash_map<tree_operand_hash, struct imm_store_chain_info *>::iterator iter
+ = m_stores.begin ();
+ bool ret = false;
+ for (; iter != m_stores.end (); ++iter)
+ ret |= terminate_and_release_chain ((*iter).first);
+
+ if (ret)
+ gimple_purge_dead_eh_edges (bb);
Why do you need this? I don't see how merging stores should affect EH
edges at all -- unless you are removing EH side-effects (ISTR you
are not merging cross-BB).
I was seeing a testsuite failure in the C++ testsuite,
an ICE about EH edges. However, when I re-ran the testsuite today
without the gimple_purge_dead_eh_edges call I didn't see any fallout...
I have been rebasing the patch against trunk regularly, so maybe something
changed in the underlying trunk since then...
<snip>
+
+bool
+imm_store_chain_info::coalesce_immediate_stores ()
+{
+ /* Anything less can't be processed. */
+ if (m_store_info.length () < 2)
+ return false;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Attempting to coalesce %u stores in chain.\n",
+ m_store_info.length ());
+
+ store_immediate_info *info;
+ unsigned int i;
+
+ /* Order the stores by the bitposition they write to. */
+ m_store_info.qsort (sort_by_bitpos);
+
+ info = m_store_info[0];
+ merged_store_group *merged_store = new merged_store_group (info);
+ m_merged_store_groups.safe_push (merged_store);
+
+ FOR_EACH_VEC_ELT (m_store_info, i, info)
+ {
+ if (dump_file)
+ {
+ fprintf (dump_file, "Store %u:\nbitsize:" HOST_WIDE_INT_PRINT_DEC
+ " bitpos:" HOST_WIDE_INT_PRINT_DEC " val:\n",
+ i, info->bitsize, info->bitpos);
+ print_generic_expr (dump_file, info->val, 0);
+ fprintf (dump_file, "\n------------\n");
+ }
+
+ if (i == 0)
+ continue;
+
+ /* |---store 1---|
+ |---store 2---|
+ Overlapping stores. */
+ unsigned HOST_WIDE_INT start = info->bitpos;
+ if (IN_RANGE (start, merged_store->start,
+ merged_store->start + merged_store->width - 1))
+ {
+ merged_store->merge_overlapping (info);
+ continue;
+ }
+
+ /* |---store 1---| <gap> |---store 2---|.
+ Gap between stores. Start a new group. */
+ if (start != merged_store->start + merged_store->width)
+ {
+ merged_store = new merged_store_group (info);
+ m_merged_store_groups.safe_push (merged_store);
So this will create merged_store_group even for single-store groups
(all stores have gaps between them)? Can't we optimize this maybe
common case to not perform the memory allocation? Like handle
this case first and lazily allocate a new group. Looks like
apply_stores also doesn't bail out early for single-element groups.
Yeah, we can fast-path single-store groups.
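Untested sketch of the shape I have in mind, reusing the names from the patch
above (the actual merging of INFO into the group stays as in the patch):

    /* Keep the first store of a prospective group as a plain
       store_immediate_info; only allocate the merged_store_group once a
       second store is known to overlap or abut it, so singleton groups
       never cost an allocation.  */
    store_immediate_info *pending = m_store_info[0];
    merged_store_group *merged_store = NULL;

    FOR_EACH_VEC_ELT (m_store_info, i, info)
      {
        if (i == 0)
          continue;

        unsigned HOST_WIDE_INT group_start
          = merged_store ? merged_store->start : pending->bitpos;
        unsigned HOST_WIDE_INT group_end
          = group_start + (merged_store ? merged_store->width
                                        : pending->bitsize);

        if (info->bitpos > group_end)
          {
            /* Gap: the previous store stays a singleton.  */
            pending = info;
            merged_store = NULL;
            continue;
          }

        if (!merged_store)
          {
            merged_store = new merged_store_group (pending);
            m_merged_store_groups.safe_push (merged_store);
          }
        /* ... merge INFO into MERGED_STORE as in the patch above ...  */
      }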
+
+ num_stmts++;
+
+ try_pos += try_size / BITS_PER_UNIT;
+
+ wi_offset += try_size;
+ size -= try_size;
+ align = try_size;
+ while (size < try_size)
+ try_size /= 2;
+
+ if (num_stmts >= orig_num_stmts)
+ {
Hmm. I think that if you'd end up with an unchanged stmt simply leave it
there. I don't see how you'd ever end up with more stmts than in the
original source?
We never end up with more statements, so this could
be if (num_stmts == orig_num_stmts) to catch the case
where we can't improve on the original statement count.
+ return false;
+
+ return code != VOID_CST;
Huh, do you really hit this?
Not really, I just looked through tree.def to see which tree codes
were tcc_constant and removed those that are handled by native_encode_expr.
I can remove this.
+ || !rhs_valid_for_store_merging_p (rhs)
+ || TREE_THIS_VOLATILE (base_addr);
TREE_THIS_VOLATILE is handled above already.
I found that it isn't, and that caused various issues.
While the stmt didn't trigger gimple_has_volatile_ops, the result of
get_inner_reference could still be TREE_THIS_VOLATILE. Not doing this check
caused havoc later on (duplicate entries in the hash table for the same
tree, because apparently volatile trees don't hash to the same location).
Thanks,
Kyrill