On Thu, 29 Sep 2016, Kyrill Tkachov wrote:

> Hi Richard,
> 
> Thanks for the detailed comments, I'll be working on addressing them.
> Below are answers to some of your questions:
> 
> On 29/09/16 11:45, Richard Biener wrote:
> > 
> > > +
> > > +  /* If we're inserting a non-byte-sized width or not at a byte
> > > +     boundary use an intermediate wide_int to perform the
> > > +     bit-insertion correctly.  */
> > > +  if (sub_byte_op_p
> > > +      || (TREE_CODE (expr) == INTEGER_CST
> > > +          && mode_for_size (bitlen, MODE_INT, 0) == BLKmode))
> > I wonder when we have BLKmode here ...
> > 
> > > +    {
> > > +      unsigned int byte_size = last_byte - first_byte + 1;
> > > +
> > > +      /* The functions native_encode_expr/native_interpret_expr use the
> > > +         TYPE_MODE of the type to determine the number of bytes to
> > > +         write/read, so if we want to process a number of bytes that
> > > +         does not have a TYPE_MODE of equal size we need to use a type
> > > +         that has a valid mode for it.  */
> > > +
> > > +      machine_mode mode
> > > + = smallest_mode_for_size (byte_size * BITS_PER_UNIT, MODE_INT);
> > > +      tree dest_int_type
> > > + = build_nonstandard_integer_type (GET_MODE_BITSIZE (mode), UNSIGNED);
> > > +      byte_size = GET_MODE_SIZE (mode);
> > ... how we ever get non-BLKmode here.
> 
> smallest_mode_for_size is guaranteed to never return BLKmode.
> It returns a mode that contains at least the requested number of bits.
> mode_for_size returns BLKmode if no mode fits the exact number of bits.
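> 
> For illustration (a sketch, not from the patch; assuming a target
> with the usual 8/16/32/64-bit integer modes, where no integer mode
> is exactly 24 bits wide):
> 
>   machine_mode m1 = mode_for_size (24, MODE_INT, 0);       /* BLKmode */
>   machine_mode m2 = smallest_mode_for_size (24, MODE_INT); /* SImode  */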

But then it's really about GET_MODE_SIZE (TYPE_MODE (expr)) vs.
TYPE_PRECISION (expr).  I see no need to invoke [smallest_]mode_for_size.
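
Concretely, something like this (an untested sketch; whether you want
GET_MODE_BITSIZE or GET_MODE_SIZE * BITS_PER_UNIT is a detail):

  /* The constant needs the wide_int path only when its precision
     does not fill all of its mode's bits.  */
  tree type = TREE_TYPE (expr);
  bool need_wide_int_p
    = GET_MODE_BITSIZE (TYPE_MODE (type)) != TYPE_PRECISION (type);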

> > 
> > > +      /* The region from the byte array that we're inserting into.  */
> > > +      tree ptr_wide_int
> > > + = native_interpret_expr (dest_int_type, ptr + first_byte,
> > > +                          total_bytes);
> > > +
> > > +      gcc_assert (ptr_wide_int);
> > > +      wide_int dest_wide_int
> > > + = wi::to_wide (ptr_wide_int, TYPE_PRECISION (dest_int_type));
> > > +      wide_int expr_wide_int
> > > + = wi::to_wide (tmp_int, byte_size * BITS_PER_UNIT);
> > > +      if (BYTES_BIG_ENDIAN)
> > > + {
> > > +   unsigned int insert_pos
> > > +     = byte_size * BITS_PER_UNIT - bitlen - (bitpos % BITS_PER_UNIT);
> > > +   dest_wide_int
> > > +     = wi::insert (dest_wide_int, expr_wide_int, insert_pos, bitlen);
> > > + }
> > > +      else
> > > + dest_wide_int = wi::insert (dest_wide_int, expr_wide_int,
> > > +                             bitpos % BITS_PER_UNIT, bitlen);
> > > +
> > > +      tree res = wide_int_to_tree (dest_int_type, dest_wide_int);
> > > +      native_encode_expr (res, ptr + first_byte, total_bytes, 0);
> > > +
> > OTOH this whole dance looks as complicated and way more expensive than
> > using native_encode_expr into a temporary buffer and then a manually
> > implemented "bit-merging" of it at ptr + first_byte + bitpos.
> > AFAICS that operation is even endianness-agnostic.
> 
> I did try implementing that, but it kept blowing up in big-endian.

I'm really curious how ;)  And I'm willing to help debug in case you
have something basic working.  I suggest working at byte granularity
here (bytes have no endianness!).
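
Roughly the shape I have in mind (an untested sketch of just the
bit-merging step; taking bit 0 to be the least significant bit of
byte 0 is an assumption here, not something the patch fixes):

  /* Copy the low BITLEN bits of the encoded buffer SRC into DST,
     starting at bit offset BITPOS, one bit at a time.  Only whole
     bytes are ever addressed, so no byte-swapping is needed.  */
  static void
  merge_bits (unsigned char *dst, const unsigned char *src,
              unsigned int bitpos, unsigned int bitlen)
  {
    for (unsigned int i = 0; i < bitlen; i++)
      {
        unsigned int d = bitpos + i;
        unsigned char bit = (src[i / 8] >> (i % 8)) & 1;
        dst[d / 8] = (dst[d / 8] & ~(1 << (d % 8))) | (bit << (d % 8));
      }
  }

A production version would of course work byte- or word-wise with
masks rather than bit-by-bit.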

> I found it to be very fiddly. I can try again, but we'll see how it goes...
> 
> <snip>
> 
> > > +
> > > +bool
> > > +pass_store_merging::terminate_and_process_all_chains (basic_block bb)
> > > +{
> > > +  hash_map<tree_operand_hash, struct imm_store_chain_info *>::iterator iter
> > > +    = m_stores.begin ();
> > > +  bool ret = false;
> > > +  for (; iter != m_stores.end (); ++iter)
> > > +    ret |= terminate_and_release_chain ((*iter).first);
> > > +
> > > +  if (ret)
> > > +    gimple_purge_dead_eh_edges (bb);
> > Why do you need this?  I don't see how merging stores should affect EH
> > edges at all -- unless you are removing EH side-effects (ISTR you
> > are not merging cross-BB).
> 
> I was seeing a testsuite failure in the C++ testsuite,
> an ICE about EH edges.  However, when I reran the testsuite today
> without the gimple_purge_dead_eh_edges call I didn't see any fallout...
> I have been rebasing the patch against trunk regularly, so maybe
> something changed in the underlying trunk since then...

I'd doubt that.  I don't see any check for whether a store may throw
internally, so that missing check certainly would explain things.
You'd need -fnon-call-exceptions, and you'd also need a setup that is
not easy to achieve, where one store of a group is marked as not
throwing and another as throwing.

I'd say add a stmt_can_throw_internal check, mark such stores as
invalid, and remove the edge purging.  That should do the trick.
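
Sketched (stmt_can_throw_internal is the existing predicate; the
surrounding names are only illustrative):

  /* A store that may throw internally would leave EH edges to clean
     up; refuse to record it instead of purging edges afterwards.  */
  if (stmt_can_throw_internal (stmt))
    invalid = true;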

> <snip>
> 
> > +
> > +bool
> > +imm_store_chain_info::coalesce_immediate_stores ()
> > +{
> > +  /* Anything less can't be processed.  */
> > +  if (m_store_info.length () < 2)
> > +    return false;
> > +
> > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > +    fprintf (dump_file, "Attempting to coalesce %u stores in chain.\n",
> > +        m_store_info.length ());
> > +
> > +  store_immediate_info *info;
> > +  unsigned int i;
> > +
> > +  /* Order the stores by the bitposition they write to.  */
> > +  m_store_info.qsort (sort_by_bitpos);
> > +
> > +  info = m_store_info[0];
> > +  merged_store_group *merged_store = new merged_store_group (info);
> > +  m_merged_store_groups.safe_push (merged_store);
> > +
> > +  FOR_EACH_VEC_ELT (m_store_info, i, info)
> > +    {
> > +      if (dump_file)
> > +   {
> > +     fprintf (dump_file, "Store %u:\nbitsize:" HOST_WIDE_INT_PRINT_DEC
> > +                         " bitpos:" HOST_WIDE_INT_PRINT_DEC " val:\n",
> > +              i, info->bitsize, info->bitpos);
> > +     print_generic_expr (dump_file, info->val, 0);
> > +     fprintf (dump_file, "\n------------\n");
> > +   }
> > +
> > +      if (i == 0)
> > +   continue;
> > +
> > +      /* |---store 1---|
> > +          |---store 2---|
> > +       Overlapping stores.  */
> > +      unsigned HOST_WIDE_INT start = info->bitpos;
> > +      if (IN_RANGE (start, merged_store->start,
> > +               merged_store->start + merged_store->width - 1))
> > +   {
> > +     merged_store->merge_overlapping (info);
> > +     continue;
> > +   }
> > +
> > +      /* |---store 1---| <gap> |---store 2---|.
> > +    Gap between stores.  Start a new group.  */
> > +      if (start != merged_store->start + merged_store->width)
> > +   {
> > +     merged_store = new merged_store_group (info);
> > +     m_merged_store_groups.safe_push (merged_store);
> > So this will create a merged_store_group even for single-store groups
> > (when all stores have gaps between them)?  Can't we optimize this
> > perhaps common case to not perform the memory allocation?  Like handle
> > this case first and lazily allocate a new group.  Looks like
> > apply_stores also doesn't bail out early for single-element groups.
> 
> Yeah, we can fast-path single-store groups.
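> 
> Something along these lines, maybe (a sketch of the lazy allocation
> only; merge_into is a hypothetical helper for the adjacent case and
> the overlap handling is elided):
> 
>   merged_store_group *merged_store = NULL;  /* Allocated lazily.  */
>   store_immediate_info *prev = m_store_info[0];
>   FOR_EACH_VEC_ELT (m_store_info, i, info)
>     {
>       if (i == 0)
>         continue;
>       if (info->bitpos == prev->bitpos + prev->bitsize)
>         {
>           /* Adjacent: only now pay for a group, covering both the
>              previous store and this one.  */
>           if (!merged_store)
>             {
>               merged_store = new merged_store_group (prev);
>               m_merged_store_groups.safe_push (merged_store);
>             }
>           merged_store->merge_into (info);
>         }
>       else
>         /* Gap: forget the run; an isolated store allocates nothing.  */
>         merged_store = NULL;
>       prev = info;
>     }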
> 
> > > +
> > > +      num_stmts++;
> > > +
> > > +      try_pos += try_size / BITS_PER_UNIT;
> > > +
> > > +      wi_offset += try_size;
> > > +      size -= try_size;
> > > +      align = try_size;
> > > +      while (size < try_size)
> > > + try_size /= 2;
> > > +
> > > +      if (num_stmts >= orig_num_stmts)
> > > + {
> > Hmm.  I think that if you'd end up with an unchanged stmt you should
> > simply leave it there.  I don't see how you'd ever end up with more
> > stmts than in the original source?
> 
> We don't ever end up with more statements; this could
> be if (num_stmts == orig_num_stmts) to catch the case
> where we can't improve on the original statement count.
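> 
> I.e. (modulo whatever cleanup sits inside the braces):
> 
>   if (num_stmts == orig_num_stmts)
>     return false;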
>
> > 
> > > +    return false;
> > > +
> > > +  return code != VOID_CST;
> > Huh, do you really hit this?
> 
> Not really; I just looked through tree.def to see which tree codes
> were tcc_constant and removed those that were handled by
> native_encode_expr.  I can remove this.
> 
> > > +                      || !rhs_valid_for_store_merging_p (rhs)
> > > +                      || TREE_THIS_VOLATILE (base_addr);
> > TREE_THIS_VOLATILE is handled above already.
> 
> I found that it isn't, and that caused various issues.
> While the stmt didn't trigger gimple_has_volatile_ops, the result of
> get_inner_reference could be TREE_THIS_VOLATILE.  Not doing this check
> caused havoc later on (duplicate entries in the hash table for the same
> tree, because apparently volatile trees don't hash to the same location).

Yes, you can have a MEM[&volatile-decl] that is actually not a volatile
access, but get_inner_reference will return volatile-decl as the base
(now marked TREE_THIS_VOLATILE).  The hashing will not consider those
equal.  I suppose this case doesn't matter enough to be worth "fixing"
in some way, but please add a comment above the TREE_THIS_VOLATILE
check explaining its reason.
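
Maybe something like this above the check (wording up to you):

  /* get_inner_reference may expose a TREE_THIS_VOLATILE base for an
     access that is not itself volatile (e.g. MEM[&volatile-decl]),
     and such bases do not hash as equal, so reject them.  */
  || TREE_THIS_VOLATILE (base_addr);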

Thanks,
Richard.
