Re: [PATCH][v4] GIMPLE store merging pass

Richard Biener Thu, 29 Sep 2016 03:45:53 -0700

On Wed, 28 Sep 2016, Kyrill Tkachov wrote:

> Hi all,
> 
> This is v4 of the pass. It addresses feedback by Bernhard, including typo
> fixes and
> skipping of debug statements.
> Also, I've extended it to handle the case from PR 23684 and included that
> testcase
> in the patch.  Merging now triggers more often.
> I've also added purging of dead EH edges that was missing from the previous
> versions.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu, x86_64-unknown-linux-gnu,
> arm-none-linux-gnueabihf.
> Also tested on aarch64 big-endian.
> 
> I saw no regressions on my x86_64 machine on SPEC2006. I think the changes in
> individual benchmarks were
> in the noise though I think the x86_64 expanders could be improved to split
> expensive movabsq instructions
> into two movl ones (I think).
> 
> Bill, could you or someone else with access to Power benchmarking try this
> patch out on some benchmarks
> that you usually use? The new pass in this patch is on by default and can be
> turned off by -fno-store-merging
> if needed.  Jakub indicated that his last attempt at this work caused
> regressions on powerpc so I'd like to
> see if this patch is okay in that regard.


Comments below.

> Thanks,
> Kyrill
> 
> 2016-09-28  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>     PR middle-end/22141
>     * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
>     * common.opt (fstore-merging): New Optimization option.
>     * opts.c (default_options_table): Add entry for
>     OPT_ftree_store_merging.
>     * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
>     * passes.def: Insert pass_tree_store_merging.
>     * tree-pass.h (make_pass_store_merging): Declare extern
>     prototype.
>     * gimple-ssa-store-merging.c: New file.
>     * doc/invoke.texi (Optimization Options): Document
>     -fstore-merging.
> 
> 2016-09-28  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>             Jakub Jelinek  <ja...@redhat.com>
>             Andrew Pinski  <pins...@gmail.com>
> 
>     PR middle-end/22141
>     PR rtl-optimization/23684
>     * gcc.c-torture/execute/pr22141-1.c: New test.
>     * gcc.c-torture/execute/pr22141-2.c: Likewise.
>     * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
>     * gcc.target/aarch64/ldp_stp_4.c: Likewise.
>     * gcc.dg/store_merging_1.c: New test.
>     * gcc.dg/store_merging_2.c: Likewise.
>     * gcc.dg/store_merging_3.c: Likewise.
>     * gcc.dg/store_merging_4.c: Likewise.
>     * gcc.dg/store_merging_5.c: Likewise.
>     * gcc.dg/store_merging_6.c: Likewise.
>     * gcc.dg/store_merging_7.c: Likewise.
>     * gcc.target/i386/pr22141.c: Likewise.
>     * gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
> 
> commit 74e3e742e446f6a892c8ee755f640756efc816cd
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Date:   Tue Jul 12 12:30:47 2016 +0100
> 
>     GIMPLE store merging pass
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 9e08fbf..d83b2f1 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1300,6 +1300,7 @@ OBJS = \
>       gimple-ssa-isolate-paths.o \
>       gimple-ssa-nonnull-compare.o \
>       gimple-ssa-split-paths.o \
> +     gimple-ssa-store-merging.o \
>       gimple-ssa-strength-reduction.o \
>       gimple-ssa-sprintf.o \
>       gimple-streamer-in.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 168735e..4d73852 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1460,6 +1460,10 @@ fstrict-volatile-bitfields
>  Common Report Var(flag_strict_volatile_bitfields) Init(-1) Optimization
>  Force bitfield accesses to match their type width.
>  
> +fstore-merging
> +Common Var(flag_store_merging) Optimization
> +Use the tree store merging pass.
> +
>  fguess-branch-probability
>  Common Report Var(flag_guess_branch_prob) Optimization
>  Enable guessing of branch probabilities.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 6767462..83acd3b 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -403,7 +403,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
>  -fsplit-paths @gol
>  -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
> --fstdarg-opt -fstrict-aliasing @gol
> +-fstdarg-opt -fstore-merging -fstrict-aliasing @gol
>  -fstrict-overflow -fthread-jumps -ftracer -ftree-bit-ccp @gol
>  -ftree-builtin-call-dce -ftree-ccp -ftree-ch @gol
>  -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts @gol
> @@ -414,8 +414,8 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-loop-vectorize @gol
>  -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta 
> @gol
>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
> --ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
> --ftree-vectorize -ftree-vrp -funconstrained-commons @gol
> +-ftree-switch-conversion -ftree-tail-merge @gol
> +-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
>  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
>  -funsafe-math-optimizations -funswitch-loops @gol
>  -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
> @@ -6593,6 +6593,7 @@ compilation time.
>  -fsplit-wide-types @gol
>  -fssa-backprop @gol
>  -fssa-phiopt @gol
> +-fstore-merging @gol
>  -ftree-bit-ccp @gol
>  -ftree-ccp @gol
>  -ftree-ch @gol
> @@ -7927,6 +7928,13 @@ Perform scalar replacement of aggregates.  This pass 
> replaces structure
>  references with scalars to prevent committing structures to memory too
>  early.  This flag is enabled by default at @option{-O} and higher.
>  
> +@item -fstore-merging
> +@opindex fstore-merging
> +Perform merging of narrow stores to consecutive memory addresses.  This pass
> +merges contiguous stores of immediate values narrower than a word into fewer
> +wider stores to reduce the number of instructions.  This is enabled by 
> default
> +at @option{-O} and higher.
> +
>  @item -ftree-ter
>  @opindex ftree-ter
>  Perform temporary expression replacement during the SSA->normal phase.  
> Single
> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
> new file mode 100644
> index 0000000..c77933d
> --- /dev/null
> +++ b/gcc/gimple-ssa-store-merging.c
> @@ -0,0 +1,1113 @@
> +/* GIMPLE store merging pass.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   Contributed by ARM Ltd.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* The purpose of this pass is to combine multiple memory stores of
> +   constant values to consecutive memory locations into fewer wider stores.
> +   For example, if we have a sequence peforming four byte stores to
> +   consecutive memory locations:
> +   [p     ] := imm1;
> +   [p + 1B] := imm2;
> +   [p + 2B] := imm3;
> +   [p + 3B] := imm4;
> +   we can transform this into a single 4-byte store if the target supports 
> it:
> +  [p] := imm1:imm2:imm3:imm4 //concatenated immediates according to 
> endianness.
> +
> +   The algorithm is applied to each basic block in three phases:
> +
> +   1) Scan through the basic block recording constant assignments to
> +   destinations that can be expressed as a store to memory of a certain size
> +   at a certain bit offset.  Record store chains to different bases in a
> +   hash_map (m_stores) and make sure to terminate such chains when 
> appropriate
> +   (for example when when the stored values get used subsequently).
> +   These stores can be a result of structure element initializers, array 
> stores
> +   etc.  A store_immediate_info object is recorded for every such store.
> +   Record as many such assignments to a single base as possible until a
> +   statement that interferes with the store sequence is encountered.
> +
> +   2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
> +   store_immediate_info objects) and coalesce contiguous stores into
> +   merged_store_group objects.
> +
> +   For example, given the stores:
> +   [p     ] := 0;
> +   [p + 1B] := 1;
> +   [p + 3B] := 0;
> +   [p + 4B] := 1;
> +   [p + 5B] := 0;
> +   [p + 6B] := 0;
> +   This phase would produce two merged_store_group objects, one recording the
> +   two bytes stored in the memory region [p : p + 1] and another
> +   recording the four bytes stored in the memory region [p + 3 : p + 6].
> +
> +   3) The merged_store_group objects produced in phase 2) are processed
> +   to generate the sequence of wider stores that set the contiguous memory
> +   regions to the sequence of bytes that correspond to it.  This may emit
> +   multiple stores per store group to handle contiguous stores that are not
> +   of a size that is a power of 2.  For example it can try to emit a 40-bit
> +   store as a 32-bit store followed by an 8-bit store.
> +   We try to emit as wide stores as we can while respecting STRICT_ALIGNMENT 
> or
> +   SLOW_UNALIGNED_ACCESS rules.
> +
> +   Note on endianness and example:
> +   Consider 2 contiguous 16-bit stores followed by 2 contiguous 8-bit stores:
> +   [p     ] := 0x1234;
> +   [p + 2B] := 0x5678;
> +   [p + 4B] := 0xab;
> +   [p + 5B] := 0xcd;
> +
> +   The memory layout for little-endian (LE) and big-endian (BE) must be:
> +  p |LE|BE|
> +  ---------
> +  0 |34|12|
> +  1 |12|34|
> +  2 |78|56|
> +  3 |56|78|
> +  4 |ab|ab|
> +  5 |cd|cd|
> +
> +  To merge these into a single 48-bit merged value 'val' in phase 2)
> +  on little-endian we insert stores to higher (consecutive) bitpositions
> +  into the most significant bits of the merged value.
> +  The final merged value would be: 0xcdab56781234
> +
> +  For big-endian we insert stores to higher bitpositions into the least
> +  significant bits of the merged value.
> +  The final merged value would be: 0x12345678abcd
> +
> +  Then, in phase 3), we want to emit this 48-bit value as a 32-bit store
> +  followed by a 16-bit store.  Again, we must consider endianness when
> +  breaking down the 48-bit value 'val' computed above.
> +  For little endian we emit:
> +  [p]      (32-bit) := 0x56781234; // val & 0x0000ffffffff;
> +  [p + 4B] (16-bit) := 0xcdab;    // (val & 0xffff00000000) >> 32;
> +
> +  Whereas for big-endian we emit:
> +  [p]      (32-bit) := 0x12345678; // (val & 0xffffffff0000) >> 16;
> +  [p + 4B] (16-bit) := 0xabcd;     //  val & 0x00000000ffff;  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "builtins.h"
> +#include "fold-const.h"
> +#include "tree-pass.h"
> +#include "ssa.h"
> +#include "gimple-pretty-print.h"
> +#include "alias.h"
> +#include "fold-const.h"
> +#include "params.h"
> +#include "print-tree.h"
> +#include "tree-hash-traits.h"
> +#include "gimple-iterator.h"
> +#include "gimplify.h"
> +#include "stor-layout.h"
> +#include "tree-cfg.h"
> +#include "predict.h"
> +#include "target.h"
> +
> +/* The maximum size (in bits) of the stores this pass should generate.  */
> +#define MAX_STORE_BITSIZE (BITS_PER_WORD)
> +#define MAX_STORE_BYTES (MAX_STORE_BITSIZE / BITS_PER_UNIT)
> +
> +namespace {
> +
> +/* Struct recording the information about a single store of an immediate
> +   to memory.  These are created in the first phase and coalesced into
> +   merged_store_group objects in the second phase.  */
> +
> +struct store_immediate_info
> +{
> +  unsigned HOST_WIDE_INT bitsize;
> +  unsigned HOST_WIDE_INT bitpos;
> +  tree val;
> +  tree dest;
> +  gimple *stmt;
> +  unsigned int order;
> +  store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT, tree,
> +                     tree, gimple *, unsigned int);
> +};
> +
> +store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
> +                                         unsigned HOST_WIDE_INT bp, tree v,
> +                                         tree d, gimple *st,
> +                                         unsigned int ord)
> +  : bitsize (bs), bitpos (bp), val (v), dest (d), stmt (st), order (ord)
> +{
> +}
> +
> +/* Struct representing a group of stores to contiguous memory locations.
> +   These are produced by the second phase (coalescing) and consumed in the
> +   third phase that outputs the widened stores.  */
> +
> +struct merged_store_group
> +{
> +  unsigned HOST_WIDE_INT start;
> +  unsigned HOST_WIDE_INT width;
> +  /* The size of the allocated memory for val.  */
> +  unsigned HOST_WIDE_INT buf_size;
> +  unsigned char *val;
> +  unsigned int align;
> +  auto_vec<struct store_immediate_info *> stores;
> +  /* We record the first and last original statements in the sequence because
> +     we'll need their vuse/vdef and replacement position.  */
> +  gimple *last_stmt;
> +  unsigned int last_order;
> +  gimple *first_stmt;
> +  unsigned int first_order;

Please reorder fields to avoid padding on LP64 hosts.  I suppose
first/last stuff is not the first/last element in stores?  But
they should be in stores[] thus recording the index of the elements
might work?  Not sure if size of this struct matters though.

> +  merged_store_group (store_immediate_info *);
> +  ~merged_store_group ();
> +  void merge_into (store_immediate_info *);
> +  void merge_overlapping (store_immediate_info *);
> +  void apply_stores ();
> +};
> +
> +/* Debug helper.  Dump LEN elements of byte array PTR to FD in hex.  */
> +
> +static void
> +dump_char_array (FILE *fd, unsigned char *ptr, unsigned int len)
> +{
> +  if (!fd)
> +    return;
> +
> +  for (unsigned int i = 0; i < len; i++)
> +    fprintf (fd, "%x ", ptr[i]);
> +  fprintf (fd, "\n");
> +}
> +
> +/* Write BITLEN bits of EXPR to the byte array PTR at
> +   bit position BITPOS.  PTR should contain TOTAL_BYTES elements.  */
> +
> +static void
> +encode_tree_to_bitpos (tree expr, unsigned char *ptr, int bitlen, int bitpos,
> +                     unsigned int total_bytes)
> +{
> +  unsigned int last_byte = (bitpos + bitlen - 1) / BITS_PER_UNIT;
> +  unsigned int first_byte = bitpos / BITS_PER_UNIT;
> +  tree tmp_int = expr;
> +  bool sub_byte_op_p = (bitlen % BITS_PER_UNIT) || (bitpos % BITS_PER_UNIT);
> +  /* If the expression is not an integer encode it into the temporary buffer
> +     and read it back as an integer.  */
> +  if (TREE_CODE (expr) != INTEGER_CST && sub_byte_op_p)
> +    {
> +      unsigned char *tmpbuf = XALLOCAVEC (unsigned char, total_bytes);
> +      memset (tmpbuf, 0, total_bytes);
> +      native_encode_expr (expr, tmpbuf, total_bytes, 0);
> +      tree read_back_type
> +     = build_nonstandard_integer_type (tree_to_shwi (
> +             TYPE_SIZE (TREE_TYPE (expr))), UNSIGNED);
> +      tmp_int
> +     = native_interpret_expr (read_back_type, tmpbuf, total_bytes);
> +    }
> +
> +  gcc_assert (tmp_int);
> +
> +  /* If we're inserting a non-bytesized width or not at a byte boundary
> +     use an intermediate wide_int to perform the bit-insertion correctly.  */
> +  if (sub_byte_op_p
> +      || (TREE_CODE (expr) == INTEGER_CST
> +       && mode_for_size (bitlen, MODE_INT, 0) == BLKmode))

I wonder when we have BLKmode here ...

> +    {
> +      unsigned int byte_size = last_byte - first_byte + 1;
> +
> +      /* The functions native_encode_expr/native_interpret_expr uses the
> +      TYPE_MODE of the type to determine the number of bytes to write/read
> +      so if we want to process a number of bytes that does not have a
> +      TYPE_MODE of equal size we need to use a type that has a valid mode
> +      for it.  */
> +
> +      machine_mode mode
> +     = smallest_mode_for_size (byte_size * BITS_PER_UNIT, MODE_INT);
> +      tree dest_int_type
> +     = build_nonstandard_integer_type (GET_MODE_BITSIZE (mode), UNSIGNED);
> +      byte_size = GET_MODE_SIZE (mode);

... how we ever get non-BLKmode here.

> +      /* The region from the byte array that we're inserting into.  */
> +      tree ptr_wide_int
> +     = native_interpret_expr (dest_int_type, ptr + first_byte,
> +                              total_bytes);
> +
> +      gcc_assert (ptr_wide_int);
> +      wide_int dest_wide_int
> +     = wi::to_wide (ptr_wide_int, TYPE_PRECISION (dest_int_type));
> +      wide_int expr_wide_int
> +     = wi::to_wide (tmp_int, byte_size * BITS_PER_UNIT);
> +      if (BYTES_BIG_ENDIAN)
> +     {
> +       unsigned int insert_pos
> +         = byte_size * BITS_PER_UNIT - bitlen - (bitpos % BITS_PER_UNIT);
> +       dest_wide_int
> +         = wi::insert (dest_wide_int, expr_wide_int, insert_pos, bitlen);
> +     }
> +      else
> +     dest_wide_int = wi::insert (dest_wide_int, expr_wide_int,
> +                                 bitpos % BITS_PER_UNIT, bitlen);
> +
> +      tree res = wide_int_to_tree (dest_int_type, dest_wide_int);
> +      native_encode_expr (res, ptr + first_byte, total_bytes, 0);
> +

OTOH this whole dance looks as complicated and way more expensive than
using native_encode_expr into a temporary buffern and then a
manually implemented "bit-merging" of it at ptr + first_byte + bitpos.
AFAICS that operation is even endianess agnostic.

> +    }
> +  /* If we're inserting a "well-behaved" value at a normal position just
> +     call native_encode_expr directly.  */
> +  else
> +    native_encode_expr (tmp_int, ptr + first_byte,
> +                     total_bytes, 0);
> +}
> +
> +/* Sorting function for store_immediate_info objects.
> +   Sorts them by bitposition.  */
> +
> +static int
> +sort_by_bitpos (const void *x, const void *y)
> +{
> +  store_immediate_info *const *tmp = (store_immediate_info * const *) x;
> +  store_immediate_info *const *tmp2 = (store_immediate_info * const *) y;
> +
> +  if ((*tmp)->bitpos <= (*tmp2)->bitpos)
> +    return -1;
> +  else if ((*tmp)->bitpos > (*tmp2)->bitpos)
> +    return 1;
> +
> +  gcc_unreachable ();
> +  return 0;

drop the return 0

> +}
> +
> +/* Sorting function for store_immediate_info objects.
> +   Sorts them by the order field.  */
> +
> +static int
> +sort_by_order (const void *x, const void *y)
> +{
> +  store_immediate_info *const *tmp = (store_immediate_info * const *) x;
> +  store_immediate_info *const *tmp2 = (store_immediate_info * const *) y;
> +
> +  if ((*tmp)->order < (*tmp2)->order)
> +    return -1;
> +  else if ((*tmp)->order > (*tmp2)->order)
> +    return 1;
> +
> +  gcc_unreachable ();
> +  return 0;

likewise.

> +}
> +
> +/* Initialize a merged_store_group object from a store_immediate_info
> +   object.  */
> +
> +merged_store_group::merged_store_group (store_immediate_info *info)
> +{
> +  start = info->bitpos;
> +  width = info->bitsize;
> +  /* VAL has memory allocated for it in apply_stores once the group
> +     width has been finalized.  */
> +  val = NULL;
> +  align = get_object_alignment (info->dest);
> +  stores.create (1);
> +  stores.safe_push (info);
> +  last_stmt = info->stmt;
> +  last_order = info->order;
> +  first_stmt = last_stmt;
> +  first_order = last_order;
> +  buf_size = 0;
> +}
> +
> +merged_store_group::~merged_store_group ()
> +{
> +  if (val)
> +    XDELETEVEC (val);
> +}
> +
> +/* Merge a store recorded by INFO into this merged store.
> +   The store is not overlapping with the existing recorded
> +   stores.  */
> +
> +void
> +merged_store_group::merge_into (store_immediate_info *info)
> +{
> +  unsigned HOST_WIDE_INT wid = info->bitsize;
> +  /* Make sure we're inserting in the position we think we're inserting.  */
> +  gcc_assert (info->bitpos == start + width);
> +
> +  width += wid;
> +  gimple *stmt = info->stmt;
> +  stores.safe_push (info);
> +  if (info->order > last_order)
> +    {
> +      last_order = info->order;
> +      last_stmt = stmt;
> +    }
> +  else if (info->order < first_order)
> +    {
> +      first_order = info->order;
> +      first_stmt = stmt;
> +    }
> +}
> +
> +/* Merge a store described by INFO into this merged store.
> +   INFO overlaps in some way with the current store (i.e. it's not contiguous
> +   which is handled by merged_store_group::merge_into).  */
> +
> +void
> +merged_store_group::merge_overlapping (store_immediate_info *info)
> +{
> +  gimple *stmt = info->stmt;
> +  stores.safe_push (info);
> +
> +  /* If the store extends the size of the group, extend the width.  */
> +  if ((info->bitpos + info->bitsize) > (start + width))
> +    width += info->bitpos + info->bitsize - (start + width);
> +
> +  if (info->order > last_order)
> +    {
> +      last_order = info->order;
> +      last_stmt = stmt;
> +    }
> +  else if (info->order < first_order)
> +    {
> +      first_order = info->order;
> +      first_stmt = stmt;
> +    }
> +}
> +
> +/* Go through all the recorded stores in this group in program order and
> +   apply their values to the VAL byte array to create the final merged
> +   value.  */
> +
> +void
> +merged_store_group::apply_stores ()
> +{
> +  stores.qsort (sort_by_order);
> +  struct store_immediate_info *info;
> +  unsigned int i;
> +  /* Create a buffer of a size that is 2 times the number of bytes we're
> +     storing.  That way native_encode_expr can write power-of-2-sized
> +     chunks without overrunning.  */
> +  buf_size
> +    = 2 * (ROUND_UP (width, BITS_PER_UNIT) / BITS_PER_UNIT);
> +  val = XNEWVEC (unsigned char, buf_size);
> +  memset (val, 0, buf_size);

XCNEWVEC

> +
> +  FOR_EACH_VEC_ELT (stores, i, info)
> +    {
> +      unsigned int pos_in_buffer = info->bitpos - start;
> +      encode_tree_to_bitpos (info->val, val, info->bitsize, pos_in_buffer,
> +                          buf_size);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +     {
> +       fprintf (dump_file, "After writing ");
> +       print_generic_expr (dump_file, info->val, 0);
> +       fprintf (dump_file, " of size " HOST_WIDE_INT_PRINT_DEC
> +               "at position %d the merged region contains:\n",
> +                info->bitsize, pos_in_buffer);
> +       dump_char_array (dump_file, val, buf_size);
> +     }
> +    }
> +}
> +
> +/* Structure describing the store chain.  */
> +
> +struct imm_store_chain_info
> +{
> +  auto_vec<struct store_immediate_info *> m_store_info;
> +  auto_vec<merged_store_group *> m_merged_store_groups;
> +
> +  bool terminate_and_process_chain (tree);
> +  bool coalesce_immediate_stores ();
> +  bool output_merged_store (tree, merged_store_group *);
> +  bool output_merged_stores (tree);
> +};
> +
> +const pass_data pass_data_tree_store_merging = {
> +  GIMPLE_PASS,     /* type */
> +  "store-merging", /* name */
> +  OPTGROUP_NONE,   /* optinfo_flags */
> +  TV_NONE,    /* tv_id */
> +  PROP_ssa,  /* properties_required */
> +  0,            /* properties_provided */
> +  0,            /* properties_destroyed */
> +  0,            /* todo_flags_start */
> +  TODO_update_ssa, /* todo_flags_finish */
> +};
> +
> +class pass_store_merging : public gimple_opt_pass
> +{
> +public:
> +  pass_store_merging (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_tree_store_merging, ctxt)
> +  {
> +  }
> +
> +  /* Pass not supported for PDP-endianness.  */
> +  virtual bool
> +  gate (function *)
> +  {
> +    return flag_store_merging && optimize

the && optimize is not necessary.

> +        && (WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN);
> +  }
> +
> +  virtual unsigned int execute (function *);
> +
> +private:
> +  hash_map<tree_operand_hash, struct imm_store_chain_info *> m_stores;
> +
> +  bool terminate_and_process_all_chains (basic_block);
> +  bool terminate_all_aliasing_chains (tree, tree, gimple *);
> +  bool terminate_and_release_chain (tree);
> +}; // class pass_store_merging
> +
> +/* Terminate and process all recorded chains.  Return true if any changes
> +   were made.  The chains occur in basic block BB, so purge potentially dead
> +   EH edges from there.  */
> +
> +bool
> +pass_store_merging::terminate_and_process_all_chains (basic_block bb)
> +{
> +  hash_map<tree_operand_hash, struct imm_store_chain_info *>::iterator iter
> +    = m_stores.begin ();
> +  bool ret = false;
> +  for (; iter != m_stores.end (); ++iter)
> +    ret |= terminate_and_release_chain ((*iter).first);
> +
> +  if (ret)
> +    gimple_purge_dead_eh_edges (bb);

Why do you need this?  I don't see how merging stores should effect EH
edges at all -- unless you are removing EH side-effects (ISTR you
are not merging cross-BB).

> +  return ret;
> +}
> +
> +/* Terminate all chains that are affected by the assignment to DEST, 
> appearing
> +   in statement STMT and ultimately points to the object BASE.  Return true 
> if
> +   at least one aliasing chain was terminated.  BASE and DEST are allowed to
> +   be NULL_TREE.  In that case the aliasing checks are performed on the whole
> +   statement rather than a particular operand in it.  */
> +
> +bool
> +pass_store_merging::terminate_all_aliasing_chains (tree dest, tree base,
> +                                                gimple *stmt)
> +{
> +  bool ret = false;
> +
> +  /* If the statement doesn't touch memory it can't alias.  */
> +  if (!gimple_vuse (stmt))
> +    return false;
> +
> +  struct imm_store_chain_info **chain_info = NULL;
> +

Please add a comment what case you handle here ...

> +  basic_block bb = gimple_bb (stmt);
> +  if (base)
> +    {
> +      chain_info = m_stores.get (base);
> +      if (chain_info)
> +     {
> +       struct store_immediate_info *info;
> +       unsigned int i;
> +       FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
> +         {
> +           if (refs_may_alias_p (info->dest, dest))
> +             {
> +               if (dump_file)
> +                 {
> +                   fprintf (dump_file, "stmt causes chain termination:\n");
> +                   print_gimple_stmt (dump_file, stmt, 0, 0);
> +                 }
> +               terminate_and_release_chain (base);
> +               ret = true;
> +               break;
> +             }
> +         }
> +     }
> +    }

... and which here.  The one below is kind of obvious but I do not
understand the above (from the chain_info check below it looks like
it is not necessary?)

> +  hash_map<tree_operand_hash, struct imm_store_chain_info *>::iterator iter
> +    = m_stores.begin ();
> +
> +  for (; iter != m_stores.end (); ++iter)
> +    {
> +      if (chain_info && (*chain_info) == (*iter).second)
> +     continue;
> +
> +      tree key = (*iter).first;
> +      if (ref_maybe_used_by_stmt_p (stmt, key)
> +       || stmt_may_clobber_ref_p (stmt, key))
> +     {
> +       terminate_and_release_chain (key);
> +       ret = true;
> +     }
> +    }
> +  if (ret)
> +    gimple_purge_dead_eh_edges (bb);

See above.

> +  return ret;
> +}
> +
> +/* Helper function.  Terminate the recorded chain storing to base object
> +   BASE.  Return true if the merging and output was successful.  The m_stores
> +   entry is removed after the processing in any case.  */
> +
> +bool
> +pass_store_merging::terminate_and_release_chain (tree base)
> +{
> +  struct imm_store_chain_info **chain_info = m_stores.get (base);
> +
> +  if (!chain_info)
> +    return false;
> +
> +  gcc_assert (*chain_info);
> +
> +  bool ret = (*chain_info)->terminate_and_process_chain (base);
> +  delete *chain_info;
> +  m_stores.remove (base);
> +
> +  return ret;
> +}
> +
> +/* Go through the candidate stores recorded in m_store_info and merge them
> +   into merged_store_group objects recorded into m_merged_store_groups
> +   representing the widened stores.  Return true if coalescing was successful
> +   and the number of widened stores is fewer than the original number
> +   of stores.  */
> +
> +bool
> +imm_store_chain_info::coalesce_immediate_stores ()
> +{
> +  /* Anything less can't be processed.  */
> +  if (m_store_info.length () < 2)
> +    return false;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Attempting to coalesce %u stores in chain.\n",
> +          m_store_info.length ());
> +
> +  store_immediate_info *info;
> +  unsigned int i;
> +
> +  /* Order the stores by the bitposition they write to.  */
> +  m_store_info.qsort (sort_by_bitpos);
> +
> +  info = m_store_info[0];
> +  merged_store_group *merged_store = new merged_store_group (info);
> +  m_merged_store_groups.safe_push (merged_store);
> +
> +  FOR_EACH_VEC_ELT (m_store_info, i, info)
> +    {
> +      if (dump_file)
> +     {
> +       fprintf (dump_file, "Store %u:\nbitsize:" HOST_WIDE_INT_PRINT_DEC
> +                           " bitpos:" HOST_WIDE_INT_PRINT_DEC " val:\n",
> +                i, info->bitsize, info->bitpos);
> +       print_generic_expr (dump_file, info->val, 0);
> +       fprintf (dump_file, "\n------------\n");
> +     }
> +
> +      if (i == 0)
> +     continue;
> +
> +      /* |---store 1---|
> +            |---store 2---|
> +       Overlapping stores.  */
> +      unsigned HOST_WIDE_INT start = info->bitpos;
> +      if (IN_RANGE (start, merged_store->start,
> +                 merged_store->start + merged_store->width - 1))
> +     {
> +       merged_store->merge_overlapping (info);
> +       continue;
> +     }
> +
> +      /* |---store 1---| <gap> |---store 2---|.
> +      Gap between stores.  Start a new group.  */
> +      if (start != merged_store->start + merged_store->width)
> +     {
> +       merged_store = new merged_store_group (info);
> +       m_merged_store_groups.safe_push (merged_store);

So this will create merged_store_group even for single-store groups
(all stores have gaps between them)?  Can't we optimize this maybe
common case to not perform the memory allocation?  Like handle
this case first and lazily allocate a new group.  Looks like
apply_stores also doesn't bail out early for single-element groups.

> +       continue;
> +     }
> +
> +      /* |---store 1---||---store 2---|
> +      This store is consecutive to the previous one.
> +      Merge it into the current store group.  */
> +       merged_store->merge_into (info);
> +    }
> +
> +  gcc_assert (m_merged_store_groups.length () <= m_store_info.length ());
> +  bool success = m_merged_store_groups.length () < m_store_info.length ();
> +  if (dump_file)
> +    {
> +      if (success)

merge the if()s

> +     fprintf (dump_file, "Coalescing successful!\n"
> +                         "Merged into %u stores\n",
> +              m_merged_store_groups.length ());
> +    }
> +
> +  if (success)
> +    {

excess {

> +      FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_store)
> +     {

Likewise.

> +       merged_store->apply_stores ();
> +     }
> +    }
> +  return success;
> +}
> +
> +/* Return the type to use for the merged stores described by GROUP.
> +   This is needed to get the alias sets right.  */
> +
> +static tree
> +get_type_for_merged_store (merged_store_group *group)

The name is somewhat confusing, rename it to 
get_alias_type_for_merged_store.

> +{
> +  struct store_immediate_info *store;
> +  unsigned int i;
> +  tree lhs = gimple_assign_lhs (group->stores[0]->stmt);
> +  tree type = reference_alias_ptr_type (lhs);
> +  FOR_EACH_VEC_ELT (group->stores, i, store)
> +    {
> +      gimple *stmt = store->stmt;
> +      if (i == 0)
> +     continue;
> +
> +      lhs = gimple_assign_lhs (stmt);
> +      tree type1 = reference_alias_ptr_type (lhs);
> +      if (!alias_ptr_types_compatible_p (type, type1))
> +     return ptr_type_node;
> +    }
> +  return type;
> +}
> +
> +/* Return the location_t information we can find among the statements
> +   in GROUP.  */
> +
> +static location_t
> +get_merged_store_location (merged_store_group *group)
> +{
> +  struct store_immediate_info *store;
> +  unsigned int i;
> +
> +  FOR_EACH_VEC_ELT (group->stores, i, store)
> +    {
> +      gimple *stmt = store->stmt;
> +      if (gimple_has_location (stmt))
> +     return gimple_location (stmt);
> +    }
> +  return UNKNOWN_LOCATION;
> +}
> +
> +/* Return true if storing an integer of bitsize SIZE using an unaligned
> +   access is prohibitively slow.  */
> +
> +static bool
> +slow_unaligned_size_access (unsigned int size)
> +{
> +  machine_mode mode = mode_for_size (size, MODE_INT, 0);
> +  gcc_assert (mode != BLKmode);
> +  return SLOW_UNALIGNED_ACCESS (mode, GET_MODE_ALIGNMENT (mode));

I suspect this will never return true given you pass in the modes
alignment itself.

But I think we determined that even for SLOW_UNALIGNED_ACCESS it is
better to merge the store anyway as the expander can do a better
job than the user.  Thus -> remove.

> +}
> +
> +/* Given a merged store group GROUP output the widened version of it.
> +   The store chain is against the base object BASE.
> +   Try store sizes of at most MAX_STORE_BITSIZE bits wide and don't output
> +   unaligned stores for STRICT_ALIGNMENT targets or if it's too expensive.
> +   Make sure that the number of statements output is less than the number of
> +   original statements.  If a better sequence is possible emit it and
> +   return true.  */
> +
> +bool
> +imm_store_chain_info::output_merged_store (tree base, merged_store_group 
> *group)
> +{
> +  unsigned HOST_WIDE_INT pos = group->start;
> +  unsigned HOST_WIDE_INT size = group->width;
> +  unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
> +  unsigned int align = group->align;
> +
> +  unsigned int orig_num_stmts = group->stores.length ();
> +  if (orig_num_stmts < 2)
> +    return false;
> +
> +  /* We don't handle partial bitfields for now.  */
> +  if ((size % BITS_PER_UNIT != 0) || (pos % BITS_PER_UNIT != 0))
> +    return false;

I suppose we could have guarded against this earlier and avoided
quite some work.  Maybe simply never start a chain at
pos % BITS_PER_UNIT and eventually "back-track" from a terminator
that ends up at a non-byte-position end.  This way you can still
cover properly aligned parts of a series of bitfield stores.

That is, both of the above checks belong to the analysis phase.

> +
> +  gimple_stmt_iterator last_gsi = gsi_for_stmt (group->last_stmt);
> +  basic_block bb = gsi_bb (last_gsi);
> +
> +  /* Allow unaligned accesses in cold blocks.  This depends on the predict.c
> +     machinery getting the estimates right.  */
> +  bool allow_unaligned
> +    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED)
> +      && optimize_bb_for_size_p (bb);

See above for the unaligned case.  Btw, if you use
get_object_alignment_1 you also get misalignemnt info, thus
maybe a group starts at 16 byte aligned + 8 and is of size 24
then you can use an 8 byte and a 16 byte store, both aligned.

> +  unsigned int try_size = MAX_STORE_BITSIZE;
> +  while (try_size > size
> +      || ((!allow_unaligned || slow_unaligned_size_access (try_size))
> +          && try_size > align))
> +    {
> +      try_size /= 2;
> +      if (try_size < BITS_PER_UNIT)
> +     {
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +         {
> +           fprintf (dump_file, "Failed to output merged store.\n"
> +                               "Failed to find starting size meeting"
> +                               " alignment requirements.\n");
> +         }
> +       return false;
> +     }
> +    }
> +
> +  unsigned HOST_WIDE_INT try_pos = bytepos;
> +  int wi_offset = 0;
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +            "Trying to output merged store at pos " HOST_WIDE_INT_PRINT_DEC
> +            ", of size " HOST_WIDE_INT_PRINT_DEC ", "
> +            "starting size %u and alignment %u\n",
> +            pos, size, try_size, align);
> +    }
> +
> +  gimple_seq seq = NULL;
> +  unsigned int num_stmts = 0;
> +  tree offset_type = get_type_for_merged_store (group);
> +  tree last_vdef, new_vuse;
> +  last_vdef = gimple_vdef (group->last_stmt);
> +  new_vuse = gimple_vuse (group->last_stmt);
> +  location_t loc = get_merged_store_location (group);

If you end up splitting the store then please use a location appropriate
for the split part.  Likewise for the alias type.

> +  gimple *stmt = NULL;
> +
> +  while (size > 0)
> +    {
> +      tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
> +      SET_TYPE_ALIGN (int_type, align);
> +      tree addr = build_fold_addr_expr (base);
> +      tree dest = fold_build2 (MEM_REF, int_type, addr,
> +                            build_int_cst (offset_type, try_pos));
> +
> +      tree src = native_interpret_expr (int_type,
> +                                     group->val + wi_offset / BITS_PER_UNIT,
> +                                     group->buf_size);
> +
> +      if (dump_file)
> +     {
> +       fprintf (dump_file, "Output src is:\n");
> +       print_generic_expr (dump_file, src, 0);
> +       fprintf (dump_file, "\n");
> +     }
> +      stmt = gimple_build_assign (dest, src);
> +      gimple_set_vuse (stmt, new_vuse);
> +      gimple_seq_add_stmt_without_update (&seq, stmt);
> +      gimple_set_bb (stmt, gsi_bb (last_gsi));

setting BB is not necessary -- but I also think that when you end up
splitting the group you should insert the new stores close to the
original position of the stores.

> +
> +      num_stmts++;
> +
> +      try_pos += try_size / BITS_PER_UNIT;
> +
> +      wi_offset += try_size;
> +      size -= try_size;
> +      align = try_size;
> +      while (size < try_size)
> +     try_size /= 2;
> +
> +      if (num_stmts >= orig_num_stmts)
> +     {

Hmm.  I think that if you'd end up with an unchanged stmt simply leave it
there.  I don't see how you'd ever end up with more stmts than in the
original source?

> +       if (dump_file)
> +         {
> +           fprintf (dump_file, "Exceeded original number of stmts (%u)."
> +                               "  Not profitable to emit new sequence.\n",
> +                    orig_num_stmts);
> +         }

You should at least release SSA names created sofar.

> +       return false;
> +     }
> +
> +      tree new_vdef
> +     = size ? make_ssa_name (gimple_vop (cfun), stmt) : last_vdef;
> +      gimple_set_vdef (stmt, new_vdef);
> +      SSA_NAME_DEF_STMT (new_vdef) = stmt;
> +      new_vuse = new_vdef;
> +    }
> +  gcc_assert (size == 0);
> +  gcc_assert (seq);
> +  annotate_all_with_location (seq, loc);
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +            "New sequence of %u stmts to replace old one of %u stmts\n",
> +            num_stmts, orig_num_stmts);
> +      if (dump_flags & TDF_DETAILS)
> +     print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
> +    }
> +  gsi_insert_seq_after (&last_gsi, seq, GSI_SAME_STMT);

That said, maybe it's easiest to implement a "split groups" according
to the cost stuff above and have actual stmt creation / emission
work on groups suitable for emission.

> +  return true;
> +}
> +
> +/* Process the merged_store_group objects created in the coalescing phase.
> +   The stores are all against the base object BASE.
> +   Try to output the widened stores and delete the original statements if
> +   successful.  Return true iff any changes were made.  */
> +
> +bool
> +imm_store_chain_info::output_merged_stores (tree base)
> +{
> +  unsigned int i;
> +  merged_store_group *merged_store;
> +  bool ret = false;
> +  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_store)
> +    {
> +      if (output_merged_store (base, merged_store))
> +     {
> +       unsigned int j;
> +       store_immediate_info *store;
> +       FOR_EACH_VEC_ELT (merged_store->stores, j, store)
> +         {
> +           gimple *stmt = store->stmt;
> +           gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +           gsi_remove (&gsi, true);
> +           if (stmt != merged_store->last_stmt)
> +             {
> +               unlink_stmt_vdef (stmt);
> +               release_defs (stmt);
> +             }
> +         }
> +       ret = true;
> +     }
> +    }
> +  if (ret && dump_file)
> +    fprintf (dump_file, "Merging successful!\n");
> +
> +  return ret;
> +}
> +
> +/* Coalesce the store_immediate_info objects recorded against the base object
> +   BASE in the first phase and output them.
> +   Delete the allocated structures.
> +   Return true if any changes were made.  */
> +
> +bool
> +imm_store_chain_info::terminate_and_process_chain (tree base)
> +{
> +  /* Process store chain.  */
> +  bool ret = false;
> +  if (m_store_info.length () > 1)
> +    {
> +      ret = coalesce_immediate_stores ();
> +      if (ret)
> +     ret = output_merged_stores (base);
> +    }
> +
> +  /* Delete all the entries we allocated ourselves.  */
> +  store_immediate_info *info;
> +  unsigned int i;
> +  FOR_EACH_VEC_ELT (m_store_info, i, info)
> +    delete info;
> +
> +  merged_store_group *merged_info;
> +  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_info)
> +    delete merged_info;
> +
> +  return ret;
> +}
> +
> +/* Return true iff LHS is a destination potentially interesting for
> +   store merging.  */
> +
> +static bool
> +lhs_code_valid_for_store_merging_p (tree lhs)
> +{
> +  tree_code code = TREE_CODE (lhs);
> +  tree type = TREE_TYPE (lhs);
> +
> +  /* If the target requests a promotion to a wider type get_inner_reference 
> may
> +     give bad results for the store destination (for example __fp16 to float
> +     promotion in aarch64).  Be conservative here.  */

?

> +  tree prom_type = targetm.promoted_type (type);
> +  if (prom_type && TYPE_PRECISION (type) != TYPE_PRECISION (prom_type))
> +    return false;
> +
> +  if (code == ARRAY_REF || code == ARRAY_RANGE_REF || code == MEM_REF
> +      || code == COMPONENT_REF || code == BIT_FIELD_REF)

Cheaper check, do it first.  I think you want simply
REFERENCE_CLASS_P (lhs)

> +    return true;
> +
> +  return false;
> +}
> +
> +/* Return true if the tree RHS is a constant we want to consider
> +   during store merging.  In practice accept all codes that
> +   native_encode_expr accepts.  */

I've seen you don't do any error checks on native_encode_expr (it
returns 0 on failure).

> +static bool
> +rhs_valid_for_store_merging_p (tree rhs)
> +{
> +  tree_code code = TREE_CODE (rhs);
> +  if (TREE_CODE_CLASS (code) != tcc_constant
> +      || !tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (rhs))))

The latter check should be always true.  Can you please add a
can_native_encode_type_p alongside can_native_interpret_type_p and
use that instead?

Another interesting case to handle would be a store from an
empty CONSTRUCTOR (that's zeroing an aggregate).

> +    return false;
> +
> +  return code != VOID_CST;

Huh, do you really hit this?

> +}
> +
> +/* Entry point for the pass.  Go over each basic block recording chains of
> +  immediate stores.  Upon encountering a terminating statement (as defined
> +  by stmt_terminates_chain_p) process the recorded stores and emit the 
> widened
> +  variants.  */
> +
> +unsigned int
> +pass_store_merging::execute (function *fun)
> +{
> +  basic_block bb;
> +  hash_set<gimple *> orig_stmts;
> +
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      gimple_stmt_iterator gsi;
> +      unsigned HOST_WIDE_INT num_statements = 0;
> +      /* Record the original statements so that we can keep track of
> +      statements emitted in this pass and not re-process new
> +      statements.  */
> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +     {
> +       gimple_set_visited (gsi_stmt (gsi), false);

You don't seem to use the visited flag any more.

> +       if (is_gimple_debug (gsi_stmt (gsi)))
> +         continue;
> +
> +       num_statements++;

So this can be short-cutted at == 2.

> +     }
> +
> +      if (num_statements < 2)
> +     continue;
> +
> +      if (dump_file)
> +     fprintf (dump_file, "Processing basic block <%d>:\n", bb->index);

just noticed your dumps are quite verbose -- please dump all but
actual transforms only with (dump_flags & TDF_DETAILS)

> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +     {
> +       gimple *stmt = gsi_stmt (gsi);
> +
> +       if (gimple_has_volatile_ops (stmt))
> +         {
> +           /* Terminate all chains.  */
> +           if (dump_file)
> +             fprintf (dump_file, "Volatile access terminates "
> +                                 "all chains\n");
> +           terminate_and_process_all_chains (bb);
> +           continue;
> +         }
> +
> +       if (is_gimple_debug (stmt))
> +         continue;
> +
> +       if (gimple_assign_single_p (stmt) && gimple_vdef (stmt)
> +           && lhs_code_valid_for_store_merging_p (gimple_assign_lhs (stmt)))

Can't you check rhs_valid_for_store_merging_p here and avoid the
costly get_inner_reference?

> +         {
> +           tree lhs = gimple_assign_lhs (stmt);
> +           tree rhs = gimple_assign_rhs1 (stmt);
> +
> +           HOST_WIDE_INT bitsize, bitpos;
> +           machine_mode mode;
> +           int unsignedp = 0, reversep = 0, volatilep = 0;
> +           tree offset, base_addr;
> +           base_addr
> +             = get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
> +                                    &unsignedp, &reversep, &volatilep);
> +           bool invalid = offset || reversep

so same offset should be ok.  As an enhancement you might consider
a get_inner_reference variant only working up to the first handled 
component that has a variable offset and use the remaining ref as
"base".  As enhacement for later -- adding a comment for this would
be nice.

> +                          || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
> +                              && (bitpos % BITS_PER_UNIT)
> +                              && (TREE_CODE (rhs) != INTEGER_CST))

I guess this can go if you can simplify the bit merging.

> +                          || !rhs_valid_for_store_merging_p (rhs)
> +                          || TREE_THIS_VOLATILE (base_addr);

TREE_THIS_VOLATILE is handled above already.

> +
> +           /* In some cases get_inner_reference may return a
> +              MEM_REF [ptr + byteoffset].  For the purposes of this pass
> +              canonicalize the base_addr to MEM_REF [ptr] and take
> +              byteoffset into account in the bitpos.  This occurs in
> +              PR 23684 and this way we can catch more chains.  */
> +           if (TREE_CODE (base_addr) == MEM_REF
> +               && POINTER_TYPE_P (TREE_TYPE (TREE_OPERAND (base_addr, 0)))
> +               && TREE_CODE (TREE_OPERAND (base_addr, 1)) == INTEGER_CST

This is always an INTEGER_CST.

> +               && tree_fits_shwi_p (TREE_OPERAND (base_addr, 1))

This will never allow negative offsets (but maybe this is a good thing?)

)
> +             {
> +               bitpos += tree_to_shwi (TREE_OPERAND (base_addr, 1))
> +                         * BITS_PER_UNIT;

this multiplication may overflow.  There is mem_ref_offset () which
you should really use here, see get_inner_reference itself (and
how to translate back from offset_int to HOST_WIDE_INT if it fits).

> +
> +               base_addr = fold_build2 (MEM_REF, TREE_TYPE (base_addr),
> +                                        TREE_OPERAND (base_addr, 0),
> +                                        build_zero_cst (TREE_TYPE (
> +                                          TREE_OPERAND (base_addr, 1))));

Ugh, building a tree node ... you could use TREE_OPERAND (base_addr, 0)
as base_addr instead?


> +             }
> +
> +           struct imm_store_chain_info **chain_info
> +             = m_stores.get (base_addr);
> +
> +           if (!invalid)
> +             {
> +               store_immediate_info *info;
> +               if (chain_info)
> +                 {
> +                   info = new store_immediate_info (
> +                     bitsize, bitpos, rhs, lhs, stmt,
> +                     (*chain_info)->m_store_info.length ());
> +                   if (dump_file)
> +                     {
> +                       fprintf (dump_file,
> +                                "Recording immediate store from stmt:\n");
> +                       print_gimple_stmt (dump_file, stmt, 0, 0);
> +                     }
> +                   (*chain_info)->m_store_info.safe_push (info);
> +                   continue;
> +                 }
> +
> +               /* Store aliases any existing chain?  */
> +               terminate_all_aliasing_chains (lhs, base_addr, stmt);
> +
> +               /* Start a new chain.  */
> +               struct imm_store_chain_info *new_chain
> +                 = new imm_store_chain_info;
> +               info = new store_immediate_info (bitsize, bitpos, rhs, lhs,
> +                                                stmt, 0);
> +               new_chain->m_store_info.safe_push (info);
> +               m_stores.put (base_addr, new_chain);
> +               if (dump_file)
> +                 {
> +                   fprintf (dump_file,
> +                            "Starting new chain with statement:\n");
> +                   print_gimple_stmt (dump_file, stmt, 0, 0);
> +                   fprintf (dump_file, "The base object is:\n");
> +                   print_generic_expr (dump_file, base_addr, 0);
> +                   fprintf (dump_file, "\n");
> +                 }
> +             }
> +           else
> +             terminate_all_aliasing_chains (lhs, base_addr, stmt);
> +
> +           continue;
> +         }
> +
> +       terminate_all_aliasing_chains (NULL_TREE, NULL_TREE, stmt);
> +     }
> +      terminate_and_process_all_chains (bb);
> +    }
> +  return 0;
> +}
> +
> +} // anon namespace
> +
> +/* Construct and return a store merging pass object.  */
> +
> +gimple_opt_pass *
> +make_pass_store_merging (gcc::context *ctxt)
> +{
> +  return new pass_store_merging (ctxt);
> +}
> diff --git a/gcc/opts.c b/gcc/opts.c
> index 45f1f89c..e63d7e4 100644
> --- a/gcc/opts.c
> +++ b/gcc/opts.c
> @@ -463,6 +463,7 @@ static const struct default_options 
> default_options_table[] =
>      { OPT_LEVELS_1_PLUS, OPT_ftree_dse, NULL, 1 },
>      { OPT_LEVELS_1_PLUS, OPT_ftree_ter, NULL, 1 },
>      { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_ftree_sra, NULL, 1 },
> +    { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_fstore_merging, NULL, 1 },

Please leave it to -O[2s]+ -- the chain invalidation is quadratic and
-O1 should work well even for gigantic basic blocks.

Overall the pass looks quite well with the comments addressed.

Thanks,
Richard.

>      { OPT_LEVELS_1_PLUS, OPT_ftree_fre, NULL, 1 },
>      { OPT_LEVELS_1_PLUS, OPT_ftree_copy_prop, NULL, 1 },
>      { OPT_LEVELS_1_PLUS, OPT_ftree_sink, NULL, 1 },
> diff --git a/gcc/params.def b/gcc/params.def
> index 8907aa4..e63e594 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1100,6 +1100,12 @@ DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
>            "Maximum amount of similar bbs to compare a bb with.",
>            10, 0, 0)
>  
> +DEFPARAM (PARAM_STORE_MERGING_ALLOW_UNALIGNED,
> +       "store-merging-allow-unaligned",
> +       "Allow the store merging pass to introduce unaligned stores "
> +       "if it is legal to do so",
> +       1, 0, 1)
> +
>  DEFPARAM (PARAM_MAX_TAIL_MERGE_ITERATIONS,
>            "max-tail-merge-iterations",
>            "Maximum amount of iterations of the pass over a function.",
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 2830421..ee7dd50 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -329,6 +329,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_phiopt);
>        NEXT_PASS (pass_fold_builtins);
>        NEXT_PASS (pass_optimize_widening_mul);
> +      NEXT_PASS (pass_store_merging);
>        NEXT_PASS (pass_tail_calls);
>        /* If DCE is not run before checking for uninitialized uses,
>        we may get false warnings (e.g., testsuite/gcc.dg/uninit-5.c).
> diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c 
> b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
> new file mode 100644
> index 0000000..7c888b4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
> @@ -0,0 +1,122 @@
> +/* PR middle-end/22141 */
> +
> +extern void abort (void);
> +
> +struct S
> +{
> +  struct T
> +    {
> +      char a;
> +      char b;
> +      char c;
> +      char d;
> +    } t;
> +} u;
> +
> +struct U
> +{
> +  struct S s[4];
> +};
> +
> +void __attribute__((noinline))
> +c1 (struct T *p)
> +{
> +  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
> +    abort ();
> +  __builtin_memset (p, 0xaa, sizeof (*p));
> +}
> +
> +void __attribute__((noinline))
> +c2 (struct S *p)
> +{
> +  c1 (&p->t);
> +}
> +
> +void __attribute__((noinline))
> +c3 (struct U *p)
> +{
> +  c2 (&p->s[2]);
> +}
> +
> +void __attribute__((noinline))
> +f1 (void)
> +{
> +  u = (struct S) { { 1, 2, 3, 4 } };
> +}
> +
> +void __attribute__((noinline))
> +f2 (void)
> +{
> +  u.t.a = 1;
> +  u.t.b = 2;
> +  u.t.c = 3;
> +  u.t.d = 4;
> +}
> +
> +void __attribute__((noinline))
> +f3 (void)
> +{
> +  u.t.d = 4;
> +  u.t.b = 2;
> +  u.t.a = 1;
> +  u.t.c = 3;
> +}
> +
> +void __attribute__((noinline))
> +f4 (void)
> +{
> +  struct S v;
> +  v.t.a = 1;
> +  v.t.b = 2;
> +  v.t.c = 3;
> +  v.t.d = 4;
> +  c2 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f5 (struct S *p)
> +{
> +  p->t.a = 1;
> +  p->t.c = 3;
> +  p->t.d = 4;
> +  p->t.b = 2;
> +}
> +
> +void __attribute__((noinline))
> +f6 (void)
> +{
> +  struct U v;
> +  v.s[2].t.a = 1;
> +  v.s[2].t.b = 2;
> +  v.s[2].t.c = 3;
> +  v.s[2].t.d = 4;
> +  c3 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f7 (struct U *p)
> +{
> +  p->s[2].t.a = 1;
> +  p->s[2].t.c = 3;
> +  p->s[2].t.d = 4;
> +  p->s[2].t.b = 2;
> +}
> +
> +int
> +main (void)
> +{
> +  struct U w;
> +  f1 ();
> +  c2 (&u);
> +  f2 ();
> +  c1 (&u.t);
> +  f3 ();
> +  c2 (&u);
> +  f4 ();
> +  f5 (&u);
> +  c2 (&u);
> +  f6 ();
> +  f7 (&w);
> +  c3 (&w);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c 
> b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
> new file mode 100644
> index 0000000..cb9cc79
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
> @@ -0,0 +1,122 @@
> +/* PR middle-end/22141 */
> +
> +extern void abort (void);
> +
> +struct S
> +{
> +  struct T
> +    {
> +      char a;
> +      char b;
> +      char c;
> +      char d;
> +    } t;
> +} u __attribute__((aligned));
> +
> +struct U
> +{
> +  struct S s[4];
> +};
> +
> +void __attribute__((noinline))
> +c1 (struct T *p)
> +{
> +  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
> +    abort ();
> +  __builtin_memset (p, 0xaa, sizeof (*p));
> +}
> +
> +void __attribute__((noinline))
> +c2 (struct S *p)
> +{
> +  c1 (&p->t);
> +}
> +
> +void __attribute__((noinline))
> +c3 (struct U *p)
> +{
> +  c2 (&p->s[2]);
> +}
> +
> +void __attribute__((noinline))
> +f1 (void)
> +{
> +  u = (struct S) { { 1, 2, 3, 4 } };
> +}
> +
> +void __attribute__((noinline))
> +f2 (void)
> +{
> +  u.t.a = 1;
> +  u.t.b = 2;
> +  u.t.c = 3;
> +  u.t.d = 4;
> +}
> +
> +void __attribute__((noinline))
> +f3 (void)
> +{
> +  u.t.d = 4;
> +  u.t.b = 2;
> +  u.t.a = 1;
> +  u.t.c = 3;
> +}
> +
> +void __attribute__((noinline))
> +f4 (void)
> +{
> +  struct S v __attribute__((aligned));
> +  v.t.a = 1;
> +  v.t.b = 2;
> +  v.t.c = 3;
> +  v.t.d = 4;
> +  c2 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f5 (struct S *p)
> +{
> +  p->t.a = 1;
> +  p->t.c = 3;
> +  p->t.d = 4;
> +  p->t.b = 2;
> +}
> +
> +void __attribute__((noinline))
> +f6 (void)
> +{
> +  struct U v __attribute__((aligned));
> +  v.s[2].t.a = 1;
> +  v.s[2].t.b = 2;
> +  v.s[2].t.c = 3;
> +  v.s[2].t.d = 4;
> +  c3 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f7 (struct U *p)
> +{
> +  p->s[2].t.a = 1;
> +  p->s[2].t.c = 3;
> +  p->s[2].t.d = 4;
> +  p->s[2].t.b = 2;
> +}
> +
> +int
> +main (void)
> +{
> +  struct U w __attribute__((aligned));
> +  f1 ();
> +  c2 (&u);
> +  f2 ();
> +  c1 (&u.t);
> +  f3 ();
> +  c2 (&u);
> +  f4 ();
> +  f5 (&u);
> +  c2 (&u);
> +  f6 ();
> +  f7 (&w);
> +  c3 (&w);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/store_merging_1.c 
> b/gcc/testsuite/gcc.dg/store_merging_1.c
> new file mode 100644
> index 0000000..09a4d14
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +struct bar {
> +  int a;
> +  char b;
> +  char c;
> +  char d;
> +  char e;
> +  char f;
> +  char g;
> +};
> +
> +void
> +foo1 (struct bar *p)
> +{
> +  p->b = 0;
> +  p->a = 0;
> +  p->c = 0;
> +  p->d = 0;
> +  p->e = 0;
> +}
> +
> +void
> +foo2 (struct bar *p)
> +{
> +  p->b = 0;
> +  p->a = 0;
> +  p->c = 1;
> +  p->d = 0;
> +  p->e = 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" 
> } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_2.c 
> b/gcc/testsuite/gcc.dg/store_merging_2.c
> new file mode 100644
> index 0000000..d3acc2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_2.c
> @@ -0,0 +1,80 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +struct bar
> +{
> +  int a;
> +  unsigned char b;
> +  unsigned char c;
> +  short d;
> +  unsigned char e;
> +  unsigned char f;
> +  unsigned char g;
> +};
> +
> +__attribute__ ((noinline)) void
> +foozero (struct bar *p)
> +{
> +  p->b = 0;
> +  p->a = 0;
> +  p->c = 0;
> +  p->d = 0;
> +  p->e = 0;
> +  p->f = 0;
> +  p->g = 0;
> +}
> +
> +__attribute__ ((noinline)) void
> +foo1 (struct bar *p)
> +{
> +  p->b = 1;
> +  p->a = 2;
> +  p->c = 3;
> +  p->d = 4;
> +  p->e = 5;
> +  p->f = 0;
> +  p->g = 0xff;
> +}
> +
> +__attribute__ ((noinline)) void
> +foo2 (struct bar *p, struct bar *p2)
> +{
> +  p->b = 0xff;
> +  p2->b = 0xa;
> +  p->a = 0xfffff;
> +  p2->c = 0xc;
> +  p->c = 0xff;
> +  p2->d = 0xbf;
> +  p->d = 0xfff;
> +}
> +
> +int
> +main (void)
> +{
> +  struct bar b1, b2;
> +  foozero (&b1);
> +  foozero (&b2);
> +
> +  foo1 (&b1);
> +  if (b1.b != 1 || b1.a != 2 || b1.c != 3 || b1.d != 4 || b1.e != 5
> +      || b1.f != 0 || b1.g != 0xff)
> +    __builtin_abort ();
> +
> +  foozero (&b1);
> +  /* Make sure writes to aliasing struct pointers preserve the
> +     correct order.  */
> +  foo2 (&b1, &b1);
> +  if (b1.b != 0xa || b1.a != 0xfffff || b1.c != 0xff || b1.d != 0xfff)
> +    __builtin_abort ();
> +
> +  foozero (&b1);
> +  foo2 (&b1, &b2);
> +  if (b1.a != 0xfffff || b1.b != 0xff || b1.c != 0xff || b1.d != 0xfff
> +      || b2.b != 0xa || b2.c != 0xc || b2.d != 0xbf)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" 
> } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_3.c 
> b/gcc/testsuite/gcc.dg/store_merging_3.c
> new file mode 100644
> index 0000000..cd756c1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_3.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +/* Make sure stores to volatile addresses don't get combined with
> +   other accesses.  */
> +
> +struct bar
> +{
> +  int a;
> +  char b;
> +  char c;
> +  volatile short d;
> +  char e;
> +  char f;
> +  char g;
> +};
> +
> +void
> +foozero (struct bar *p)
> +{
> +  p->b = 0xa;
> +  p->a = 0xb;
> +  p->c = 0xc;
> +  p->d = 0;
> +  p->e = 0xd;
> +  p->f = 0xe;
> +  p->g = 0xf;
> +}
> +
> +/* { dg-final { scan-tree-dump "Volatile access terminates all chains" 
> "store-merging" } } */
> +/* { dg-final { scan-tree-dump-times "=\{v\} 0;" 1 "store-merging" } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_4.c 
> b/gcc/testsuite/gcc.dg/store_merging_4.c
> new file mode 100644
> index 0000000..4bf9025
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_4.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +/* Check that we can merge interleaving stores that are guaranteed
> +   to be non-aliasing.  */
> +
> +struct bar
> +{
> +  int a;
> +  char b;
> +  char c;
> +  short d;
> +  char e;
> +  char f;
> +  char g;
> +};
> +
> +void
> +foozero (struct bar *restrict p, struct bar *restrict p2)
> +{
> +  p->b = 0xff;
> +  p2->b = 0xa;
> +  p->a = 0xfffff;
> +  p2->a = 0xab;
> +  p2->c = 0xc;
> +  p->c = 0xff;
> +  p2->d = 0xbf;
> +  p->d = 0xfff;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" 
> } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_5.c 
> b/gcc/testsuite/gcc.dg/store_merging_5.c
> new file mode 100644
> index 0000000..3b82420
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_5.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +/* Make sure that non-aliasing non-constant interspersed stores do not
> +   stop chains.  */
> +
> +struct bar {
> +  int a;
> +  char b;
> +  char c;
> +  char d;
> +  char e;
> +  char g;
> +};
> +
> +void
> +foo1 (struct bar *p, char tmp)
> +{
> +  p->a = 0;
> +  p->b = 0;
> +  p->g = tmp;
> +  p->c = 0;
> +  p->d = 0;
> +  p->e = 0;
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 1 "store-merging" 
> } } */
> +/* { dg-final { scan-tree-dump-times "MEM\\\[.*\\\]" 1 "store-merging" } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_6.c 
> b/gcc/testsuite/gcc.dg/store_merging_6.c
> new file mode 100644
> index 0000000..7d89baf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_6.c
> @@ -0,0 +1,53 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +/* Check that we can widen accesses to bitfields.  */
> +
> +struct bar {
> +  int a : 3;
> +  unsigned char b : 4;
> +  unsigned char c : 1;
> +  char d;
> +  char e;
> +  char f;
> +  char g;
> +};
> +
> +__attribute__ ((noinline)) void
> +foozero (struct bar *p)
> +{
> +  p->b = 0;
> +  p->a = 0;
> +  p->c = 0;
> +  p->d = 0;
> +  p->e = 0;
> +  p->f = 0;
> +  p->g = 0;
> +}
> +
> +__attribute__ ((noinline)) void
> +foo1 (struct bar *p)
> +{
> +  p->b = 3;
> +  p->a = 2;
> +  p->c = 1;
> +  p->d = 4;
> +  p->e = 5;
> +}
> +
> +int
> +main (void)
> +{
> +  struct bar p;
> +  foozero (&p);
> +  foo1 (&p);
> +  if (p.a != 2 || p.b != 3 || p.c != 1 || p.d != 4 || p.e != 5
> +      || p.f != 0 || p.g != 0)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" 
> } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_7.c 
> b/gcc/testsuite/gcc.dg/store_merging_7.c
> new file mode 100644
> index 0000000..02008f7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_7.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target non_strict_align } */
> +/* { dg-options "-O -fdump-tree-store-merging" } */
> +
> +/* Check that we can merge consecutive array members through the pointer.
> +   PR rtl-optimization/23684.  */
> +
> +void
> +foo (char *input)
> +{
> +  input = __builtin_assume_aligned (input, 8);
> +  input[0] = 'H';
> +  input[1] = 'e';
> +  input[2] = 'l';
> +  input[3] = 'l';
> +  input[4] = 'o';
> +  input[5] = ' ';
> +  input[6] = 'w';
> +  input[7] = 'o';
> +  input[8] = 'r';
> +  input[9] = 'l';
> +  input[10] = 'd';
> +  input[11] = '\0';
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 1 "store-merging" 
> } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c 
> b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
> index f02e55f..9de4e77 100644
> --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
> @@ -3,22 +3,22 @@
>  int arr[4][4];
>  
>  void
> -foo ()
> +foo (int x, int y)
>  {
> -  arr[0][1] = 1;
> -  arr[1][0] = -1;
> -  arr[2][0] = 1;
> -  arr[1][1] = -1;
> -  arr[0][2] = 1;
> -  arr[0][3] = -1;
> -  arr[1][2] = 1;
> -  arr[2][1] = -1;
> -  arr[3][0] = 1;
> -  arr[3][1] = -1;
> -  arr[2][2] = 1;
> -  arr[1][3] = -1;
> -  arr[2][3] = 1;
> -  arr[3][2] = -1;
> +  arr[0][1] = x;
> +  arr[1][0] = y;
> +  arr[2][0] = x;
> +  arr[1][1] = y;
> +  arr[0][2] = x;
> +  arr[0][3] = y;
> +  arr[1][2] = x;
> +  arr[2][1] = y;
> +  arr[3][0] = x;
> +  arr[3][1] = y;
> +  arr[2][2] = x;
> +  arr[1][3] = y;
> +  arr[2][3] = x;
> +  arr[3][2] = y;
>  }
>  
>  /* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 7 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c 
> b/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c
> index 40056b1..824f0d2 100644
> --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c
> @@ -3,22 +3,22 @@
>  float arr[4][4];
>  
>  void
> -foo ()
> +foo (float x, float y)
>  {
> -  arr[0][1] = 1;
> -  arr[1][0] = -1;
> -  arr[2][0] = 1;
> -  arr[1][1] = -1;
> -  arr[0][2] = 1;
> -  arr[0][3] = -1;
> -  arr[1][2] = 1;
> -  arr[2][1] = -1;
> -  arr[3][0] = 1;
> -  arr[3][1] = -1;
> -  arr[2][2] = 1;
> -  arr[1][3] = -1;
> -  arr[2][3] = 1;
> -  arr[3][2] = -1;
> +  arr[0][1] = x;
> +  arr[1][0] = y;
> +  arr[2][0] = x;
> +  arr[1][1] = y;
> +  arr[0][2] = x;
> +  arr[0][3] = y;
> +  arr[1][2] = x;
> +  arr[2][1] = y;
> +  arr[3][0] = x;
> +  arr[3][1] = y;
> +  arr[2][2] = x;
> +  arr[1][3] = y;
> +  arr[2][3] = x;
> +  arr[3][2] = y;
>  }
>  
>  /* { dg-final { scan-assembler-times "stp\ts\[0-9\]+, s\[0-9\]" 7 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr22141.c 
> b/gcc/testsuite/gcc.target/i386/pr22141.c
> new file mode 100644
> index 0000000..036422e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr22141.c
> @@ -0,0 +1,126 @@
> +/* PR middle-end/22141 */
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +extern void abort (void);
> +
> +struct S
> +{
> +  struct T
> +    {
> +      char a;
> +      char b;
> +      char c;
> +      char d;
> +    } t;
> +} u;
> +
> +struct U
> +{
> +  struct S s[4];
> +};
> +
> +void __attribute__((noinline))
> +c1 (struct T *p)
> +{
> +  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
> +    abort ();
> +  __builtin_memset (p, 0xaa, sizeof (*p));
> +}
> +
> +void __attribute__((noinline))
> +c2 (struct S *p)
> +{
> +  c1 (&p->t);
> +}
> +
> +void __attribute__((noinline))
> +c3 (struct U *p)
> +{
> +  c2 (&p->s[2]);
> +}
> +
> +void __attribute__((noinline))
> +f1 (void)
> +{
> +  u = (struct S) { { 1, 2, 3, 4 } };
> +}
> +
> +void __attribute__((noinline))
> +f2 (void)
> +{
> +  u.t.a = 1;
> +  u.t.b = 2;
> +  u.t.c = 3;
> +  u.t.d = 4;
> +}
> +
> +void __attribute__((noinline))
> +f3 (void)
> +{
> +  u.t.d = 4;
> +  u.t.b = 2;
> +  u.t.a = 1;
> +  u.t.c = 3;
> +}
> +
> +void __attribute__((noinline))
> +f4 (void)
> +{
> +  struct S v;
> +  v.t.a = 1;
> +  v.t.b = 2;
> +  v.t.c = 3;
> +  v.t.d = 4;
> +  c2 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f5 (struct S *p)
> +{
> +  p->t.a = 1;
> +  p->t.c = 3;
> +  p->t.d = 4;
> +  p->t.b = 2;
> +}
> +
> +void __attribute__((noinline))
> +f6 (void)
> +{
> +  struct U v;
> +  v.s[2].t.a = 1;
> +  v.s[2].t.b = 2;
> +  v.s[2].t.c = 3;
> +  v.s[2].t.d = 4;
> +  c3 (&v);
> +}
> +
> +void __attribute__((noinline))
> +f7 (struct U *p)
> +{
> +  p->s[2].t.a = 1;
> +  p->s[2].t.c = 3;
> +  p->s[2].t.d = 4;
> +  p->s[2].t.b = 2;
> +}
> +
> +int
> +main (void)
> +{
> +  struct U w;
> +  f1 ();
> +  c2 (&u);
> +  f2 ();
> +  c1 (&u.t);
> +  f3 ();
> +  c2 (&u);
> +  f4 ();
> +  f5 (&u);
> +  c2 (&u);
> +  f6 ();
> +  f7 (&w);
> +  c3 (&w);
> +  return 0;
> +}
> +
> +/* { dg-final { scan-assembler-times "67305985\|4030201" 7 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr34012.c 
> b/gcc/testsuite/gcc.target/i386/pr34012.c
> index 00b1240..d0cffa0 100644
> --- a/gcc/testsuite/gcc.target/i386/pr34012.c
> +++ b/gcc/testsuite/gcc.target/i386/pr34012.c
> @@ -1,7 +1,7 @@
>  /* PR rtl-optimization/34012 */
>  /* { dg-do compile } */
>  /* { dg-require-effective-target lp64 } */
> -/* { dg-options "-O2" } */
> +/* { dg-options "-O2 -fno-store-merging" } */
>  
>  void bar (long int *);
>  void
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index a706729..b5373a3 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -425,6 +425,7 @@ extern gimple_opt_pass *make_pass_late_warn_uninitialized 
> (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_cse_reciprocals (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_store_merging (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context 
> *ctxt);

Re: [PATCH][v4] GIMPLE store merging pass

Reply via email to