Thanks everyone. I just applied the series, including the requested
changes, to trunk.
We'll also send the committed series (v4) to the list for archival purposes.

Philipp.


On Tue, 20 May 2025 at 14:31, Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu> writes:
> > This patch uses `lowpart_subreg` for the base register
> > initialization, instead of zero-extending it. We had tried this
> > solution before, but it left undefined bytes in the upper part of
> > the register. That shouldn't happen, as we are supposed to write
> > the whole register when the load is eliminated. It occurred when
> > there were multiple stores with the same offset as the load: a
> > register move was generated for each of them, overwriting the bit
> > inserts that had been emitted before them.
> >
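> > For illustration, here is a sketch of the kind of sequence involved
> > (hypothetical, for exposition only; it is not the testcase added by
> > the patch):
> >
> >   /* Two stores to the same bytes, followed by a load that the pass
> >      can eliminate because the stores cover all of its bytes.  The
> >      move emitted for the store of `c' must not clobber the bits
> >      inserted for the store of `b'.  */
> >   typedef unsigned int u32;
> >   typedef unsigned long long u64;
> >
> >   u64 f (char *p, u32 a, u32 b, u32 c)
> >   {
> >     __builtin_memcpy (p, &a, 4);      /* bytes 0-3 */
> >     __builtin_memcpy (p + 4, &b, 4);  /* bytes 4-7 */
> >     __builtin_memcpy (p, &c, 4);      /* bytes 0-3 again, so the
> >                                          store of `a' is redundant */
> >     u64 r;
> >     __builtin_memcpy (&r, p, 8);      /* 8-byte load, fully covered */
> >     return r;
> >   }
> >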
> > In order to overcome this, we now remove redundant stores from the
> > sequence, i.e. stores that write to addresses that are fully
> > overwritten by stores that come after them in the sequence. We use
> > the same bitmap that is used for the load-elimination check to keep
> > track of the bytes that are written by each store.
> >
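> > As a rough standalone model of that bookkeeping (simplified: it uses
> > std::vector<bool> in place of sbitmap, illustrative names, and
> > assumes the stores are visited from the last one in the sequence to
> > the first, which is what lets the already-covered test catch stores
> > that are overwritten later):
> >
> >   #include <vector>
> >
> >   struct store_info { int offset; int size; };
> >
> >   /* Mark the bytes each store writes.  A store whose bytes are all
> >      marked already is fully overwritten by stores visited earlier
> >      (i.e. later in the sequence), so it is redundant.  */
> >   static void
> >   find_redundant (const std::vector<store_info> &stores,
> >                   int load_size, std::vector<int> &redundant)
> >   {
> >     std::vector<bool> written (load_size, false);
> >     for (int i = 0; i < (int) stores.size (); i++)
> >       {
> >         bool covered = true;
> >         for (int b = stores[i].offset;
> >              b < stores[i].offset + stores[i].size; b++)
> >           covered = covered && written[b];
> >         if (covered)
> >           {
> >             redundant.push_back (i);
> >             continue;
> >           }
> >         for (int b = stores[i].offset;
> >              b < stores[i].offset + stores[i].size; b++)
> >           written[b] = true;
> >       }
> >   }
> >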
> > Also, we now allow the load to be eliminated even when there are
> > overlaps between the stores, as there is no obvious reason not to;
> > we just want the stores to cover all of the load's bytes. For
> > example, 4-byte stores at offsets 0, 2 and 4 overlap, but together
> > they still cover all 8 bytes of an 8-byte load.
> >
> > Bootstrapped/regtested on AArch64 and x86_64.
> >
> >         PR rtl-optimization/119884
> >
> > gcc/ChangeLog:
> >
> >         * avoid-store-forwarding.cc (process_store_forwarding):
> >         Use `lowpart_subreg` for the base register initialization,
> >         and remove redundant stores from the store/load sequence.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/pr119884.c: New test.
> >
> > Signed-off-by: Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu>
> > ---
> >
> > Changes in v3:
> > - Remove redundant stores, instead of generating a register move for
> > the first store that has the same offset as the load only.
> >
> > Changes in v2:
> > - Use `lowpart_subreg` for the base register initialization, but
> > only for the first store that has the same offset as the load.
> >
> > Changes in v1:
> > - Add a check for the register modes to match before calling 
> > `emit_mov_insn`.
> >
> >  gcc/avoid-store-forwarding.cc            | 45 ++++++++++++++++++------
> >  gcc/testsuite/gcc.target/i386/pr119884.c | 13 +++++++
> >  2 files changed, 48 insertions(+), 10 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr119884.c
> >
> > diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
> > index 5d960adec359..f88a001e5717 100644
> > --- a/gcc/avoid-store-forwarding.cc
> > +++ b/gcc/avoid-store-forwarding.cc
> > @@ -176,20 +176,28 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
> >    /* Memory sizes should be constants at this stage.  */
> >    HOST_WIDE_INT load_size = MEM_SIZE (load_mem).to_constant ();
> >
> > -  /* If the stores cover all the bytes of the load without overlap then we can
> > -     eliminate the load entirely and use the computed value instead.  */
> > +  /* If the stores cover all the bytes of the load, then we can eliminate
> > +     the load entirely and use the computed value instead.
> > +     We can also eliminate stores on addresses that are overwritten
> > +     by later stores.  */
> >
> >    sbitmap forwarded_bytes = sbitmap_alloc (load_size);
> >    bitmap_clear (forwarded_bytes);
> >
> >    unsigned int i;
> >    store_fwd_info* it;
> > +  auto_vec<store_fwd_info> redundant_stores;
> > +  auto_vec<int> store_ind_to_remove;
> >    FOR_EACH_VEC_ELT (stores, i, it)
> >      {
> >        HOST_WIDE_INT store_size = MEM_SIZE (it->store_mem).to_constant ();
> > -      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
> > +      if (bitmap_is_range_set_p (forwarded_bytes, it->offset,
> >                                it->offset + store_size - 1))
> > -     break;
> > +     {
> > +       redundant_stores.safe_push (*it);
> > +       store_ind_to_remove.safe_push (i);
> > +       continue;
> > +     }
> >        bitmap_set_range (forwarded_bytes, it->offset, store_size);
> >      }
> >
> > @@ -215,6 +223,11 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
> >       fprintf (dump_file, "(Load elimination candidate)\n");
> >      }
> >
> > +  /* Remove redundant stores from the vector.  */
> > +  store_ind_to_remove.reverse ();
> > +  for (int i : store_ind_to_remove)
> > +    stores.ordered_remove (i);
> > +
>
> This is quadratic: each ordered_remove shifts the elements that come
> after the removed one.  That probably doesn't matter in practice
> though, since the dependence checking is already quadratic, and the
> size is already limited by a --param.  But I think it's worth at
> least a comment.  Maybe:
> Maybe:
>
>   /* Remove redundant stores from the vector.  ??? Although this is
>      quadratic, there doesn't seem to be much point optimizing it.
>      The number of redundant stores is expected to be low and the length
>      of the list is limited by a --param.  The dependence checking that
>      we did earlier is also quadratic in the size of this list.  */
>
> From my POV, the patch is OK for trunk with that change (and with the
> obvious rename after the comments on patch 2).  Please leave a couple
> of days for others to comment though.
>
> Thanks,
> Richard
