On Tue, Mar 8, 2022 at 11:10 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> >
> > This patch is the main middle-end piece of a fix for PR tree-opt/98335,
> > which is a code-quality regression affecting mainline.  The issue occurs
> > in DSE's (dead store elimination's) compute_trims function, which
> > determines where a store to memory can be trimmed.  In the testcase
> > given in the PR, this function notices that the first byte of a DImode
> > store is dead and replaces the 8-byte store at (aligned) offset zero
> > with a 7-byte store at (unaligned) offset one.  Most architectures can
> > store a power-of-two number of bytes (up to a maximum) in a single
> > instruction, so writing 7 bytes requires more instructions than writing
> > 8 bytes.  This patch follows Jakub Jelinek's suggestion in comment 5
> > that compute_trims needs improved heuristics.
> >
> > In this patch, the decision of whether and how to align trim_head is
> > based on the number of bytes being written, the alignment of the start
> > of the object, and where within the object the first byte is written.
> > The first tests check whether we're already writing to the start of
> > the object, and that we're writing three or more bytes.  If we're only
> > writing one or two bytes, there's no benefit from providing additional
> > alignment.  Then we determine the alignment of the object, which is
> > either 1, 2, 4, 8 or 16 byte aligned (capping at 16 guarantees that we
> > never write more than 7 bytes beyond the minimum required).  If the
> > buffer is only 1- or 2-byte aligned, there's no benefit from additional
> > alignment.  For the remaining cases, alignment of trim_head is based
> > upon where within each aligned block (word) the first byte is written.
> > For example, storing the last byte (or last half-word) of a word can
> > be performed with a single insn.
> >
> > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:
> >
> >         movl    $0, -24(%rsp)
> >         movabsq $72057594037927935, %rdx
> >         movl    $0, -21(%rsp)
> >         andq    -24(%rsp), %rdx
> >         movq    %rdx, %rax
> >         salq    $8, %rax
> >         movb    c(%rip), %al
> >         ret
> >
> > to
> >
> >         xorl    %eax, %eax
> >         movb    c(%rip), %al
> >         ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check with no new failures.  I've also added new testcases
> > for the original motivating PR tree-optimization/86010, to ensure that
> > those remain optimized (in future).  Ok for mainline?
>
> diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
> index 2b22a61..080e406 100644
> --- a/gcc/tree-ssa-dse.cc
> +++ b/gcc/tree-ssa-dse.cc
> @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
>    int first_live = bitmap_first_set_bit (live);
>    *trim_head = first_live - first_orig;
>
> -  /* If more than a word remains, then make sure to keep the
> -     starting point at least word aligned.  */
> -  if (last_live - first_live > UNITS_PER_WORD)
> -    *trim_head &= ~(UNITS_PER_WORD - 1);
> +  /* If REF is aligned, try to maintain this alignment if it reduces
> +     the number of (power-of-two sized aligned) writes to memory.
> +     First check that we're writing >= 3 bytes at a non-zero offset.  */
> +  if (first_live
> +      && last_live - first_live >= 2)
> +    {
> +      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));
>
> you can't simply use TYPE_ALIGN_* on ref->base.
> You can use get_object_alignment on ref->ref, but ref->ref can be NULL
> in case the ref was initialized from a builtin call like memcpy.
>
> Also ref->base is offset by ref->offset, which you don't seem to
> account for.  In theory one could export get_object_alignment_2 and,
> if ref->ref is NULL, use that on ref->base, passing addr_p = true,
> and then adjust the resulting bitpos by ref->offset and fix align
> accordingly (trimming might also align an access if the original
> access was offset from a known alignment).
>
> That said, a helper like ao_ref_alignment () might be useful here.
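A minimal sketch of what such a helper might look like, pieced together
from the description above (it is not the attachment referred to below):
it would live somewhere like tree-ssa-alias.cc, assumes
get_object_alignment_2 has been exported from builtins.cc, and uses the
already-exported get_object_alignment_1 for the ref->ref case so that
the misalignment is available too.

/* Sketch: return the alignment of access REF in bits in *ALIGN, and in
   *BITPOS the misalignment of the access relative to it, or return
   false if they are unknown.  */

bool
ao_ref_alignment (ao_ref *ref, unsigned int *align,
                  unsigned HOST_WIDE_INT *bitpos)
{
  /* If the reference tree is available, query its alignment directly.  */
  if (ref->ref)
    return get_object_alignment_1 (ref->ref, align, bitpos);

  /* Otherwise compute the alignment of the base address (addr_p = true)
     and adjust the resulting bitpos by the constant bit offset of the
     access, as described above.  */
  HOST_WIDE_INT offset;
  if (ref->offset.is_constant (&offset)
      && get_object_alignment_2 (ref->base, align, bitpos, true))
    {
      *bitpos += (unsigned HOST_WIDE_INT) offset % *align;
      return true;
    }
  return false;
}

compute_trims could then base its decisions on *align / BITS_PER_UNIT
and *bitpos / BITS_PER_UNIT instead of guessing via TYPE_ALIGN_UNIT.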
Like the attached - feel free to use that.

Richard.

> I wonder if we can apply good heuristics to compute_trims without
> taking context into account: maybe_trim_complex_store already limits
> itself to useful subsets, and the constructor and memstar cases will
> only benefit if they end up being expanded inline via *_by_pieces,
> not if expanded as a call.
>
> You don't seem to adjust *trim_tail at all; if an aligned 16-byte
> region is trimmed there by 3, that will result in two extra stores
> as well, no?
>
> +      if (DECL_P (ref->base) && DECL_ALIGN_UNIT (ref->base) > align)
> +        align = DECL_ALIGN_UNIT (ref->base);
> +      if (align > UNITS_PER_WORD)
> +        align = UNITS_PER_WORD;
> +      if (align > 16)
> +        align = 16;
> +      if (align > 2)
> +        {
> +          /* ALIGN is 4, 8 or 16.  */
> +          unsigned int low = first_live & (align - 1);
> +          if (low * 2 < align)
> +            {
> +              if (align == 16 && low >= 4 && last_live < 15)
> +                *trim_head &= ~3;
> +              else
> +                *trim_head &= ~(align - 1);
> +            }
> +          else if (low + 3 == align)
> +            *trim_head &= ~1;
> +          else if (low > 8 && low < 12)
> +            *trim_head &= ~3;
> +        }
> +    }
>
>    if ((*trim_head || *trim_tail)
>        && dump_file && (dump_flags & TDF_DETAILS))
>
> >
> > 2022-03-07  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> > PR tree-optimization/98335
> > * tree-ssa-dse.cc (compute_trims): Improve logic deciding whether
> > to align trim_head, writing more bytes by using fewer instructions.
> >
> > gcc/testsuite/ChangeLog
> > PR tree-optimization/98335
> > * g++.dg/pr98335.C: New test case.
> > * gcc.dg/pr86010.c: New test case.
> > * gcc.dg/pr86010-2.c: New test case.
> >
> > Thanks in advance,
> > Roger
> > --
[Attachment: p (binary data)]
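As a side note, for experimenting with the rounding rules outside of
GCC, here is the hunk's trim_head logic transcribed into a
self-contained program.  round_trim_head and the driver are illustrative
names only; the decision logic is copied verbatim from the patch above,
and the patch's surrounding guard (non-zero offset, at least three live
bytes) is assumed to have been checked by the caller.

#include <cstdio>

/* Illustrative transcription of the patch's rounding of trim_head.
   ALIGN is the capped object alignment in bytes (4, 8 or 16),
   FIRST_LIVE and LAST_LIVE index the first and last live bytes of the
   store, and TRIM_HEAD is the number of dead leading bytes trimmed.  */

static int
round_trim_head (int trim_head, unsigned int align,
                 int first_live, int last_live)
{
  if (align > 2)
    {
      /* ALIGN is 4, 8 or 16.  */
      unsigned int low = first_live & (align - 1);
      if (low * 2 < align)
        {
          if (align == 16 && low >= 4 && last_live < 15)
            trim_head &= ~3;
          else
            trim_head &= ~(align - 1);
        }
      else if (low + 3 == align)
        trim_head &= ~1;
      else if (low > 8 && low < 12)
        trim_head &= ~3;
    }
  return trim_head;
}

int
main ()
{
  /* Bytes 5..7 of an 8-byte aligned store are live: trimming to
     offset 5 would need a byte store plus a half-word store, while
     rounding the trim down to 4 allows a single 4-byte store.  */
  printf ("%d\n", round_trim_head (5, 8, 5, 7));  /* prints 4 */

  /* Bytes 6..8: the live range starts at the last half-word of the
     first word, which a single half-word store covers, so no
     rounding is applied.  */
  printf ("%d\n", round_trim_head (6, 8, 6, 8));  /* prints 6 */
  return 0;
}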