On 3/8/2022 3:10 AM, Richard Biener via Gcc-patches wrote:
On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:

This patch is the main middle-end piece of a fix for PR tree-opt/98335,
which is a code-quality regression affecting mainline.  The issue occurs
in DSE's (dead store elimination's) compute_trims function that determines
where a store to memory can be trimmed.  In the testcase given in the
PR, this function notices that the first byte of a DImode store is dead,
and replaces the 8-byte store at (aligned) offset zero, with a 7-byte store
at (unaligned) offset one.  Most architectures can store a power-of-two
number of bytes (up to some maximum) in a single instruction, so writing
7 bytes requires more instructions than writing 8.  This patch follows
Jakub Jelinek's suggestion in comment 5 that compute_trims needs
improved heuristics.

In this patch, the decision of whether and how to align trim_head is
based on the number of bytes being written, the alignment of the start
of the object, and where within the object the first byte is written.
The first
tests check whether we're already writing to the start of the object,
and that we're writing three or more bytes.  If we're only writing one
or two bytes, there's no benefit from providing additional alignment.
Then we determine the alignment of the object, which is either 1, 2,
4, 8 or 16 byte aligned (capping at 16 guarantees that we never write
more than 7 bytes beyond the minimum required).  If the buffer is only
1- or 2-byte aligned, there's no benefit from additional alignment.  For
the remaining cases, alignment of trim_head is based upon where within
each aligned block (word) the first byte is written.  For example,
storing the last byte (or last half-word) of a word can be performed
with a single insn.
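
In outline, the logic looks something like the following (a simplified
sketch for exposition only; the helper name and the exact rule here are
illustrative, not lifted verbatim from the patch):

/* TRIM_HEAD is the byte offset of the first live byte within the
   object, NBYTES the number of live bytes, and ALIGN the object's
   alignment in bytes (a power of two, capped at 16).  */

static int
align_trim_head (int trim_head, int nbytes, unsigned int align)
{
  if (trim_head == 0 || nbytes < 3)
    return trim_head;           /* Nothing to gain.  */
  if (align > 16)
    align = 16;                 /* Never overshoot by more than 7.  */
  if (align <= 2)
    return trim_head;           /* Too weakly aligned to help.  */

  /* Keep TRIM_HEAD if the tail of its aligned word can be written
     with one naturally aligned power-of-two store (the last byte,
     the last half-word, ...); otherwise round down to the start of
     the word.  */
  unsigned int rest = align - (trim_head & (align - 1));
  if ((rest & (rest - 1)) == 0)
    return trim_head;
  return trim_head & ~(align - 1);
}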

On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:

         movl    $0, -24(%rsp)
         movabsq $72057594037927935, %rdx
         movl    $0, -21(%rsp)
         andq    -24(%rsp), %rdx
         movq    %rdx, %rax
         salq    $8, %rax
         movb    c(%rip), %al
         ret

to

         xorl    %eax, %eax
         movb    c(%rip), %al
         ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  I've also added new testcases
for the original motivating PR tree-optimization/86010, to ensure that
those remain optimized (in future).  Ok for mainline?

diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
index 2b22a61..080e406 100644
--- a/gcc/tree-ssa-dse.cc
+++ b/gcc/tree-ssa-dse.cc
@@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
    int first_live = bitmap_first_set_bit (live);
    *trim_head = first_live - first_orig;

-  /* If more than a word remains, then make sure to keep the
-     starting point at least word aligned.  */
-  if (last_live - first_live > UNITS_PER_WORD)
-    *trim_head &= ~(UNITS_PER_WORD - 1);
+  /* If REF is aligned, try to maintain this alignment if it reduces
+     the number of (power-of-two sized aligned) writes to memory.
+     First check that we're writing >= 3 bytes at a non-zero offset.  */
+  if (first_live
+      && last_live - first_live >= 2)
+    {
+      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));

You can't simply use TYPE_ALIGN_* on ref->base.  You can use
get_object_alignment on ref->ref, but ref->ref can be NULL in case the
ref was initialized from a builtin call like memcpy.

Also, ref->base is offset by ref->offset, which you don't seem to
account for.  In theory one could export get_object_alignment_2 and,
if ref->ref is NULL, use that on ref->base, passing addr_p = true,
then adjust the resulting bitpos by ref->offset and fix align
accordingly (trimming might also align an access if the original
access was offset from a known alignment).

That said, a helper like ao_ref_alignment () might be useful here.
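
Something along these lines, say (an untested sketch: it assumes
get_object_alignment_2 gets exported from builtins.cc, and it
conservatively punts on variable offsets):

bool
ao_ref_alignment (ao_ref *ref, unsigned int *align,
                  unsigned HOST_WIDE_INT *bitpos)
{
  if (ref->ref)
    return get_object_alignment_1 (ref->ref, align, bitpos);

  /* ref->ref is NULL when the ref was initialized from a builtin
     call like memcpy; fall back to the base, treated as an
     address.  */
  if (!get_object_alignment_2 (ref->base, align, bitpos,
                               /*addr_p=*/true))
    return false;

  /* Adjust the misalignment by ref->offset (in bits); give up if
     the offset isn't compile-time constant.  */
  HOST_WIDE_INT offset;
  if (!ref->offset.is_constant (&offset))
    return false;
  *bitpos = (*bitpos + offset) & (*align - 1);
  return true;
}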

I wonder if we can apply good heuristics to compute_trims without
taking context into account.  maybe_trim_complex_store is already
limiting itself to useful subsets, and the constructor and mem* cases
will only benefit if they end up being expanded inline via
*_by_pieces, not if expanded as a call.

Trimming heuristics are minimal and localized to what we know about a
given statement.  Alignment is probably the most important thing we
need to keep in mind.

For complex types, trimming really means removing one of the two
loads/stores, and that was the motivating case behind introducing
trimming to begin with -- the memcpy stuff and friends came along for
the ride later.
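
For instance (illustrative only):

  _Complex double f (_Complex double a)
  {
    _Complex double x = a;  /* stores both the real and imag parts */
    __imag__ x = 0.0;       /* the imag part of the first store is dead */
    return x;
  }

where the first statement's store can be trimmed to just the real part.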

For trimming on mem*, str*, it can still be beneficial if they're not
expanded by pieces, though it's more beneficial to trim in the cases
where they do get expanded by pieces.


You don't seem to adjust *trim_tail at all; if an aligned 16-byte
region is trimmed there by 3, that will result in two extra stores as
well (13 bytes takes an 8-, a 4- and a 1-byte store instead of a
single 16-byte one), no?

It would, and that's probably bad.

Jeff
