Re: [TCWG CI] 471.omnetpp slowed down by 8% after gcc: Avoid invalid loop transformations in jump threading registry.

Maxim Kuvyrkov via Gcc Mon, 27 Sep 2021 08:40:13 -0700

> On 27 Sep 2021, at 16:52, Aldy Hernandez <[email protected]> wrote:
> 
> [CCing Jeff and list for broader audience]
> 
> On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote:
>> Hi Aldy,
>> Your patch seems to slow down 471.omnetpp by 8% at -O3.  Could you please 
>> take a look if this is something that could be easily fixed?
> 
> First of all, thanks for chasing this down.  It's incredibly useful to have 
> these types of bug reports.


Thanks, Aldy, this is music to my ears :-).

We have built this automated benchmarking CI that bisects code-speed and 
code-size regressions down to a single commit.  It is still work-in-progress, 
and I’m forwarding these reports to patch authors, whose patches caused 
regressions.  If GCC community finds these useful, we can also setup posting to 
one of GCC’s mailing lists.

> 
> Jeff and I have been discussing the repercussions of adjusting the loop 
> crossing restrictions in the various threaders.  He's seen some regressions 
> in embedded targets when disallowing certain corner cases of loop crossing 
> threads causes all sorts of grief.
> 
> Out of curiosity, does the attached (untested) patch fix the regression?

I’ll test the patch and will follow up.

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org


> 
> Aldy
> 
>> Regards,
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>>> On 27 Sep 2021, at 02:52, [email protected] wrote:
>>> 
>>> After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
>>> Author: Aldy Hernandez <[email protected]>
>>> 
>>>    Avoid invalid loop transformations in jump threading registry.
>>> 
>>> the following benchmarks slowed down by more than 2%:
>>> - 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples
>>> 
>>> Below reproducer instructions can be used to re-build both "first_bad" and 
>>> "last_good" cross-toolchains used in this bisection.  Naturally, the 
>>> scripts will fail when triggerring benchmarking jobs if you don't have 
>>> access to Linaro TCWG CI.
>>> 
>>> For your convenience, we have uploaded tarballs with pre-processed source 
>>> and assembly files at:
>>> - First_bad save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/save-temps/
>>> - Last_good save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/save-temps/
>>> - Baseline save-temps: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/save-temps/
>>> 
>>> Configuration:
>>> - Benchmark: SPEC CPU2006
>>> - Toolchain: GCC + Glibc + GNU Linker
>>> - Version: all components were built from their tip of trunk
>>> - Target: arm-linux-gnueabihf
>>> - Compiler flags: -O3 -marm
>>> - Hardware: NVidia TK1 4x Cortex-A15
>>> 
>>> This benchmarking CI is work-in-progress, and we welcome feedback and 
>>> suggestions at [email protected] .  In our improvement 
>>> plans is to add support for SPEC CPU2017 benchmarks and provide "perf 
>>> report/annotate" data behind these reports.
>>> 
>>> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
>>> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>>> 
>>> This commit has regressed these CI configurations:
>>> - tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3
>>> 
>>> First_bad build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/
>>> Last_good build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/
>>> Baseline build: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/
>>> Even more details: 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/
>>> 
>>> Reproduce builds:
>>> <cut>
>>> mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
>>> cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
>>> 
>>> # Fetch scripts
>>> git clone https://git.linaro.org/toolchain/jenkins-scripts
>>> 
>>> # Fetch manifests and test.sh script
>>> mkdir -p artifacts/manifests
>>> curl -o artifacts/manifests/build-baseline.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-baseline.sh
>>>  --fail
>>> curl -o artifacts/manifests/build-parameters.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-parameters.sh
>>>  --fail
>>> curl -o artifacts/test.sh 
>>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/test.sh
>>>  --fail
>>> chmod +x artifacts/test.sh
>>> 
>>> # Reproduce the baseline build (build all pre-requisites)
>>> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>>> 
>>> # Save baseline build state (which is then restored in artifacts/test.sh)
>>> mkdir -p ./bisect
>>> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ 
>>> --exclude /gcc/ ./ ./bisect/baseline/
>>> 
>>> cd gcc
>>> 
>>> # Reproduce first_bad build
>>> git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
>>> ../artifacts/test.sh
>>> 
>>> # Reproduce last_good build
>>> git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
>>> ../artifacts/test.sh
>>> 
>>> cd ..
>>> </cut>
>>> 
>>> Full commit (up to 1000 lines):
>>> <cut>
>>> commit 4a960d548b7d7d942f316c5295f6d849b74214f5
>>> Author: Aldy Hernandez <[email protected]>
>>> Date:   Thu Sep 23 10:59:24 2021 +0200
>>> 
>>>    Avoid invalid loop transformations in jump threading registry.
>>> 
>>>    My upcoming improvements to the forward jump threader make it thread
>>>    more aggressively.  In investigating some "regressions", I noticed
>>>    that it has always allowed threading through empty latches and across
>>>    loop boundaries.  As we have discussed recently, this should be avoided
>>>    until after loop optimizations have run their course.
>>> 
>>>    Note that this wasn't much of a problem before because DOM/VRP
>>>    couldn't find these opportunities, but with a smarter solver, we trip
>>>    over them more easily.
>>> 
>>>    Because the forward threader doesn't have an independent localized cost
>>>    model like the new threader (profitable_path_p), it is difficult to
>>>    catch these things at discovery.  However, we can catch them at
>>>    registration time, with the added benefit that all the threaders
>>>    (forward and backward) can share the handcuffs.
>>> 
>>>    This patch is an adaptation of what we do in the backward threader, but
>>>    it is not meant to catch everything we do there, as some of the
>>>    restrictions there are due to limitations of the different block
>>>    copiers (for example, the generic copier does not re-use existing
>>>    threading paths).
>>> 
>>>    We could ideally remove the now redundant bits in profitable_path_p, but
>>>    I would prefer not to for two reasons.  First, the backward threader uses
>>>    profitable_path_p as it discovers paths to avoid discovering paths in
>>>    unprofitable directions.  Second, I would like to merge all the forward
>>>    cost restrictions into the profitability class in the backward threader,
>>>    not the other way around.  Alas, that reshuffling will have to wait for
>>>    the next release.
>>> 
>>>    As usual, there are quite a few tests that needed adjustments.  It seems
>>>    we were quite happily threading improper scenarios.  With most of them,
>>>    as can be seen in pr77445-2.c, we're merely shifting the threading to
>>>    after loop optimizations.
>>> 
>>>    Tested on x86-64 Linux.
>>> 
>>>    gcc/ChangeLog:
>>> 
>>>            * tree-ssa-threadupdate.c 
>>> (jt_path_registry::cancel_invalid_paths):
>>>            New.
>>>            (jt_path_registry::register_jump_thread): Call
>>>            cancel_invalid_paths.
>>>            * tree-ssa-threadupdate.h (class jt_path_registry): Add
>>>            cancel_invalid_paths.
>>> 
>>>    gcc/testsuite/ChangeLog:
>>> 
>>>            * gcc.dg/tree-ssa/20030714-2.c: Adjust.
>>>            * gcc.dg/tree-ssa/pr66752-3.c: Adjust.
>>>            * gcc.dg/tree-ssa/pr77445-2.c: Adjust.
>>>            * gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
>>>            * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
>>>            * gcc.dg/vect/bb-slp-16.c: Adjust.
>>> ---
>>> gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c        |  7 ++-
>>> gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c         | 19 ++++---
>>> gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c         |  4 +-
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c |  4 +-
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c  |  4 +-
>>> gcc/testsuite/gcc.dg/vect/bb-slp-16.c             |  7 ---
>>> gcc/tree-ssa-threadupdate.c                       | 67 
>>> ++++++++++++++++++-----
>>> gcc/tree-ssa-threadupdate.h                       |  1 +
>>> 8 files changed, 78 insertions(+), 35 deletions(-)
>>> 
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c 
>>> b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
>>> index eb663f2ff5b..9585ff11307 100644
>>> --- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
>>> @@ -32,7 +32,8 @@ get_alias_set (t)
>>>     }
>>> }
>>> 
>>> -/* There should be exactly three IF conditionals if we thread jumps
>>> -   properly.  */
>>> -/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
>>> +/* There should be exactly 4 IF conditionals if we thread jumps
>>> +   properly.  There used to be 3, but one thread was crossing
>>> +   loops.  */
>>> +/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
>>> 
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c 
>>> b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
>>> index e1464e21170..922a331b217 100644
>>> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
>>> @@ -1,5 +1,5 @@
>>> /* { dg-do compile } */
>>> -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
>>> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
>>> 
>>> extern int status, pt;
>>> extern int count;
>>> @@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
>>>    pt--;
>>> }
>>> 
>>> -/* There are 4 jump threading opportunities, all of which will be
>>> -   realized, which will eliminate testing of FLAG, completely.  */
>>> -/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
>>> +/* There are 2 jump threading opportunities (which don't cross loops),
>>> +   all of which will be realized, which will eliminate testing of
>>> +   FLAG, completely.  */
>>> +/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
>>> 
>>> -/* There should be no assignments or references to FLAG, verify they're
>>> -   eliminated as early as possible.  */
>>> -/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
>>> +/* We used to remove references to FLAG by DCE2, but this was
>>> +   depending on early threaders threading through loop boundaries
>>> +   (which we shouldn't do).  However, the late threading passes, which
>>> +   run after loop optimizations , can successfully eliminate the
>>> +   references to FLAG.  Verify that ther are no references by the late
>>> +   threading passes.  */
>>> +/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c 
>>> b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
>>> index f9fc212f49e..01a0f1f197d 100644
>>> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
>>> @@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
>>>    aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
>>>    to change decisions in switch expansion which in turn can expose new
>>>    jump threading opportunities.  Skip the later tests on aarch64.  */
>>> -/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
>>> -/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
>>> +/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
>>> +/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
>>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
>>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
>>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target 
>>> { ! aarch64*-*-* } } } } */
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c 
>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>>> index 60d4f76f076..2d78d045516 100644
>>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>>> @@ -21,5 +21,7 @@
>>>      condition.
>>> 
>>>    All the cases are picked up by VRP1 as jump threads.  */
>>> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
>>> +
>>> +/* There used to be 6 jump threads found by thread1, but they all
>>> +   depended on threading through distinct loops in ethread.  */
>>> /* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c 
>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
>>> index e3d4b311c03..16abcde5053 100644
>>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
>>> @@ -1,8 +1,8 @@
>>> /* { dg-do compile } */
>>> /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats 
>>> -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats 
>>> -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
>>> 
>>> -/* { dg-final { scan-tree-dump "Jumps threaded: 18"  "thread1" } } */
>>> -/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! 
>>> aarch64*-*-* } } } } */
>>> +/* { dg-final { scan-tree-dump "Jumps threaded: 12"  "thread1" } } */
>>> +/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! 
>>> aarch64*-*-* } } } } */
>>> /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom2" } } */
>>> 
>>> /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
>>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c 
>>> b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>>> index 664e93e9b60..e68a9b62535 100644
>>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>>> @@ -1,8 +1,5 @@
>>> /* { dg-require-effective-target vect_int } */
>>> 
>>> -/* See note below as to why we disable threading.  */
>>> -/* { dg-additional-options "-fdisable-tree-thread1" } */
>>> -
>>> #include <stdarg.h>
>>> #include "tree-vect.h"
>>> 
>>> @@ -30,10 +27,6 @@ main1 (int dummy)
>>>       *pout++ = *pin++ + a;
>>>       *pout++ = *pin++ + a;
>>>       *pout++ = *pin++ + a;
>>> -      /* In some architectures like ppc64, jump threading may thread
>>> -    the iteration where i==0 such that we no longer optimize the
>>> -    BB.  Another alternative to disable jump threading would be
>>> -    to wrap the read from `i' into a function returning i.  */
>>>       if (arr[i] = i)
>>>         a = i;
>>>       else
>>> diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
>>> index baac11280fa..2b9b8f81274 100644
>>> --- a/gcc/tree-ssa-threadupdate.c
>>> +++ b/gcc/tree-ssa-threadupdate.c
>>> @@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool 
>>> may_peel_loop_headers)
>>>   return retval;
>>> }
>>> 
>>> +bool
>>> +jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
>>> +{
>>> +  gcc_checking_assert (!path.is_empty ());
>>> +  edge taken_edge = path[path.length () - 1]->e;
>>> +  loop_p loop = taken_edge->src->loop_father;
>>> +  bool seen_latch = false;
>>> +  bool path_crosses_loops = false;
>>> +
>>> +  for (unsigned int i = 0; i < path.length (); i++)
>>> +    {
>>> +      edge e = path[i]->e;
>>> +
>>> +      if (e == NULL)
>>> +   {
>>> +     // NULL outgoing edges on a path can happen for jumping to a
>>> +     // constant address.
>>> +     cancel_thread (&path, "Found NULL edge in jump threading path");
>>> +     return true;
>>> +   }
>>> +
>>> +      if (loop->latch == e->src || loop->latch == e->dest)
>>> +   seen_latch = true;
>>> +
>>> +      // The first entry represents the block with an outgoing edge
>>> +      // that we will redirect to the jump threading path.  Thus we
>>> +      // don't care about that block's loop father.
>>> +      if ((i > 0 && e->src->loop_father != loop)
>>> +     || e->dest->loop_father != loop)
>>> +   path_crosses_loops = true;
>>> +
>>> +      if (flag_checking && !m_backedge_threads)
>>> +   gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
>>> +    }
>>> +
>>> +  if (cfun->curr_properties & PROP_loop_opts_done)
>>> +    return false;
>>> +
>>> +  if (seen_latch && empty_block_p (loop->latch))
>>> +    {
>>> +      cancel_thread (&path, "Threading through latch before loop opts "
>>> +                "would create non-empty latch");
>>> +      return true;
>>> +    }
>>> +  if (path_crosses_loops)
>>> +    {
>>> +      cancel_thread (&path, "Path crosses loops");
>>> +      return true;
>>> +    }
>>> +  return false;
>>> +}
>>> +
>>> /* Register a jump threading opportunity.  We queue up all the jump
>>>    threading opportunities discovered by a pass and update the CFG
>>>    and SSA form all at once.
>>> @@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread 
>>> (vec<jump_thread_edge *> *path)
>>>       return false;
>>>     }
>>> 
>>> -  /* First make sure there are no NULL outgoing edges on the jump threading
>>> -     path.  That can happen for jumping to a constant address.  */
>>> -  for (unsigned int i = 0; i < path->length (); i++)
>>> -    {
>>> -      if ((*path)[i]->e == NULL)
>>> -   {
>>> -     cancel_thread (path, "Found NULL edge in jump threading path");
>>> -     return false;
>>> -   }
>>> -
>>> -      if (flag_checking && !m_backedge_threads)
>>> -   gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
>>> -    }
>>> +  if (cancel_invalid_paths (*path))
>>> +    return false;
>>> 
>>>   if (dump_file && (dump_flags & TDF_DETAILS))
>>>     dump_jump_thread_path (dump_file, *path, true);
>>> diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
>>> index 8b48a671212..d68795c9f27 100644
>>> --- a/gcc/tree-ssa-threadupdate.h
>>> +++ b/gcc/tree-ssa-threadupdate.h
>>> @@ -75,6 +75,7 @@ protected:
>>>   unsigned long m_num_threaded_edges;
>>> private:
>>>   virtual bool update_cfg (bool peel_loop_headers) = 0;
>>> +  bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
>>>   jump_thread_path_allocator m_allocator;
>>>   // True if threading through back edges is allowed.  This is only
>>>   // allowed in the generic copier in the backward threader.
>>> </cut>
> <jeff.txt>

Re: [TCWG CI] 471.omnetpp slowed down by 8% after gcc: Avoid invalid loop transformations in jump threading registry.

Reply via email to