> On 27 Sep 2021, at 16:52, Aldy Hernandez <al...@redhat.com> wrote: > > [CCing Jeff and list for broader audience] > > On 9/27/21 2:53 PM, Maxim Kuvyrkov wrote: >> Hi Aldy, >> Your patch seems to slow down 471.omnetpp by 8% at -O3. Could you please >> take a look if this is something that could be easily fixed? > > First of all, thanks for chasing this down. It's incredibly useful to have > these types of bug reports.
Thanks, Aldy, this is music to my ears :-). We have built this automated benchmarking CI that bisects code-speed and code-size regressions down to a single commit. It is still work-in-progress, and I’m forwarding these reports to patch authors, whose patches caused regressions. If GCC community finds these useful, we can also setup posting to one of GCC’s mailing lists. > > Jeff and I have been discussing the repercussions of adjusting the loop > crossing restrictions in the various threaders. He's seen some regressions > in embedded targets when disallowing certain corner cases of loop crossing > threads causes all sorts of grief. > > Out of curiosity, does the attached (untested) patch fix the regression? I’ll test the patch and will follow up. Regards, -- Maxim Kuvyrkov https://www.linaro.org > > Aldy > >> Regards, >> -- >> Maxim Kuvyrkov >> https://www.linaro.org >>> On 27 Sep 2021, at 02:52, ci_not...@linaro.org wrote: >>> >>> After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5 >>> Author: Aldy Hernandez <al...@redhat.com> >>> >>> Avoid invalid loop transformations in jump threading registry. >>> >>> the following benchmarks slowed down by more than 2%: >>> - 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples >>> >>> Below reproducer instructions can be used to re-build both "first_bad" and >>> "last_good" cross-toolchains used in this bisection. Naturally, the >>> scripts will fail when triggerring benchmarking jobs if you don't have >>> access to Linaro TCWG CI. >>> >>> For your convenience, we have uploaded tarballs with pre-processed source >>> and assembly files at: >>> - First_bad save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/save-temps/ >>> - Last_good save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/save-temps/ >>> - Baseline save-temps: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/save-temps/ >>> >>> Configuration: >>> - Benchmark: SPEC CPU2006 >>> - Toolchain: GCC + Glibc + GNU Linker >>> - Version: all components were built from their tip of trunk >>> - Target: arm-linux-gnueabihf >>> - Compiler flags: -O3 -marm >>> - Hardware: NVidia TK1 4x Cortex-A15 >>> >>> This benchmarking CI is work-in-progress, and we welcome feedback and >>> suggestions at linaro-toolch...@lists.linaro.org . In our improvement >>> plans is to add support for SPEC CPU2017 benchmarks and provide "perf >>> report/annotate" data behind these reports. >>> >>> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, >>> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. >>> >>> This commit has regressed these CI configurations: >>> - tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3 >>> >>> First_bad build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-4a960d548b7d7d942f316c5295f6d849b74214f5/ >>> Last_good build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-29c92857039d0a105281be61c10c9e851aaeea4a/ >>> Baseline build: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/build-baseline/ >>> Even more details: >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/ >>> >>> Reproduce builds: >>> <cut> >>> mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5 >>> cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5 >>> >>> # Fetch scripts >>> git clone https://git.linaro.org/toolchain/jenkins-scripts >>> >>> # Fetch manifests and test.sh script >>> mkdir -p artifacts/manifests >>> curl -o artifacts/manifests/build-baseline.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-baseline.sh >>> --fail >>> curl -o artifacts/manifests/build-parameters.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/manifests/build-parameters.sh >>> --fail >>> curl -o artifacts/test.sh >>> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3/40/artifact/artifacts/test.sh >>> --fail >>> chmod +x artifacts/test.sh >>> >>> # Reproduce the baseline build (build all pre-requisites) >>> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh >>> >>> # Save baseline build state (which is then restored in artifacts/test.sh) >>> mkdir -p ./bisect >>> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ >>> --exclude /gcc/ ./ ./bisect/baseline/ >>> >>> cd gcc >>> >>> # Reproduce first_bad build >>> git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5 >>> ../artifacts/test.sh >>> >>> # Reproduce last_good build >>> git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a >>> ../artifacts/test.sh >>> >>> cd .. >>> </cut> >>> >>> Full commit (up to 1000 lines): >>> <cut> >>> commit 4a960d548b7d7d942f316c5295f6d849b74214f5 >>> Author: Aldy Hernandez <al...@redhat.com> >>> Date: Thu Sep 23 10:59:24 2021 +0200 >>> >>> Avoid invalid loop transformations in jump threading registry. >>> >>> My upcoming improvements to the forward jump threader make it thread >>> more aggressively. In investigating some "regressions", I noticed >>> that it has always allowed threading through empty latches and across >>> loop boundaries. As we have discussed recently, this should be avoided >>> until after loop optimizations have run their course. >>> >>> Note that this wasn't much of a problem before because DOM/VRP >>> couldn't find these opportunities, but with a smarter solver, we trip >>> over them more easily. >>> >>> Because the forward threader doesn't have an independent localized cost >>> model like the new threader (profitable_path_p), it is difficult to >>> catch these things at discovery. However, we can catch them at >>> registration time, with the added benefit that all the threaders >>> (forward and backward) can share the handcuffs. >>> >>> This patch is an adaptation of what we do in the backward threader, but >>> it is not meant to catch everything we do there, as some of the >>> restrictions there are due to limitations of the different block >>> copiers (for example, the generic copier does not re-use existing >>> threading paths). >>> >>> We could ideally remove the now redundant bits in profitable_path_p, but >>> I would prefer not to for two reasons. First, the backward threader uses >>> profitable_path_p as it discovers paths to avoid discovering paths in >>> unprofitable directions. Second, I would like to merge all the forward >>> cost restrictions into the profitability class in the backward threader, >>> not the other way around. Alas, that reshuffling will have to wait for >>> the next release. >>> >>> As usual, there are quite a few tests that needed adjustments. It seems >>> we were quite happily threading improper scenarios. With most of them, >>> as can be seen in pr77445-2.c, we're merely shifting the threading to >>> after loop optimizations. >>> >>> Tested on x86-64 Linux. >>> >>> gcc/ChangeLog: >>> >>> * tree-ssa-threadupdate.c >>> (jt_path_registry::cancel_invalid_paths): >>> New. >>> (jt_path_registry::register_jump_thread): Call >>> cancel_invalid_paths. >>> * tree-ssa-threadupdate.h (class jt_path_registry): Add >>> cancel_invalid_paths. >>> >>> gcc/testsuite/ChangeLog: >>> >>> * gcc.dg/tree-ssa/20030714-2.c: Adjust. >>> * gcc.dg/tree-ssa/pr66752-3.c: Adjust. >>> * gcc.dg/tree-ssa/pr77445-2.c: Adjust. >>> * gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust. >>> * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust. >>> * gcc.dg/vect/bb-slp-16.c: Adjust. >>> --- >>> gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++- >>> gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++--- >>> gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +- >>> gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +- >>> gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +- >>> gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 --- >>> gcc/tree-ssa-threadupdate.c | 67 >>> ++++++++++++++++++----- >>> gcc/tree-ssa-threadupdate.h | 1 + >>> 8 files changed, 78 insertions(+), 35 deletions(-) >>> >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c >>> index eb663f2ff5b..9585ff11307 100644 >>> --- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c >>> @@ -32,7 +32,8 @@ get_alias_set (t) >>> } >>> } >>> >>> -/* There should be exactly three IF conditionals if we thread jumps >>> - properly. */ >>> -/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */ >>> +/* There should be exactly 4 IF conditionals if we thread jumps >>> + properly. There used to be 3, but one thread was crossing >>> + loops. */ >>> +/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */ >>> >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c >>> index e1464e21170..922a331b217 100644 >>> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c >>> @@ -1,5 +1,5 @@ >>> /* { dg-do compile } */ >>> -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */ >>> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */ >>> >>> extern int status, pt; >>> extern int count; >>> @@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a) >>> pt--; >>> } >>> >>> -/* There are 4 jump threading opportunities, all of which will be >>> - realized, which will eliminate testing of FLAG, completely. */ >>> -/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */ >>> +/* There are 2 jump threading opportunities (which don't cross loops), >>> + all of which will be realized, which will eliminate testing of >>> + FLAG, completely. */ >>> +/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */ >>> >>> -/* There should be no assignments or references to FLAG, verify they're >>> - eliminated as early as possible. */ >>> -/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */ >>> +/* We used to remove references to FLAG by DCE2, but this was >>> + depending on early threaders threading through loop boundaries >>> + (which we shouldn't do). However, the late threading passes, which >>> + run after loop optimizations , can successfully eliminate the >>> + references to FLAG. Verify that ther are no references by the late >>> + threading passes. */ >>> +/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */ >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c >>> index f9fc212f49e..01a0f1f197d 100644 >>> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c >>> @@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) { >>> aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough >>> to change decisions in switch expansion which in turn can expose new >>> jump threading opportunities. Skip the later tests on aarch64. */ >>> -/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */ >>> -/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */ >>> +/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */ >>> +/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */ >>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */ >>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */ >>> /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target >>> { ! aarch64*-*-* } } } } */ >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c >>> index 60d4f76f076..2d78d045516 100644 >>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c >>> @@ -21,5 +21,7 @@ >>> condition. >>> >>> All the cases are picked up by VRP1 as jump threads. */ >>> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */ >>> + >>> +/* There used to be 6 jump threads found by thread1, but they all >>> + depended on threading through distinct loops in ethread. */ >>> /* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */ >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c >>> index e3d4b311c03..16abcde5053 100644 >>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c >>> @@ -1,8 +1,8 @@ >>> /* { dg-do compile } */ >>> /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats >>> -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats >>> -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ >>> >>> -/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */ >>> -/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! >>> aarch64*-*-* } } } } */ >>> +/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */ >>> +/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! >>> aarch64*-*-* } } } } */ >>> /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ >>> >>> /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough >>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c >>> b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c >>> index 664e93e9b60..e68a9b62535 100644 >>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c >>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c >>> @@ -1,8 +1,5 @@ >>> /* { dg-require-effective-target vect_int } */ >>> >>> -/* See note below as to why we disable threading. */ >>> -/* { dg-additional-options "-fdisable-tree-thread1" } */ >>> - >>> #include <stdarg.h> >>> #include "tree-vect.h" >>> >>> @@ -30,10 +27,6 @@ main1 (int dummy) >>> *pout++ = *pin++ + a; >>> *pout++ = *pin++ + a; >>> *pout++ = *pin++ + a; >>> - /* In some architectures like ppc64, jump threading may thread >>> - the iteration where i==0 such that we no longer optimize the >>> - BB. Another alternative to disable jump threading would be >>> - to wrap the read from `i' into a function returning i. */ >>> if (arr[i] = i) >>> a = i; >>> else >>> diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c >>> index baac11280fa..2b9b8f81274 100644 >>> --- a/gcc/tree-ssa-threadupdate.c >>> +++ b/gcc/tree-ssa-threadupdate.c >>> @@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool >>> may_peel_loop_headers) >>> return retval; >>> } >>> >>> +bool >>> +jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path) >>> +{ >>> + gcc_checking_assert (!path.is_empty ()); >>> + edge taken_edge = path[path.length () - 1]->e; >>> + loop_p loop = taken_edge->src->loop_father; >>> + bool seen_latch = false; >>> + bool path_crosses_loops = false; >>> + >>> + for (unsigned int i = 0; i < path.length (); i++) >>> + { >>> + edge e = path[i]->e; >>> + >>> + if (e == NULL) >>> + { >>> + // NULL outgoing edges on a path can happen for jumping to a >>> + // constant address. >>> + cancel_thread (&path, "Found NULL edge in jump threading path"); >>> + return true; >>> + } >>> + >>> + if (loop->latch == e->src || loop->latch == e->dest) >>> + seen_latch = true; >>> + >>> + // The first entry represents the block with an outgoing edge >>> + // that we will redirect to the jump threading path. Thus we >>> + // don't care about that block's loop father. >>> + if ((i > 0 && e->src->loop_father != loop) >>> + || e->dest->loop_father != loop) >>> + path_crosses_loops = true; >>> + >>> + if (flag_checking && !m_backedge_threads) >>> + gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0); >>> + } >>> + >>> + if (cfun->curr_properties & PROP_loop_opts_done) >>> + return false; >>> + >>> + if (seen_latch && empty_block_p (loop->latch)) >>> + { >>> + cancel_thread (&path, "Threading through latch before loop opts " >>> + "would create non-empty latch"); >>> + return true; >>> + } >>> + if (path_crosses_loops) >>> + { >>> + cancel_thread (&path, "Path crosses loops"); >>> + return true; >>> + } >>> + return false; >>> +} >>> + >>> /* Register a jump threading opportunity. We queue up all the jump >>> threading opportunities discovered by a pass and update the CFG >>> and SSA form all at once. >>> @@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread >>> (vec<jump_thread_edge *> *path) >>> return false; >>> } >>> >>> - /* First make sure there are no NULL outgoing edges on the jump threading >>> - path. That can happen for jumping to a constant address. */ >>> - for (unsigned int i = 0; i < path->length (); i++) >>> - { >>> - if ((*path)[i]->e == NULL) >>> - { >>> - cancel_thread (path, "Found NULL edge in jump threading path"); >>> - return false; >>> - } >>> - >>> - if (flag_checking && !m_backedge_threads) >>> - gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0); >>> - } >>> + if (cancel_invalid_paths (*path)) >>> + return false; >>> >>> if (dump_file && (dump_flags & TDF_DETAILS)) >>> dump_jump_thread_path (dump_file, *path, true); >>> diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h >>> index 8b48a671212..d68795c9f27 100644 >>> --- a/gcc/tree-ssa-threadupdate.h >>> +++ b/gcc/tree-ssa-threadupdate.h >>> @@ -75,6 +75,7 @@ protected: >>> unsigned long m_num_threaded_edges; >>> private: >>> virtual bool update_cfg (bool peel_loop_headers) = 0; >>> + bool cancel_invalid_paths (vec<jump_thread_edge *> &path); >>> jump_thread_path_allocator m_allocator; >>> // True if threading through back edges is allowed. This is only >>> // allowed in the generic copier in the backward threader. >>> </cut> > <jeff.txt>