On 6/22/24 09:15, Richard Biener wrote:
On Fri, Jun 21, 2024 at 3:02 PM Andrew MacLeod <amacl...@redhat.com> wrote:
This patch adds

      --param=vrp-block-limit=N

When the basic block counter for a function exceeded 'N' , VRP is
invoked with the new fast_vrp algorithm instead.   This algorithm uses a
lot less memory and processing power, although it does get a few less
things.

Primary motivation is cases like
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855 in which the 3  VRP
passes consume about 600 seconds of the compile time, and a lot of
memory.      With fast_vrp, it spends less than 10 seconds total in the
3 passes of VRP.     This test case has about 400,000 basic blocks.

The default for N in this patch is 150,000,  arbitrarily chosen.

This bootstraps, (and I bootstrapped it with --param=vrp-block-limit=0
as well) on x86_64-pc-linux-gnu, with no regressions.

What do you think, OK for trunk?
+      if (last_basic_block_for_fn (fun) > param_vrp_block_limit ||
+         &data == &pass_data_fast_vrp)

|| goes to the next line.

Btw, we have -Wdisabled-optimization for these cases which should
say sth like "function has excess of %d number of basic blocks
(--param vrp-block-limit=%d), using fast VRP algorithm"
or so in this case.

As I wrote in the PR the priority should be -O1 compile-time
performance and memory use.


Yeah, I just wanted to use it as a model for "bad" cases for ranger.   Adjusted patch attached which now issues the warning.

I also found that the transitive relations were causing a small blowup in time for relation processing now that I turned relations on for fast VRP.  I commited a patch and fast_vrp no longer does transitives.

If you want to experiment with enabling fast VRP at -O1, it should be fast all the time now.  I think :-)    This testcase runs in about 95 seconds on my test machine.  if I turn on VRP, a single VRP pass takes about 2.5 seconds.    Its all set up, you can just add:

NEXT_PASS (pass_fast_vrp);

at an appropriate spot.

Richard.

Andrew

PS sorry,. it doesn't help the threader in that PR :-(
It's probably one of the known bottlenecks there - IIRC the path range
queries make O(n^2) work.  I can look at this at some point as I've
dealt with the slowness of this pass in the past.

There is param_max_jump_thread_paths that should limit searching
but there is IIRC no way to limit the work path ranger actually does
when resolving the query.

Yeah, Id like to talk to Aldy about revamping the threader now that some of the newer facilities are available that fast_vrp uses.

We can calculate all the outgoing ranges for a block at once with :

  // Fill ssa-cache R with any outgoing ranges on edge E, using QUERY.
  bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL);

This is what the fast_vrp routines uses.  We can gather all range restrictions generated from an edge efficiently just once and then intersect them with a known range as we walk the different paths. We don't need the gori exports , nor any of the other on-demand bits where we calculate each export range dynamically.. I suspect it would reduce the workload and memory impact quite a bit, but I'm not really familiar with exactly how the threader uses those things.

It'd require some minor tweaking to the lazy_ssa_cache to make the bitmap of names set accessible. This  would provide similar functionality to what the gori export () routine provides.  Both relations and inferred ranges should only need to be calculated once per block as well and could/should/would be applied the same way if they are present.   I don't *think* the threader uses any of the def chains, but Aldy can chip in.

Andrew
From 15f697aad90c35e42a4416d7db6e7289c0f5aae3 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod <amacl...@redhat.com>
Date: Mon, 17 Jun 2024 11:38:46 -0400
Subject: [PATCH 2/2] Add param for bb limit to invoke fast_vrp.

If the basic block count is too high, simply use fast_vrp for all
VRP passes.

	gcc/doc/
	* invoke.texi (vrp-block-limit): Document.

	gcc/
	* params.opt (-param=vrp-block-limit): New.
	* tree-vrp.cc (fvrp_folder::execute): Invoke fast_vrp if block
	count exceeds limit.
---
 gcc/doc/invoke.texi |  3 +++
 gcc/params.opt      |  4 ++++
 gcc/tree-vrp.cc     | 16 ++++++++++++++--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c790e2f3518..80da5e9d306 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16840,6 +16840,9 @@ this parameter.  The default value of this parameter is 50.
 @item vect-induction-float
 Enable loop vectorization of floating point inductions.
 
+@item vrp-block-limit
+Maximum number of basic blocks before VRP switches to a lower memory algorithm.
+
 @item vrp-sparse-threshold
 Maximum number of basic blocks before VRP uses a sparse bitmap cache.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..c17ba17b91b 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1198,6 +1198,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=vrp-block-limit=
+Common Joined UInteger Var(param_vrp_block_limit) Init(150000) Optimization Param
+Maximum number of basic blocks before VRP switches to a fast model with less memory requirements.
+
 -param=vrp-sparse-threshold=
 Common Joined UInteger Var(param_vrp_sparse_threshold) Init(3000) Optimization Param
 Maximum number of basic blocks before VRP uses a sparse bitmap cache.
diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc
index 26979b706e5..a81fcc696b7 100644
--- a/gcc/tree-vrp.cc
+++ b/gcc/tree-vrp.cc
@@ -60,6 +60,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-cp.h"
 #include "ipa-prop.h"
 #include "attribs.h"
+#include "diagnostic-core.h"
 
 // This class is utilized by VRP and ranger to remove __builtin_unreachable
 // calls, and reflect any resulting global ranges.
@@ -1331,9 +1332,20 @@ public:
   unsigned int execute (function *fun) final override
     {
       // Check for fast vrp.
-      if (&data == &pass_data_fast_vrp)
-	return execute_fast_vrp (fun, final_p);
+      bool use_fvrp = (&data == &pass_data_fast_vrp);
+      if (!use_fvrp && last_basic_block_for_fn (fun) > param_vrp_block_limit)
+	{
+	  use_fvrp = true;
+	  warning (OPT_Wdisabled_optimization,
+		   "Using fast VRP algorithm. %d basic blocks"
+		   " exceeds %s%d limit",
+		   n_basic_blocks_for_fn (fun),
+		   "--param=vrp-block-limit=",
+		   param_vrp_block_limit);
+	}
 
+      if (use_fvrp)
+	return execute_fast_vrp (fun, final_p);
       return execute_ranger_vrp (fun, final_p);
     }
 
-- 
2.45.0

Reply via email to