Re: Lazy JIT IR code generation to increase JIT speed with partitions

Luc Vlaming Wed, 30 Dec 2020 05:23:59 -0800

On 30-12-2020 02:57, Andres Freund wrote:

Hi,


Great to see work in this area!

On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:

I would like to propose a small patch to the JIT machinery which makes the
IR code generation lazy. The reason for postponing the generation of the IR
code is that with partitions we get an explosion in the number of JIT
functions generated as many child tables are involved, each with their own
JITted functions, especially when e.g. partition-aware joins/aggregates are
enabled. However, only a fraction of those functions is actually executed
because the Parallel Append node distributes the workers among the nodes.
With the attached patch we get a lazy generation which makes that this is no
longer a problem.


I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the case in
your example.

If you e.g. look at a query like
   SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
on a table without indexes, you would end up with functions for

- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projection

with your patch each of these would be emitted separately, instead of
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation ends
up with copies of the inlined code).


As far as I can see you've basically falsified the second part of this
comment (which you moved):

+
+       /*
+        * Don't immediately emit nor actually generate the function.
+        * instead do so the first time the expression is actually evaluated.
+        * That allows to emit a lot of functions together, avoiding a lot of
+        * repeated llvm and memory remapping overhead. It also helps with not
+        * compiling functions that will never be evaluated, as can be the case
+        * if e.g. a parallel append node is distributing workers between its
+        * child nodes.
+        */

-       /*
-        * Don't immediately emit function, instead do so the first time the
-        * expression is actually evaluated. That allows to emit a lot of
-        * functions together, avoiding a lot of repeated llvm and memory
-        * remapping overhead.
-        */


Greetings,

Andres Freund


Hi,

Happy to help out, and thanks for the info and suggestions.

Also, I should have first searched psql-hackers and the like, as I just found out there is already discussions about this in [1] and [2]. However I think the approach I took can be taken independently and then other solutions could be added on top.


Assuming I understood all suggestions correctly, the ideas so far are:

1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed and not optimized several times (see [1]). Requires all code to be emitted in the same module.

2. JIT only parts of the plan, based on cost (see [2]).

3. Cache compilation results to avoid recompilation. this would either need a shm capable optimized IR cache or would not work with parallel workers.

4. Lazily jitting (this patch)

An idea that might not have been presented in the mailing list yet(?):

5. Only JIT in nodes that process a certain amount of rows. Assuming there is a constant overhead for JITting and the goal is to gain runtime.

Going forward I would first try to see if my current approach can work out. The only idea that would be counterproductive to my solution would be solution 1. Afterwards I'd like to continue with either solution 2, 5, or 3 in the hopes that we can reduce JIT overhead to a minimum and can therefore apply it more broadly.

To test out why and where the JIT performance decreased with my solution I improved the test script and added various queries to model some of the cases I think we should care about. I have not (yet) done big scale benchmarks as these queries seemed to already show enough problems for now. Now there are 4 queries which test JITting with/without partitions, and with varying amounts of workers and rowcounts. I hope these are indeed a somewhat representative set of queries.

As pointed out the current patch does create a degradation in performance wrt queries that are not partitioned (basically q3 and q4). After looking into those queries I noticed two things: - q3 is very noisy wrt JIT timings. This seems to be the result of something wrt parallel workers starting up the JITting and creating very high amounts of noise (e.g. inlining timings varying between 3.8s and 6.2s)

- q4 seems very stable with JIT timings (after the first run).

I'm wondering if this could mean that with parallel workers quite a lot of time is spent on startup of the llvm machinery and this gets noisy because of OS interaction and the like?

Either way I took q4 to try and fix the regression and noticed something interesting, given the comment from Andres: the generation and inlining time actually decreased, but the optimization and emission time increased. After trying out various things in the llvm_optimize_module function and googling a bit it seems that the LLVMPassManagerBuilderUseInlinerWithThreshold adds some very expensive passes. I tried to construct some queries where this would actually gain us but couldnt (yet).

For v2 of the patch-set the first patch slightly changes how we optimize the code, which removes the aforementioned degradations in the queries. The second patch then makes that partitions work a lot better, but interestingly now also q4 gets a lot faster but somehow q3 does not.

Because these findings contradict the suggestions/findings from Andres I'm wondering what I'm missing. I would continue and do some TPC-H like tests on top, but apart from that I'm not entirely sure where we are supposed to gain most from the call to LLVMPassManagerBuilderUseInlinerWithThreshold(). Reason is that from the scenarios I now tested it seems that the pain is actually in the code optimization and possibly rather specific passes and not necessarily in how many modules are emitted.

If there are more / better queries / datasets / statistics I can run and gather I would be glad to do so :) To me the current results seem however fairly promising.


Looking forward to your thoughts & suggestions.

With regards,
Luc
Swarm64

===================================
Results from the test script on my machine:

 parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
 query1: HEAD       - 08.088901 #runs=5 #JIT=12014
 query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
 query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
 query2: HEAD       - 17.628126 #runs=5 #JIT=24074
 query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
 query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
 query3: HEAD       - 00.220141 #runs=5 #JIT=29
 query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
 query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
 query4: HEAD       - 00.052305 #runs=100 #JIT=10
 query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
 query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3

 parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
 query1: HEAD       - 14.922044 #runs=5 #JIT=102104
 query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
 query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
 query2: HEAD       - 18.477133 #runs=5 #JIT=40122
 query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
 query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
 query3: HEAD       - 00.235587 #runs=5 #JIT=209
 query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
 query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
 query4: HEAD       - 00.052534 #runs=100 #JIT=10
 query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
 query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3

 parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
 query1: HEAD       - 12.696588 #runs=5 #JIT=102104
 query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
 query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
 query2: HEAD       - 12.091824 #runs=5 #JIT=40122
 query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
 query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
 query3: HEAD       - 00.122208 #runs=5 #JIT=209
 query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
 query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
 query4: HEAD       - 00.033125 #runs=100 #JIT=10
 query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
 query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3

 parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
 query1: HEAD       - 02.760343 #runs=5 #JIT=102104
 query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
 query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
 query2: HEAD       - 02.396965 #runs=5 #JIT=40122
 query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
 query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
 query3: HEAD       - 00.186633 #runs=5 #JIT=209
 query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
 query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
 query4: HEAD       - 00.013277 #runs=100 #JIT=10
 query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
 query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3

 parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
 query1: HEAD       - 02.339973 #runs=5 #JIT=102104
 query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
 query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
 query2: HEAD       - 02.268987 #runs=5 #JIT=40122
 query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
 query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
 query3: HEAD       - 00.084531 #runs=5 #JIT=209
 query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
 query3: HEAD+01+02 - 00.08668  #runs=5 #JIT=127
 query4: HEAD       - 00.005371 #runs=100 #JIT=10
 query4: HEAD+01    - 00.0053   #runs=100 #JIT=10
 query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3

===================================

[1] https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com [2] https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com

jit_partitions.sql
Description: application/sql

>From 59ce1e834a9adb1cd15484df6fd81015e8e8fe80 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Wed, 30 Dec 2020 11:35:12 +0100
Subject: [PATCH v2 1/2] improve jitting performance somewhat

---
 src/backend/commands/explain.c |  1 +
 src/backend/jit/llvm/llvmjit.c | 17 ++++++-----------
 src/include/jit/jit.h          |  3 +++
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 43f9b01e83..4a6a9dfb67 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -858,6 +858,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 		appendStringInfoString(es->str, "JIT:\n");
 		es->indent++;
 
+		ExplainPropertyInteger("Modules", NULL, ji->created_modules, es);
 		ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
 
 		ExplainIndentText(es);
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 9c4fc75f65..dd23a09f40 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -241,6 +241,8 @@ llvm_mutable_module(LLVMJitContext *context)
 		context->module = LLVMModuleCreateWithName("pg");
 		LLVMSetTarget(context->module, llvm_triple);
 		LLVMSetDataLayout(context->module, llvm_layout);
+
+		context->base.instr.created_modules++;
 	}
 
 	return context->module;
@@ -578,12 +580,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
 	llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
 
-	if (context->base.flags & PGJIT_OPT3)
-	{
-		/* TODO: Unscientifically determined threshold */
-		LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
-	}
-	else
+	if (!(context->base.flags & PGJIT_OPT3))
 	{
 		/* we rely on mem2reg heavily, so emit even in the O0 case */
 		LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
@@ -611,11 +608,9 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
 													llvm_mpm);
 	/* always use always-inliner pass */
-	if (!(context->base.flags & PGJIT_OPT3))
-		LLVMAddAlwaysInlinerPass(llvm_mpm);
-	/* if doing inlining, but no expensive optimization, add inlining pass */
-	if (context->base.flags & PGJIT_INLINE
-		&& !(context->base.flags & PGJIT_OPT3))
+	LLVMAddAlwaysInlinerPass(llvm_mpm);
+	/* if doing inlining, add inlining pass */
+	if (context->base.flags & PGJIT_INLINE)
 		LLVMAddFunctionInliningPass(llvm_mpm);
 	LLVMRunPassManager(llvm_mpm, context->module);
 	LLVMDisposePassManager(llvm_mpm);
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index 5a77f7d4a0..a0c29a76c8 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -26,6 +26,9 @@
 
 typedef struct JitInstrumentation
 {
+	/* number of emitted modules */
+	size_t		created_modules;
+
 	/* number of emitted functions */
 	size_t		created_functions;
 
-- 
2.25.1

>From e9ce1a0f066fd7d1bd9e5a66863dcb79e67c552f Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Mon, 28 Dec 2020 09:01:32 +0100
Subject: [PATCH v2 2/2] generate JIT IR code lazily

---
 src/backend/jit/llvm/llvmjit_expr.c | 96 +++++++++++++++++------------
 1 file changed, 57 insertions(+), 39 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 3aa08a9743..7483803d4a 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
 } CompiledExprState;
 
 
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 
 static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,64 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
 					   lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
 					   ((LLVMValueRef[]){__VA_ARGS__}))
 
-
 /*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
  */
 bool
 llvm_compile_expr(ExprState *state)
 {
 	PlanState  *parent = state->parent;
-	char	   *funcname;
-
 	LLVMJitContext *context = NULL;
 
+
+	/*
+	 * Right now we don't support compiling expressions without a parent, as
+	 * we need access to the EState.
+	 */
+	Assert(parent);
+
+	llvm_enter_fatal_on_oom();
+
+	/* get or create JIT context */
+	if (parent->state->es_jit)
+		context = (LLVMJitContext *) parent->state->es_jit;
+	else
+	{
+		context = llvm_create_context(parent->state->es_jit_flags);
+		parent->state->es_jit = &context->base;
+	}
+
+	/*
+	 * Don't immediately emit nor actually generate the function.
+	 * Instead do so the first time the expression is actually evaluated.
+	 * This helps with not compiling functions that will never be evaluated,
+	 * as can be the case if e.g. a parallel append node is distributing
+	 * workers between its child nodes.
+	 */
+	{
+
+		CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+		cstate->context = context;
+
+		state->evalfunc = ExecCompileExpr;
+		state->evalfunc_private = cstate;
+	}
+
+	llvm_leave_fatal_on_oom();
+
+	return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+	CompiledExprState *cstate = state->evalfunc_private;
+	LLVMJitContext *context = cstate->context;
+
 	LLVMBuilderRef b;
 	LLVMModuleRef mod;
 	LLVMValueRef eval_fn;
@@ -125,31 +172,16 @@ llvm_compile_expr(ExprState *state)
 
 	llvm_enter_fatal_on_oom();
 
-	/*
-	 * Right now we don't support compiling expressions without a parent, as
-	 * we need access to the EState.
-	 */
-	Assert(parent);
-
-	/* get or create JIT context */
-	if (parent->state->es_jit)
-		context = (LLVMJitContext *) parent->state->es_jit;
-	else
-	{
-		context = llvm_create_context(parent->state->es_jit_flags);
-		parent->state->es_jit = &context->base;
-	}
-
 	INSTR_TIME_SET_CURRENT(starttime);
 
 	mod = llvm_mutable_module(context);
 
 	b = LLVMCreateBuilder();
 
-	funcname = llvm_expand_funcname(context, "evalexpr");
+	cstate->funcname = llvm_expand_funcname(context, "evalexpr");
 
 	/* create function */
-	eval_fn = LLVMAddFunction(mod, funcname,
+	eval_fn = LLVMAddFunction(mod, cstate->funcname,
 							  llvm_pg_var_func_type("TypeExprStateEvalFunc"));
 	LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
 	LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2350,30 +2382,16 @@ llvm_compile_expr(ExprState *state)
 
 	LLVMDisposeBuilder(b);
 
-	/*
-	 * Don't immediately emit function, instead do so the first time the
-	 * expression is actually evaluated. That allows to emit a lot of
-	 * functions together, avoiding a lot of repeated llvm and memory
-	 * remapping overhead.
-	 */
-	{
-
-		CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
-		cstate->context = context;
-		cstate->funcname = funcname;
-
-		state->evalfunc = ExecRunCompiledExpr;
-		state->evalfunc_private = cstate;
-	}
-
 	llvm_leave_fatal_on_oom();
 
 	INSTR_TIME_SET_CURRENT(endtime);
 	INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
 						  endtime, starttime);
 
-	return true;
+	/* remove indirection via this function for future calls */
+	state->evalfunc = ExecRunCompiledExpr;
+
+	return ExecRunCompiledExpr(state, econtext, isNull);
 }
 
 /*
-- 
2.25.1

Re: Lazy JIT IR code generation to increase JIT speed with partitions

Reply via email to