Hello.

I'd like to make the MergeAppend node async-capable, like the Append node. Currently, when the planner chooses a MergeAppend plan, asynchronous execution is not possible. With the attached patches you can see plans like

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Merge Append
   Sort Key: async_pt.b, async_pt.a
   ->  Async Foreign Scan on public.async_p1 async_pt_1
         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
   ->  Async Foreign Scan on public.async_p2 async_pt_2
         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST

This can be quite profitable: in our test cases, async MergeAppend execution on remote servers was up to two times faster.

The code for asynchronous execution in MergeAppend was mostly borrowed from the Append node.

What differs significantly: ExecMergeAppendAsyncGetNext() must return a tuple from the specified slot, as the subplan number determines the tuple slot the data should be retrieved into. When a subplan is ready to provide data, its result is cached in ms_asyncresults. When we get a tuple for the subplan specified in ExecMergeAppendAsyncGetNext(), ExecMergeAppendAsyncRequest() returns true and the loop in ExecMergeAppendAsyncGetNext() ends. We may fetch data only for subplans which either don't have a cached result ready or have already returned it to the upper node; this is tracked in ms_has_asyncresults. As data for a subplan can arrive either before or after the loop in ExecMergeAppendAsyncRequest(), we check this flag twice in that function.
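To make the "check twice" point concrete, here is a simplified, self-contained sketch of that bookkeeping. The toy_* names, plain bit masks, and int "tuples" are mine, standing in for the executor's Bitmapsets, TupleTableSlots and ExecAsyncRequest(); in this model a request completes immediately, which is exactly the case that makes the second check necessary:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model: bit i of each mask stands for subplan i. */
typedef struct MergeAsyncState
{
	unsigned	has_result;		/* like ms_has_asyncresults: cached tuple ready */
	unsigned	need_request;	/* like ms_needrequest: may issue a new request */
	int			results[32];	/* like ms_asyncresults: cached "tuples" */
} MergeAsyncState;

/* Stand-in for ExecAsyncRequest(): pretend subplan i answers at once. */
static void
toy_async_request(MergeAsyncState *st, int i, int value)
{
	st->results[i] = value;
	st->has_result |= 1u << i;
}

/*
 * Modeled after ExecMergeAppendAsyncRequest(): return true if a tuple for
 * subplan 'mplan' is available.  The cache is consulted both before and
 * after firing requests, because the answer may arrive inside the loop.
 */
static bool
toy_merge_append_request(MergeAsyncState *st, int mplan, int *out)
{
	unsigned	pending;
	int			i;

	/* First check: data may have been fetched by an earlier call. */
	if (st->has_result & (1u << mplan))
	{
		*out = st->results[mplan];
		return true;
	}

	/* Ask only subplans that need a request and have no cached result. */
	pending = st->need_request & ~st->has_result;
	if (pending == 0)
		return false;
	st->need_request &= ~pending;

	for (i = 0; i < 32; i++)
		if (pending & (1u << i))
			toy_async_request(st, i, 100 + i);

	/* Second check: our subplan may have answered during the loop above. */
	if (st->has_result & (1u << mplan))
	{
		*out = st->results[mplan];
		return true;
	}
	return false;
}
```
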
Unlike ExecAppendAsyncEventWait(), ExecMergeAppendAsyncEventWait() doesn't seem to need a timeout: there's no need to get a result from a synchronous subplan when a tuple from an async one was explicitly requested.

We also had to fix postgres_fdw to avoid looking directly at Append fields. The accessors to Append fields may look strange, but they allow us to avoid some code duplication. I suppose duplication could be reduced even further by reworking the async Append implementation, but so far I haven't tried that, to avoid a big diff from master.
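The patch calls GetAppendEventSet() and GetNeedRequest() in postgres_fdw but their definitions fall outside the hunks shown. A plausible shape for such an accessor, sketched here as a self-contained miniature (the node tags and field names mirror the real executor structs, but the structs themselves are stripped down to the relevant field), is a dispatch on the requestor's node tag:

```c
#include <assert.h>

typedef enum { T_AppendState, T_MergeAppendState } NodeTag;

typedef struct WaitEventSet WaitEventSet;	/* opaque, as in the backend */

/* Stripped-down stand-ins for the real PlanState structs. */
typedef struct AppendState
{
	NodeTag		tag;
	WaitEventSet *as_eventset;
} AppendState;

typedef struct MergeAppendState
{
	NodeTag		tag;
	WaitEventSet *ms_eventset;
} MergeAppendState;

/*
 * Hypothetical accessor: hide from the FDW which Append variant owns the
 * event set, so postgres_fdw need not assume its requestor is an Append.
 */
static WaitEventSet *
GetAppendEventSet(void *requestor)
{
	NodeTag		tag = *(NodeTag *) requestor;

	if (tag == T_AppendState)
		return ((AppendState *) requestor)->as_eventset;
	assert(tag == T_MergeAppendState);
	return ((MergeAppendState *) requestor)->ms_eventset;
}
```

A GetNeedRequest() accessor would presumably dispatch the same way over as_needrequest/ms_needrequest.
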

Also, mark_async_capable() assumes that the path corresponds to the plan. This is not necessarily true when create_[merge_]append_plan() inserts a Sort node. In that case mark_async_capable() can treat the Sort plan node as something else and crash, so there's a small fix for this.
--
Best regards,
Alexander Pyhalov,
Postgres Professional
From 79e40d6bcdfa96435c06d2c668efd76f3bcd5325 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov <a.pyha...@postgrespro.ru>
Date: Fri, 21 Jul 2023 12:05:07 +0300
Subject: [PATCH 1/2] mark_async_capable(): subpath should match subplan

---
 src/backend/optimizer/plan/createplan.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 6b64c4a362d..c2e879815b1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1146,10 +1146,10 @@ mark_async_capable_plan(Plan *plan, Path *path)
 				SubqueryScan *scan_plan = (SubqueryScan *) plan;
 
 				/*
-				 * If the generated plan node includes a gating Result node,
-				 * we can't execute it asynchronously.
+				 * If the generated plan node includes a gating Result node or
+				 * a Sort node, we can't execute it asynchronously.
 				 */
-				if (IsA(plan, Result))
+				if (IsA(plan, Result) || IsA(plan, Sort))
 					return false;
 
 				/*
@@ -1167,10 +1167,10 @@ mark_async_capable_plan(Plan *plan, Path *path)
 				FdwRoutine *fdwroutine = path->parent->fdwroutine;
 
 				/*
-				 * If the generated plan node includes a gating Result node,
-				 * we can't execute it asynchronously.
+				 * If the generated plan node includes a gating Result node or
+				 * a Sort node, we can't execute it asynchronously.
 				 */
-				if (IsA(plan, Result))
+				if (IsA(plan, Result) || IsA(plan, Sort))
 					return false;
 
 				Assert(fdwroutine != NULL);
@@ -1183,9 +1183,9 @@ mark_async_capable_plan(Plan *plan, Path *path)
 
 			/*
 			 * If the generated plan node includes a Result node for the
-			 * projection, we can't execute it asynchronously.
+			 * projection or a Sort node, we can't execute it asynchronously.
 			 */
-			if (IsA(plan, Result))
+			if (IsA(plan, Result) || IsA(plan, Sort))
 				return false;
 
 			/*
-- 
2.34.1

From 74546b7eebcb61b5c16239a056bcee308cdef09e Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov <a.pyha...@postgrespro.ru>
Date: Wed, 19 Jul 2023 15:42:24 +0300
Subject: [PATCH 2/2] MergeAppend should support Async Foreign Scan subplans

---
 .../postgres_fdw/expected/postgres_fdw.out    | 100 ++++
 contrib/postgres_fdw/postgres_fdw.c           |  10 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  15 +
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/executor/execAsync.c              |   4 +
 src/backend/executor/nodeMergeAppend.c        | 458 +++++++++++++++++-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/optimizer/plan/createplan.c       |   8 +
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/executor/nodeMergeAppend.h        |   1 +
 src/include/nodes/execnodes.h                 |  56 +++
 src/include/optimizer/cost.h                  |   1 +
 src/test/regress/expected/sysviews.out        |   3 +-
 14 files changed, 676 insertions(+), 6 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0cc77190dc4..f93ea2e05a7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -11154,6 +11154,46 @@ SELECT * FROM result_tbl ORDER BY a;
 (2 rows)
 
 DELETE FROM result_tbl;
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
+                                                          QUERY PLAN                                                          
+------------------------------------------------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: async_pt.b, async_pt.a
+   ->  Async Foreign Scan on public.async_p1 async_pt_1
+         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+         Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+   ->  Async Foreign Scan on public.async_p2 async_pt_2
+         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+         Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+(8 rows)
+
+SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
+  a   |  b  |  c   
+------+-----+------
+ 1000 |   0 | 0000
+ 2000 |   0 | 0000
+ 1100 | 100 | 0100
+ 2100 | 100 | 0100
+ 1200 | 200 | 0200
+ 2200 | 200 | 0200
+ 1300 | 300 | 0300
+ 2300 | 300 | 0300
+ 1400 | 400 | 0400
+ 2400 | 400 | 0400
+ 1500 | 500 | 0500
+ 2500 | 500 | 0500
+ 1600 | 600 | 0600
+ 2600 | 600 | 0600
+ 1700 | 700 | 0700
+ 2700 | 700 | 0700
+ 1800 | 800 | 0800
+ 2800 | 800 | 0800
+ 1900 | 900 | 0900
+ 2900 | 900 | 0900
+(20 rows)
+
 -- Test error handling, if accessing one of the foreign partitions errors out
 CREATE FOREIGN TABLE async_p_broken PARTITION OF async_pt FOR VALUES FROM (10000) TO (10001)
   SERVER loopback OPTIONS (table_name 'non_existent_table');
@@ -11197,6 +11237,35 @@ SELECT * FROM result_tbl ORDER BY a;
 (3 rows)
 
 DELETE FROM result_tbl;
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+                                              QUERY PLAN                                              
+------------------------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: async_pt.b, async_pt.a
+   ->  Async Foreign Scan on public.async_p1 async_pt_1
+         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+         Filter: (async_pt_1.b === 505)
+         Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+   ->  Async Foreign Scan on public.async_p2 async_pt_2
+         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+         Filter: (async_pt_2.b === 505)
+         Remote SQL: SELECT a, b, c FROM public.base_tbl2 ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+   ->  Async Foreign Scan on public.async_p3 async_pt_3
+         Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+         Filter: (async_pt_3.b === 505)
+         Remote SQL: SELECT a, b, c FROM public.base_tbl3 ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+(14 rows)
+
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+  a   |  b  |  c   
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
 DROP FOREIGN TABLE async_p3;
 DROP TABLE base_tbl3;
 -- Check case where the partitioned table has local/remote partitions
@@ -11232,6 +11301,37 @@ SELECT * FROM result_tbl ORDER BY a;
 (3 rows)
 
 DELETE FROM result_tbl;
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+                                              QUERY PLAN                                              
+------------------------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: async_pt.b, async_pt.a
+   ->  Async Foreign Scan on public.async_p1 async_pt_1
+         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
+         Filter: (async_pt_1.b === 505)
+         Remote SQL: SELECT a, b, c FROM public.base_tbl1 ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+   ->  Async Foreign Scan on public.async_p2 async_pt_2
+         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
+         Filter: (async_pt_2.b === 505)
+         Remote SQL: SELECT a, b, c FROM public.base_tbl2 ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
+   ->  Sort
+         Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+         Sort Key: async_pt_3.b, async_pt_3.a
+         ->  Seq Scan on public.async_p3 async_pt_3
+               Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
+               Filter: (async_pt_3.b === 505)
+(16 rows)
+
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+  a   |  b  |  c   
+------+-----+------
+ 1505 | 505 | 0505
+ 2505 | 505 | 0505
+ 3505 | 505 | 0505
+(3 rows)
+
 -- partitionwise joins
 SET enable_partitionwise_join TO true;
 CREATE TABLE join_tbl (a1 int, b1 int, c1 text, a2 int, b2 int, c2 text);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 0bb9a5ae8f6..a8ad0eb1d04 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7252,12 +7252,16 @@ postgresForeignAsyncConfigureWait(AsyncRequest *areq)
 	ForeignScanState *node = (ForeignScanState *) areq->requestee;
 	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
 	AsyncRequest *pendingAreq = fsstate->conn_state->pendingAreq;
-	AppendState *requestor = (AppendState *) areq->requestor;
-	WaitEventSet *set = requestor->as_eventset;
+	PlanState  *requestor = areq->requestor;
+	WaitEventSet *set;
+	Bitmapset  *needrequest;
 
 	/* This should not be called unless callback_pending */
 	Assert(areq->callback_pending);
 
+	set = GetAppendEventSet(requestor);
+	needrequest = GetNeedRequest(requestor);
+
 	/*
 	 * If process_pending_request() has been invoked on the given request
 	 * before we get here, we might have some tuples already; in which case
@@ -7295,7 +7299,7 @@ postgresForeignAsyncConfigureWait(AsyncRequest *areq)
 		 * below, because we might otherwise end up with no configured events
 		 * other than the postmaster death event.
 		 */
-		if (!bms_is_empty(requestor->as_needrequest))
+		if (!bms_is_empty(needrequest))
 			return;
 		if (GetNumRegisteredWaitEvents(set) > 1)
 			return;
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index b57f8cfda68..74400693c08 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3764,6 +3764,11 @@ INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
 SELECT * FROM result_tbl ORDER BY a;
 DELETE FROM result_tbl;
 
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
+SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
+
 -- Test error handling, if accessing one of the foreign partitions errors out
 CREATE FOREIGN TABLE async_p_broken PARTITION OF async_pt FOR VALUES FROM (10000) TO (10001)
   SERVER loopback OPTIONS (table_name 'non_existent_table');
@@ -3784,6 +3789,11 @@ INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
 SELECT * FROM result_tbl ORDER BY a;
 DELETE FROM result_tbl;
 
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+
 DROP FOREIGN TABLE async_p3;
 DROP TABLE base_tbl3;
 
@@ -3799,6 +3809,11 @@ INSERT INTO result_tbl SELECT * FROM async_pt WHERE b === 505;
 SELECT * FROM result_tbl ORDER BY a;
 DELETE FROM result_tbl;
 
+-- Test MergeAppend
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+SELECT * FROM async_pt WHERE b === 505 ORDER BY b, a;
+
 -- partitionwise joins
 SET enable_partitionwise_join TO true;
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b14c5d81a15..77b03723dd7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5310,6 +5310,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-async-merge-append" xreflabel="enable_async_merge_append">
+      <term><varname>enable_async_merge_append</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_async_merge_append</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of async-aware
+        merge append plan types. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-bitmapscan" xreflabel="enable_bitmapscan">
       <term><varname>enable_bitmapscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 6129b803370..9d4a4bfa7c1 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -17,6 +17,7 @@
 #include "executor/execAsync.h"
 #include "executor/executor.h"
 #include "executor/nodeAppend.h"
+#include "executor/nodeMergeAppend.h"
 #include "executor/nodeForeignscan.h"
 
 /*
@@ -121,6 +122,9 @@ ExecAsyncResponse(AsyncRequest *areq)
 		case T_AppendState:
 			ExecAsyncAppendResponse(areq);
 			break;
+		case T_MergeAppendState:
+			ExecAsyncMergeAppendResponse(areq);
+			break;
 		default:
 			/* If the node doesn't support async, caller messed up. */
 			elog(ERROR, "unrecognized node type: %d",
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7a..7743368c20d 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -39,10 +39,15 @@
 #include "postgres.h"
 
 #include "executor/executor.h"
+#include "executor/execAsync.h"
 #include "executor/execPartition.h"
 #include "executor/nodeMergeAppend.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
+#include "storage/latch.h"
+#include "utils/wait_event.h"
+
+#define EVENT_BUFFER_SIZE                     16
 
 /*
  * We have one slot for each item in the heap array.  We use SlotNumber
@@ -54,6 +59,12 @@ typedef int32 SlotNumber;
 static TupleTableSlot *ExecMergeAppend(PlanState *pstate);
 static int	heap_compare_slots(Datum a, Datum b, void *arg);
 
+static void classify_matching_subplans(MergeAppendState *node);
+static void ExecMergeAppendAsyncBegin(MergeAppendState *node);
+static void ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan);
+static bool ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan);
+static void ExecMergeAppendAsyncEventWait(MergeAppendState *node);
+
 
 /* ----------------------------------------------------------------
  *		ExecInitMergeAppend
@@ -70,6 +81,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	int			nplans;
 	int			i,
 				j;
+	Bitmapset  *asyncplans;
+	int			nasyncplans;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -104,7 +117,10 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 		 * later calls to ExecFindMatchingSubPlans.
 		 */
 		if (!prunestate->do_exec_prune && nplans > 0)
+		{
 			mergestate->ms_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+			mergestate->ms_valid_subplans_identified = true;
+		}
 	}
 	else
 	{
@@ -117,6 +133,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 		Assert(nplans > 0);
 		mergestate->ms_valid_subplans = validsubplans =
 			bms_add_range(NULL, 0, nplans - 1);
+		mergestate->ms_valid_subplans_identified = true;
 		mergestate->ms_prune_state = NULL;
 	}
 
@@ -145,16 +162,69 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 * the results into the mergeplanstates array.
 	 */
 	j = 0;
+	asyncplans = NULL;
+	nasyncplans = 0;
+
 	i = -1;
 	while ((i = bms_next_member(validsubplans, i)) >= 0)
 	{
 		Plan	   *initNode = (Plan *) list_nth(node->mergeplans, i);
 
+		/*
+		 * Record async subplans.  When executing EvalPlanQual, we treat them
+		 * as sync ones; don't do this when initializing an EvalPlanQual plan
+		 * tree.
+		 */
+		if (initNode->async_capable && estate->es_epq_active == NULL)
+		{
+			asyncplans = bms_add_member(asyncplans, j);
+			nasyncplans++;
+		}
+
 		mergeplanstates[j++] = ExecInitNode(initNode, estate, eflags);
 	}
 
 	mergestate->ps.ps_ProjInfo = NULL;
 
+	/* Initialize async state */
+	mergestate->ms_asyncplans = asyncplans;
+	mergestate->ms_nasyncplans = nasyncplans;
+	mergestate->ms_asyncrequests = NULL;
+	mergestate->ms_asyncresults = NULL;
+	mergestate->ms_has_asyncresults = NULL;
+	mergestate->ms_asyncremain = NULL;
+	mergestate->ms_needrequest = NULL;
+	mergestate->ms_eventset = NULL;
+	mergestate->ms_valid_asyncplans = NULL;
+
+	if (nasyncplans > 0)
+	{
+		mergestate->ms_asyncrequests = (AsyncRequest **)
+			palloc0(nplans * sizeof(AsyncRequest *));
+
+		i = -1;
+		while ((i = bms_next_member(asyncplans, i)) >= 0)
+		{
+			AsyncRequest *areq;
+
+			areq = palloc(sizeof(AsyncRequest));
+			areq->requestor = (PlanState *) mergestate;
+			areq->requestee = mergeplanstates[i];
+			areq->request_index = i;
+			areq->callback_pending = false;
+			areq->request_complete = false;
+			areq->result = NULL;
+
+			mergestate->ms_asyncrequests[i] = areq;
+		}
+
+		mergestate->ms_asyncresults = (TupleTableSlot **)
+			palloc0(nplans * sizeof(TupleTableSlot *));
+
+		if (mergestate->ms_valid_subplans_identified)
+			classify_matching_subplans(mergestate);
+	}
+
 	/*
 	 * initialize sort-key information
 	 */
@@ -216,9 +286,16 @@ ExecMergeAppend(PlanState *pstate)
 		 * run-time pruning is disabled then the valid subplans will always be
 		 * set to all subplans.
 		 */
-		if (node->ms_valid_subplans == NULL)
+		if (!node->ms_valid_subplans_identified)
+		{
 			node->ms_valid_subplans =
 				ExecFindMatchingSubPlans(node->ms_prune_state, false);
+			node->ms_valid_subplans_identified = true;
+		}
+
+		/* If there are any async subplans, begin executing them. */
+		if (node->ms_nasyncplans > 0)
+			ExecMergeAppendAsyncBegin(node);
 
 		/*
 		 * First time through: pull the first tuple from each valid subplan,
@@ -231,6 +308,16 @@ ExecMergeAppend(PlanState *pstate)
 			if (!TupIsNull(node->ms_slots[i]))
 				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
 		}
+
+		/* Look at async subplans */
+		i = -1;
+		while ((i = bms_next_member(node->ms_asyncplans, i)) >= 0)
+		{
+			ExecMergeAppendAsyncGetNext(node, i);
+			if (!TupIsNull(node->ms_slots[i]))
+				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
+		}
+
 		binaryheap_build(node->ms_heap);
 		node->ms_initialized = true;
 	}
@@ -245,7 +332,13 @@ ExecMergeAppend(PlanState *pstate)
 		 * to not pull tuples until necessary.)
 		 */
 		i = DatumGetInt32(binaryheap_first(node->ms_heap));
-		node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
+		if (bms_is_member(i, node->ms_asyncplans))
+			ExecMergeAppendAsyncGetNext(node, i);
+		else
+		{
+			Assert(bms_is_member(i, node->ms_valid_subplans));
+			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
+		}
 		if (!TupIsNull(node->ms_slots[i]))
 			binaryheap_replace_first(node->ms_heap, Int32GetDatum(i));
 		else
@@ -261,6 +354,8 @@ ExecMergeAppend(PlanState *pstate)
 	{
 		i = DatumGetInt32(binaryheap_first(node->ms_heap));
 		result = node->ms_slots[i];
+		/*  For async plan record that we can get the next tuple */
+		node->ms_has_asyncresults = bms_del_member(node->ms_has_asyncresults, i);
 	}
 
 	return result;
@@ -340,6 +435,7 @@ void
 ExecReScanMergeAppend(MergeAppendState *node)
 {
 	int			i;
+	int			nasyncplans = node->ms_nasyncplans;
 
 	/*
 	 * If any PARAM_EXEC Params used in pruning expressions have changed, then
@@ -350,8 +446,11 @@ ExecReScanMergeAppend(MergeAppendState *node)
 		bms_overlap(node->ps.chgParam,
 					node->ms_prune_state->execparamids))
 	{
+		node->ms_valid_subplans_identified = false;
 		bms_free(node->ms_valid_subplans);
 		node->ms_valid_subplans = NULL;
+		bms_free(node->ms_valid_asyncplans);
+		node->ms_valid_asyncplans = NULL;
 	}
 
 	for (i = 0; i < node->ms_nplans; i++)
@@ -372,6 +471,361 @@ ExecReScanMergeAppend(MergeAppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
+
+	/* Reset async state */
+	if (nasyncplans > 0)
+	{
+		i = -1;
+		while ((i = bms_next_member(node->ms_asyncplans, i)) >= 0)
+		{
+			AsyncRequest *areq = node->ms_asyncrequests[i];
+
+			areq->callback_pending = false;
+			areq->request_complete = false;
+			areq->result = NULL;
+		}
+
+		bms_free(node->ms_asyncremain);
+		node->ms_asyncremain = NULL;
+		bms_free(node->ms_needrequest);
+		node->ms_needrequest = NULL;
+		bms_free(node->ms_has_asyncresults);
+		node->ms_has_asyncresults = NULL;
+	}
 	binaryheap_reset(node->ms_heap);
 	node->ms_initialized = false;
 }
+
+/* ----------------------------------------------------------------
+ *              classify_matching_subplans
+ *
+ *              Classify the node's ms_valid_subplans into sync ones and
+ *              async ones, adjust it to contain sync ones only, and save
+ *              async ones in the node's ms_valid_asyncplans.
+ * ----------------------------------------------------------------
+ */
+static void
+classify_matching_subplans(MergeAppendState *node)
+{
+	Bitmapset  *valid_asyncplans;
+
+	Assert(node->ms_valid_subplans_identified);
+	Assert(node->ms_valid_asyncplans == NULL);
+
+	/* Nothing to do if there are no valid subplans. */
+	if (bms_is_empty(node->ms_valid_subplans))
+	{
+		node->ms_asyncremain = NULL;
+		return;
+	}
+
+	/* Nothing to do if there are no valid async subplans. */
+	if (!bms_overlap(node->ms_valid_subplans, node->ms_asyncplans))
+	{
+		node->ms_asyncremain = NULL;
+		return;
+	}
+
+	/* Get valid async subplans. */
+	valid_asyncplans = bms_intersect(node->ms_asyncplans,
+								   node->ms_valid_subplans);
+
+	/* Adjust the valid subplans to contain sync subplans only. */
+	node->ms_valid_subplans = bms_del_members(node->ms_valid_subplans,
+											  valid_asyncplans);
+
+	/* Save valid async subplans. */
+	node->ms_valid_asyncplans = valid_asyncplans;
+}
+
+/* ----------------------------------------------------------------
+ *              ExecMergeAppendAsyncBegin
+ *
+ *              Begin executing designed async-capable subplans.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecMergeAppendAsyncBegin(MergeAppendState *node)
+{
+	int			i;
+
+	/* Backward scan is not supported by async-aware MergeAppends. */
+	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+
+	/* We should never be called when there are no subplans */
+	Assert(node->ms_nplans > 0);
+
+	/* We should never be called when there are no async subplans. */
+	Assert(node->ms_nasyncplans > 0);
+
+	/* If we've yet to determine the valid subplans then do so now. */
+	if (!node->ms_valid_subplans_identified)
+	{
+		node->ms_valid_subplans =
+			ExecFindMatchingSubPlans(node->ms_prune_state, false);
+		node->ms_valid_subplans_identified = true;
+
+		classify_matching_subplans(node);
+	}
+
+	/* Initialize state variables. */
+	node->ms_asyncremain = bms_copy(node->ms_valid_asyncplans);
+
+	/* Nothing to do if there are no valid async subplans. */
+	if (bms_is_empty(node->ms_asyncremain))
+		return;
+
+	/* Make a request for each of the valid async subplans. */
+	i = -1;
+	while ((i = bms_next_member(node->ms_valid_asyncplans, i)) >= 0)
+	{
+		AsyncRequest *areq = node->ms_asyncrequests[i];
+
+		Assert(areq->request_index == i);
+		Assert(!areq->callback_pending);
+
+		/* Do the actual work. */
+		ExecAsyncRequest(areq);
+	}
+}
+
+/* ----------------------------------------------------------------
+ *              ExecMergeAppendAsyncGetNext
+ *
+ *              Get the next tuple from specified asynchronous subplan.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan)
+{
+	node->ms_slots[mplan] = NULL;
+
+	/* Request a tuple asynchronously. */
+	if (ExecMergeAppendAsyncRequest(node, mplan))
+		return;
+
+	/*
+	 * node->ms_asyncremain can be NULL if we have fetched tuples, but haven't
+	 * returned them yet. In this case ExecMergeAppendAsyncRequest() above just
+	 * returns tuples without performing a request.
+	 */
+	while (bms_is_member(mplan, node->ms_asyncremain))
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Wait or poll for async events. */
+		ExecMergeAppendAsyncEventWait(node);
+
+		/* Request a tuple asynchronously. */
+		if (ExecMergeAppendAsyncRequest(node, mplan))
+			return;
+
+		/*
+		 * Waiting until there's no async requests pending or we got some
+		 * tuples from our request
+		 */
+	}
+
+	/* No tuples */
+	return;
+}
+
+/* ----------------------------------------------------------------
+ *              ExecMergeAppendAsyncRequest
+ *
+ *              Request a tuple asynchronously.
+ * ----------------------------------------------------------------
+ */
+static bool
+ExecMergeAppendAsyncRequest(MergeAppendState *node, int mplan)
+{
+	Bitmapset  *needrequest;
+	int			i;
+
+	/*
+	 * If we've already fetched necessary data, just return it
+	 */
+	if (bms_is_member(mplan, node->ms_has_asyncresults))
+	{
+		node->ms_slots[mplan] = node->ms_asyncresults[mplan];
+		return true;
+	}
+
+	/*
+	 * Get a list of members which can process request and don't have data
+	 * ready.
+	 */
+	needrequest = NULL;
+	i = -1;
+	while ((i = bms_next_member(node->ms_needrequest, i)) >= 0)
+	{
+		if (!bms_is_member(i, node->ms_has_asyncresults))
+			needrequest = bms_add_member(needrequest, i);
+	}
+
+	/*
+	 * If there's no members, which still need request, no need to send it.
+	 */
+	if (bms_is_empty(needrequest))
+		return false;
+
+	/* Clear ms_needrequest flag, as we are going to send requests now */
+	node->ms_needrequest = bms_del_members(node->ms_needrequest, needrequest);
+
+	/* Make a new request for each of the async subplans that need it. */
+	i = -1;
+	while ((i = bms_next_member(needrequest, i)) >= 0)
+	{
+		AsyncRequest *areq = node->ms_asyncrequests[i];
+
+		/*
+		 * We've just checked that subplan doesn't already have some fetched
+		 * data
+		 */
+		Assert(!bms_is_member(i, node->ms_has_asyncresults));
+
+		/* Do the actual work. */
+		ExecAsyncRequest(areq);
+	}
+	bms_free(needrequest);
+
+	/* Return needed asynchronously-generated results if any. */
+	if (bms_is_member(mplan, node->ms_has_asyncresults))
+	{
+		node->ms_slots[mplan] = node->ms_asyncresults[mplan];
+		return true;
+	}
+
+	return false;
+}
+
+/* ----------------------------------------------------------------
+ *              ExecAsyncMergeAppendResponse
+ *
+ *              Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncMergeAppendResponse(AsyncRequest *areq)
+{
+	MergeAppendState *node = (MergeAppendState *) areq->requestor;
+	TupleTableSlot *slot = areq->result;
+
+	/* The result should be a TupleTableSlot or NULL. */
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+	Assert(!bms_is_member(areq->request_index, node->ms_has_asyncresults));
+
+	node->ms_asyncresults[areq->request_index] = NULL;
+	/* Nothing to do if the request is pending. */
+	if (!areq->request_complete)
+	{
+		/* The request would have been pending for a callback. */
+		Assert(areq->callback_pending);
+		return;
+	}
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+	{
+		/* The ending subplan wouldn't have been pending for a callback. */
+		Assert(!areq->callback_pending);
+		node->ms_asyncremain = bms_del_member(node->ms_asyncremain, areq->request_index);
+		return;
+	}
+
+	node->ms_has_asyncresults = bms_add_member(node->ms_has_asyncresults, areq->request_index);
+	/* Save result so we can return it. */
+	node->ms_asyncresults[areq->request_index] = slot;
+
+	/*
+	 * Mark the subplan that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete.
+	 */
+	node->ms_needrequest = bms_add_member(node->ms_needrequest,
+										  areq->request_index);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecMergeAppendAsyncEventWait
+ *
+ *		Wait or poll for file descriptor events and fire callbacks.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecMergeAppendAsyncEventWait(MergeAppendState *node)
+{
+	int			nevents = node->ms_nasyncplans + 1; /* one for PM death */
+	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
+	int			noccurred;
+	int			i;
+
+	/* We should never be called when there are no valid async subplans. */
+	Assert(bms_num_members(node->ms_asyncremain) > 0);
+
+	node->ms_eventset = CreateWaitEventSet(CurrentResourceOwner, nevents);
+	AddWaitEventToSet(node->ms_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+					  NULL, NULL);
+
+	/* Give each waiting subplan a chance to add an event. */
+	i = -1;
+	while ((i = bms_next_member(node->ms_asyncplans, i)) >= 0)
+	{
+		AsyncRequest *areq = node->ms_asyncrequests[i];
+
+		if (areq->callback_pending)
+			ExecAsyncConfigureWait(areq);
+	}
+
+	/*
+	 * No need for further processing if there are no configured events other
+	 * than the postmaster death event.
+	 */
+	if (GetNumRegisteredWaitEvents(node->ms_eventset) == 1)
+	{
+		FreeWaitEventSet(node->ms_eventset);
+		node->ms_eventset = NULL;
+		return;
+	}
+
+	/* We wait on at most EVENT_BUFFER_SIZE events. */
+	if (nevents > EVENT_BUFFER_SIZE)
+		nevents = EVENT_BUFFER_SIZE;
+
+	/*
+	 * Wait until at least one event occurs.
+	 */
+	noccurred = WaitEventSetWait(node->ms_eventset, -1 /* no timeout */, occurred_event,
+								 nevents, WAIT_EVENT_APPEND_READY);
+	FreeWaitEventSet(node->ms_eventset);
+	node->ms_eventset = NULL;
+	if (noccurred == 0)
+		return;
+
+	/* Deliver notifications. */
+	for (i = 0; i < noccurred; i++)
+	{
+		WaitEvent  *w = &occurred_event[i];
+
+		/*
+		 * Each waiting subplan should have registered its wait event with
+		 * user_data pointing back to its AsyncRequest.
+		 */
+		if ((w->events & WL_SOCKET_READABLE) != 0)
+		{
+			AsyncRequest *areq = (AsyncRequest *) w->user_data;
+
+			if (areq->callback_pending)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				areq->callback_pending = false;
+
+				/* Do the actual work. */
+				ExecAsyncNotify(areq);
+			}
+		}
+	}
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835d..86347b20839 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -152,6 +152,7 @@ bool		enable_parallel_hash = true;
 bool		enable_partition_pruning = true;
 bool		enable_presorted_aggregate = true;
 bool		enable_async_append = true;
+bool		enable_async_merge_append = true;
 
 typedef struct
 {
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c2e879815b1..54c7d8388af 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1447,6 +1447,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	bool		consider_async = false;
 
 	/*
 	 * We don't have the actual creation of the MergeAppend node split out
@@ -1461,6 +1462,9 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
 	plan->righttree = NULL;
 	node->apprelids = rel->relids;
 
+	consider_async = (enable_async_merge_append &&
+					  !best_path->path.parallel_safe &&
+					  list_length(best_path->subpaths) > 1);
 	/*
 	 * Compute sort column info, and adjust MergeAppend's tlist as needed.
 	 * Because we pass adjust_tlist_in_place = true, we may ignore the
@@ -1536,6 +1540,10 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
 			subplan = (Plan *) sort;
 		}
 
+		/* If needed, check to see if subplan can be executed asynchronously */
+		if (consider_async)
+			mark_async_capable_plan(subplan, subpath);
+
 		subplans = lappend(subplans, subplan);
 	}
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 630ed0f1629..0dddac70406 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -988,6 +988,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_async_merge_append", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of async merge append plans."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_async_merge_append,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_group_by_reordering", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables reordering of GROUP BY keys."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9ec9f97e926..b4e8b78925c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -392,6 +392,7 @@
 # - Planner Method Configuration -
 
 #enable_async_append = on
+#enable_async_merge_append = on
 #enable_bitmapscan = on
 #enable_gathermerge = on
 #enable_hashagg = on
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index c39a9fc0cb2..4371426bea1 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -19,5 +19,6 @@
 extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
 extern void ExecEndMergeAppend(MergeAppendState *node);
 extern void ExecReScanMergeAppend(MergeAppendState *node);
+extern void ExecAsyncMergeAppendResponse(AsyncRequest *areq);
 
 #endif							/* NODEMERGEAPPEND_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cac684d9b3a..440ab40f9e9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1487,10 +1487,66 @@ typedef struct MergeAppendState
 	TupleTableSlot **ms_slots;	/* array of length ms_nplans */
 	struct binaryheap *ms_heap; /* binary heap of slot indices */
 	bool		ms_initialized; /* are subplans started? */
+	Bitmapset  *ms_asyncplans;	/* asynchronous plans indexes */
+	int			ms_nasyncplans; /* # of asynchronous plans */
+	AsyncRequest **ms_asyncrequests;	/* array of AsyncRequests */
+	TupleTableSlot **ms_asyncresults;	/* unreturned results of async plans */
+	Bitmapset  *ms_has_asyncresults;	/* plans which have async results */
+	Bitmapset  *ms_asyncremain; /* remaining asynchronous plans */
+	Bitmapset  *ms_needrequest; /* asynchronous plans needing a new request */
+	struct WaitEventSet *ms_eventset;	/* WaitEventSet used to configure file
+										 * descriptor wait events */
 	struct PartitionPruneState *ms_prune_state;
+	bool		ms_valid_subplans_identified;	/* is ms_valid_subplans valid? */
 	Bitmapset  *ms_valid_subplans;
+	Bitmapset  *ms_valid_asyncplans;	/* valid asynchronous plans indexes */
 } MergeAppendState;
 
+/* Getters for AppendState and MergeAppendState */
+static inline struct WaitEventSet *
+GetAppendEventSet(PlanState *ps)
+{
+	Assert(IsA(ps, AppendState) || IsA(ps, MergeAppendState));
+
+	if (IsA(ps, AppendState))
+		return ((AppendState *) ps)->as_eventset;
+	else
+		return ((MergeAppendState *) ps)->ms_eventset;
+}
+
+static inline Bitmapset *
+GetNeedRequest(PlanState *ps)
+{
+	Assert(IsA(ps, AppendState) || IsA(ps, MergeAppendState));
+
+	if (IsA(ps, AppendState))
+		return ((AppendState *) ps)->as_needrequest;
+	else
+		return ((MergeAppendState *) ps)->ms_needrequest;
+}
+
+static inline Bitmapset *
+GetValidAsyncplans(PlanState *ps)
+{
+	Assert(IsA(ps, AppendState) || IsA(ps, MergeAppendState));
+
+	if (IsA(ps, AppendState))
+		return ((AppendState *) ps)->as_valid_asyncplans;
+	else
+		return ((MergeAppendState *) ps)->ms_valid_asyncplans;
+}
+
+static inline AsyncRequest *
+GetValidAsyncRequest(PlanState *ps, int nreq)
+{
+	Assert(IsA(ps, AppendState) || IsA(ps, MergeAppendState));
+
+	if (IsA(ps, AppendState))
+		return ((AppendState *) ps)->as_asyncrequests[nreq];
+	else
+		return ((MergeAppendState *) ps)->ms_asyncrequests[nreq];
+}
+
 /* ----------------
  *	 RecursiveUnionState information
  *
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b1c51a4e70f..ee191ae445a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -70,6 +70,7 @@ extern PGDLLIMPORT bool enable_parallel_hash;
 extern PGDLLIMPORT bool enable_partition_pruning;
 extern PGDLLIMPORT bool enable_presorted_aggregate;
 extern PGDLLIMPORT bool enable_async_append;
+extern PGDLLIMPORT bool enable_async_merge_append;
 extern PGDLLIMPORT int constraint_exclusion;
 
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 729620de13c..24675cf1982 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -135,6 +135,7 @@ select name, setting from pg_settings where name like 'enable%';
               name              | setting 
 --------------------------------+---------
  enable_async_append            | on
+ enable_async_merge_append      | on
  enable_bitmapscan              | on
  enable_gathermerge             | on
  enable_group_by_reordering     | on
@@ -156,7 +157,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(22 rows)
+(23 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.34.1
