GitHub user avamingli edited a discussion: Make UNION Parallel

### Description

In PostgreSQL, the UNION operator can leverage parallel processing through the 
Parallel Append node with subnodes. 

```sql
explain (costs off, verbose) select * from t1 union select * from t2;
                    QUERY PLAN
--------------------------------------------------
 HashAggregate
   Output: t1.a, t1.b
   Group Key: t1.a, t1.b
   ->  Gather
         Output: t1.a, t1.b
         Workers Planned: 3
         ->  Parallel Append
               ->  Parallel Seq Scan on public.t1
                     Output: t1.a, t1.b
               ->  Parallel Seq Scan on public.t2
                     Output: t2.a, t2.b
(11 rows)

```


However, in CBDB, we face challenges when UNION subqueries include Motion 
nodes, which can lead to incorrect results if Parallel Append is used. This is 
due to the competition among workers for subnodes: some subnodes are executed 
by a single worker while others may be processed by multiple workers.

When a worker completes a task for a multi-worker subnode, it marks that job as 
finished. 
While this is acceptable in PostgreSQL, but in CBDB, the presence of Motion 
nodes complicates this process. The premature marking of completion can cause 
the Motion Sender to stop early, resulting in data loss for other workers.

For cases involving only Scan nodes—such as Parallel Scans on partitioned 
tables—this issue does not arise. 
However, for UNION ALL and similar scenarios, the use of Parallel Append is 
unsafe. That's another problem and we should consider disabling it in the 
future.

Despite these challenges, as an MPP database, CBDB has the potential to support 
parallel processing for UNION operations. This can be achieved if multiple 
workers execute Append nodes within a slice and the data is well-distributed 
(hashed across multiple segments relative to the cluster) according to the 
output columns of the subqueries. 
Specifically, we can implement a Parallel-oblivious Append approach, where 
multiple workers operate independently without sharing state, rather than a 
Parallel-aware Append that requires coordination among workers.



### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1202

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org

Reply via email to