GitHub user my-ship-it edited a discussion: [Ideas] Should we support ON 
CONFLICT when updating the distribution keys?

### Description

Hackers,

Currently, when the ON CONFLICT statement involves distributed keys, an error 
occurs, as shown below:

```
postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary 
key(v1, v2)) distributed by(v1, v2);
CREATE TABLE
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE 
SET v2=EXCLUDED.v2;
ERROR:  modification of distribution columns in OnConflictUpdate is not 
supported
```

However, it can work if it does not include distribution keys

```
postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary 
key(v1, v2)) distributed by(v1);
CREATE TABLE
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE 
SET v2=EXCLUDED.v2;
                   QUERY PLAN
-------------------------------------------------
 Insert on tbl  (cost=0.00..0.03 rows=0 width=0)
   Conflict Resolution: UPDATE
   Conflict Arbiter Indexes: xx
   ->  Result  (cost=0.00..0.01 rows=1 width=12)
 Optimizer: Postgres query optimizer
(5 rows)

It seems that the conflict operation of PG is implemented in the executor phase.
```

This is because we did some strict checks in the function 
sanity_check_on_conflict_update, requiring that the columns updated on conflict 
cannot include distribution keys.
At least, we could do some simple optimizations and more detailed checks, such 
as if the distribution key is not updated,
error won't be thrown out.

Furthermore, if there is a modification of the distribution key, we can borrow 
the way of SplitUpdate, introduce a new executor OnConflictSplitUpdate, and 
convert the UPSERT operation to INSERT or DELETE + INSERT according to the 
index check result. If the OnConflictSplitUpdate node checks the index and 
finds a conflict, it will generate two operations, DELETE + INSERT, just like 
the SplitUpdate node, otherwise only INSERT will be generated. The final update 
is completed by Motion to the nodeModifyTable node (writer Gang) of the upper 
layer
```
postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE 
SET v1 = EXCLUDED.v1 + 1, v2=EXCLUDED.v1 + EXCLUDED.v2;
                                               QUERY PLAN

----------------------------------------------------------------------------------------------------
-----
 Update on public.tbl  (cost=0.00..1397.25 rows=0 width=0)
   ->  Explicit Redistribute Motion 3:3  (slice1; segments: 3)  
(cost=0.00..1397.25 rows=51933 width
=22)
         Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, (DMLAction)
         ->  OnConflictSplit  (cost=0.00..358.58 rows=51933 width=22)
               Conflict Resolution: UPDATE
               Conflict Arbiter Indexes: xx
               Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, DMLAction
               ->  Seq Scan on public.tbl  (cost=0.00..358.58 rows=25967 
width=22)
                     Output: 1, 2, ctid, gp_segment_id
(9 rows)
```

Additional Motion will  introduce costs, and we don't need to generate 
OnConflictSplit node every time, such as when the distribution key is not 
updated.
We need to implement in both GPORCA and legacy planner, but could implement in 
legacy planner first.
Of course, like the SplitUpdate type, not all queries can support this method, 
and we need to do some detailed filtering and processing.

Any ideas are welcome.

### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/902

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org

Reply via email to