GitHub user my-ship-it edited a discussion: [Ideas] Should we support ON CONFLICT when updating the distribution keys?
### Description Hackers, Currently, when the ON CONFLICT statement involves distributed keys, an error occurs, as shown below: ``` postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary key(v1, v2)) distributed by(v1, v2); CREATE TABLE postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE SET v2=EXCLUDED.v2; ERROR: modification of distribution columns in OnConflictUpdate is not supported ``` However, it can work if it does not include distribution keys ``` postgres=# create table tbl(v1 int, v2 int, v3 int, CONSTRAINT xx primary key(v1, v2)) distributed by(v1); CREATE TABLE postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE SET v2=EXCLUDED.v2; QUERY PLAN ------------------------------------------------- Insert on tbl (cost=0.00..0.03 rows=0 width=0) Conflict Resolution: UPDATE Conflict Arbiter Indexes: xx -> Result (cost=0.00..0.01 rows=1 width=12) Optimizer: Postgres query optimizer (5 rows) It seems that the conflict operation of PG is implemented in the executor phase. ``` This is because we did some strict checks in the function sanity_check_on_conflict_update, requiring that the columns updated on conflict cannot include distribution keys. At least, we could do some simple optimizations and more detailed checks, such as if the distribution key is not updated, error won't be thrown out. Furthermore, if there is a modification of the distribution key, we can borrow the way of SplitUpdate, introduce a new executor OnConflictSplitUpdate, and convert the UPSERT operation to INSERT or DELETE + INSERT according to the index check result. If the OnConflictSplitUpdate node checks the index and finds a conflict, it will generate two operations, DELETE + INSERT, just like the SplitUpdate node, otherwise only INSERT will be generated. The final update is completed by Motion to the nodeModifyTable node (writer Gang) of the upper layer ``` postgres=# explain insert into tbl values(1, 2) ON CONFLICT(v1, v2) DO UPDATE SET v1 = EXCLUDED.v1 + 1, v2=EXCLUDED.v1 + EXCLUDED.v2; QUERY PLAN ---------------------------------------------------------------------------------------------------- ----- Update on public.tbl (cost=0.00..1397.25 rows=0 width=0) -> Explicit Redistribute Motion 3:3 (slice1; segments: 3) (cost=0.00..1397.25 rows=51933 width =22) Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, (DMLAction) -> OnConflictSplit (cost=0.00..358.58 rows=51933 width=22) Conflict Resolution: UPDATE Conflict Arbiter Indexes: xx Output: ((1 + 1)), ((1 + 2)), ctid, gp_segment_id, DMLAction -> Seq Scan on public.tbl (cost=0.00..358.58 rows=25967 width=22) Output: 1, 2, ctid, gp_segment_id (9 rows) ``` Additional Motion will introduce costs, and we don't need to generate OnConflictSplit node every time, such as when the distribution key is not updated. We need to implement in both GPORCA and legacy planner, but could implement in legacy planner first. Of course, like the SplitUpdate type, not all queries can support this method, and we need to do some detailed filtering and processing. Any ideas are welcome. ### Use case/motivation _No response_ ### Related issues _No response_ ### Are you willing to submit a PR? - [ ] Yes I am willing to submit a PR! GitHub link: https://github.com/apache/cloudberry/discussions/902 ---- This is an automatically sent email for dev@cloudberry.apache.org. To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org For additional commands, e-mail: dev-h...@cloudberry.apache.org