On 6/17/23 15:48, Nicolas Paris wrote:
> In my use case I have a 2 billion row / 1 TB table. I have around 2 million
> rows of daily data to upsert, with say 50% inserts, based on the primary key,
> into a freshly analyzed table.
> 
> I have tested multiple strategies to merge the data, all based on a first
> stage that copies the 2M dataset into an unlogged, indexed staging table:
> 
> 1. Join insert then join update
> 2.1 Use of the new MERGE statement
> 2.2 MERGE on two hash-partitioned tables with partition-wise join enabled
> 3. MERGE in batches of 1000 rows
> 
> First remark: in my benchmarks, the MERGE statement is almost 30% faster than
> the two separate statements. Thanks to the pg community for this.
> 
> While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the
> third one finishes within 30 minutes.
> 

Seems pretty terrible, provided the data is on reasonable storage (with
acceptable random I/O behavior).
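
For reference, I'm assuming the batched variant (3) does something roughly
like the sketch below (the table/column names and the plpgsql loop are my
guess at it, not necessarily what you actually ran):

  -- hypothetical sketch of strategy 3: MERGE the staging data in small
  -- key-range batches instead of one big statement (PG 15+ syntax)
  DO $$
  DECLARE
      lo bigint;
      hi bigint;
  BEGIN
      SELECT min(id), max(id) INTO lo, hi FROM staging;
      WHILE lo <= hi LOOP
          MERGE INTO target t
          USING (SELECT * FROM staging
                  WHERE id >= lo AND id < lo + 1000) s
          ON t.id = s.id
          WHEN MATCHED THEN
              UPDATE SET payload = s.payload
          WHEN NOT MATCHED THEN
              INSERT (id, payload) VALUES (s.id, s.payload);
          lo := lo + 1000;
      END LOOP;
  END $$;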

> My interpretation, reading the query plan, is that well-sized small batches
> of upserts leverage the indexes, while the regular join chooses a sequential
> scan, including sorting and hashing, which takes forever in time and
> resources, including disk.

You may be right, but it's hard to tell without seeing the query plan.
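
If you can, share it with buffer statistics, e.g. something like the
following (I'm making up the table/column names here, and note that ANALYZE
actually executes the statement, so wrap it in a transaction you roll back):

  BEGIN;
  EXPLAIN (ANALYZE, BUFFERS)
  MERGE INTO target t
  USING staging s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);
  ROLLBACK;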

> 
> Sadly, my incoming dataset is too small to benefit from a seq scan and too
> large to benefit from an index scan join. However, when split manually into
> N portions, the problem can be tackled with N * small cost, which is cheap
> anyway.
> 

Sounds very much like you'd benefit from tuning some cost parameters to
make the index scan look cheaper.
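
For example, if this is on SSDs, lowering random_page_cost (and setting
effective_cache_size to roughly the memory available for caching) often makes
the planner pick the index. The values below are only illustrative:

  SET random_page_cost = 1.1;         -- default 4.0 assumes spinning disks
  SET effective_cache_size = '16GB';  -- adjust to your actual RAM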

> Questions:
> 1. Is there another strategy?
> 2. Could postgres support a "batched indexed join" itself, leveraging
> indexes in dynamically sized batches?
> 

Not sure what 'batched indexed join' would be, but it very much sounds
like a nested loop with an index scan.
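
You can check whether that plan shape would actually win by forcing it in a
throwaway transaction and re-running the EXPLAIN from above:

  BEGIN;
  SET LOCAL enable_hashjoin = off;    -- only affects this transaction
  SET LOCAL enable_mergejoin = off;
  -- re-run the EXPLAIN (ANALYZE, BUFFERS) here and compare timings
  ROLLBACK;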

> 
> It is error prone to write code to split and iterate. I suspect postgres has
> everything internally (index catalog, planner) to split the job by itself,
> making David vs Goliath something trivial.
> 

What PostgreSQL version are you using, and on what hardware? Did you tune it
in any way, or is everything just at the defaults?
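
Something like this would show the version and any non-default settings:

  SELECT version();

  SELECT name, setting, unit
    FROM pg_settings
   WHERE source NOT IN ('default', 'override');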


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

