On 6/17/23 15:48, Nicolas Paris wrote:
> In my use case I have a 2 billion row / 1 TB table. I have daily data to
> upsert, around 2 million rows with say 50% inserts, based on the primary
> key, into a freshly analyzed table.
>
> I have tested multiple strategies to merge the data, all based on a first
> stage that copies the 2M dataset into a staging unlogged / indexed table:
>
> 1. Join insert then join update
> 2.1. Usage of the new MERGE statement
> 2.2. Usage of MERGE on two hash-partitioned tables with partition-wise join
>      enabled
> 3. Usage of MERGE by batches of 1000 rows
>
> First remark: the MERGE statement is almost 30% faster than the two
> statements in my benchmarks. Thanks to the pg community for this.
>
> While strategies 1 and 2.x are incredibly slow (canceled after 10 hours),
> the third one finishes within 30 minutes.
>

Seems pretty terrible, provided the data is on reasonable storage (with
acceptable random I/O behavior).
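Just to be sure we're talking about the same thing, I assume the
single-statement variant (2.1) looks roughly like this -- only a sketch,
the table and column names below are invented:

    MERGE INTO big_table t
    USING staging_table s ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET payload = s.payload
    WHEN NOT MATCHED THEN
      INSERT (id, payload) VALUES (s.id, s.payload);

If what you actually run is substantially different, please share the real
statement.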
> My interpretation, reading the query plans, is: well-sized small batches
> of upserts leverage the indexes, while the regular join chooses a
> sequential scan, including sorting and hashing, which takes forever in
> time and resources, including disk.

You may be right, but it's hard to tell without seeing the query plan.

> Sadly my incoming dataset is too small to benefit from a seq scan and too
> large to benefit from an index scan join. However, when split manually
> into N portions, the problem can be tackled with N * small cost, which is
> cheap anyway.

Sounds very much like you'd benefit from tuning some cost parameters to
make the index scan look cheaper (see the PS at the end of this mail).

> Questions:
> 1. Is there another strategy ?
> 2. Could postgres support a "batched indexed join" itself, leveraging
> indexes by dynamically sized batches ?

Not sure what a 'batched indexed join' would be, but it very much sounds
like a nested loop with an index scan.

> It is error prone to write code to split and iterate. I suspect postgres
> has everything internally (indexes, catalog, planner) to split the job
> itself, making David vs Goliath something trivial.

What PostgreSQL version are you using, what hardware? Did you tune it in
any way, or is everything just default?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
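PS: If the config is close to default, the first thing I'd try is making
random I/O look cheaper to the planner -- only a sketch, the values below
are guesses and need to be validated on your hardware:

    SET random_page_cost = 1.1;          -- default is 4.0
    SET effective_cache_size = '32GB';   -- roughly the memory available for caching

and then re-run EXPLAIN (ANALYZE, BUFFERS) on the MERGE to see whether the
plan switches from the seq scan / hash join to an index-based nested loop.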