Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P <loknath...@gmail.com> wrote:
> On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <fola...@peoplecall.com> wrote:
> However, as we have ~5 billion rows in the base table, and out of that we
> were expecting almost half, i.e. ~2 billion, would be duplicates. And you
> said doing the inserts using the "order by CTID offset" approach must cause
> one full sequential scan of the whole table for loading each chunk/10M of
> rows, and that would take a long time I believe.

I did not say "MUST CAUSE". In fact, I said I believe it would not.

What I am going to say (now) is: test it. Make 1k, 10k, 100k and 1M row
tables in a scratch database, and EXPLAIN and test your queries there with
and without indexes, etc. Not all of them are needed, but testing 100k and
1M in 1k batches could show you any hidden quadratic behaviour, and EXPLAIN
would show you unexpected sorts or scans.

> I am still trying to understand the other approach which you suggested. Not
> able to understand "you can select where index_col > last order by
> index_col limit 10M".
> However, to get the max ID value of the last 10M loaded rows in target, do
> you say that having a PK index created on that target table column (ID)
> will help, and we can save the max(ID) value subsequently in another table
> to fetch and keep loading from the source table (as ID > Max_ID stored in
> temp table)?

I am a programmer by trade. When faced with problems like these, unless I
find a trivial solution, I tend to write auxiliary programs, as it is much
easier for me to write a thousand lines of Perl (or Python/C/C++/Java), or a
couple of hundred lines of plpgsql (if I want to avoid round trips), than to
debug complicated SQL-only workflows.

For your kind of problem I would write a program to read the rows and insert
them. As an example, let's say you have a non-unique index on ID and are
going to use the on-conflict-do-nothing route (I do believe this would be
slow due to the need of having an index on the target table to support it,
but let's assume it is OK).
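To make the "select where index_col > last order by index_col limit N" idea
concrete, here is a rough sketch of that keyset loop, simulated on an
in-memory list instead of a real table (in the real program `fetch_chunk`
would be a `SELECT ... WHERE id > %s ORDER BY id LIMIT %s` through psycopg2
or similar; all names here are made up for illustration):

```python
# Sketch of keyset-based chunked copying: instead of OFFSET (which rescans),
# each iteration asks only for rows with id greater than the last id already
# loaded, in id order, so each chunk is an index range scan.

def fetch_chunk(table, last_id, limit):
    """Stand-in for: SELECT * FROM source WHERE id > last_id
       ORDER BY id LIMIT limit.  Rows are (id, payload) tuples."""
    return sorted(r for r in table if r[0] > last_id)[:limit]

def copy_in_chunks(source, chunk_size):
    target = []
    last_id = -1                      # smaller than any real id
    while True:
        chunk = fetch_chunk(source, last_id, chunk_size)
        if not chunk:
            break                     # nothing left above last_id: done
        target.extend(chunk)          # stand-in for the INSERT of the chunk
        last_id = chunk[-1][0]        # remember the max id loaded so far
    return target

rows = [(i, f"payload-{i}") for i in range(10)]
print(copy_in_chunks(rows, 3))        # all ten rows, loaded in 4 chunks
```

Note the strict `id > last_id` assumes ids are unique at chunk boundaries;
with a non-unique ID you would use `>=` and let on-conflict-do-nothing (or
client-side filtering, as below) absorb the overlap.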
To do that I may just do a loop, starting with last_id = -1 (some id less
than any other id), selecting a chunk of rows with id >= last_id ordered by
id, and inserting them. After doing that I may notice that, if the sort
order is right, I do not need the index at all: drop the index and the
on-conflict and, for every row, insert only if (id > last_id) before storing
last_id = id.

Anyway, not knowing the real table characteristics and current usage
patterns, I cannot recommend anything concrete.

> Would it be better to do it in one shot only, but by setting a higher
> value of some parameters like "maintenance_work_mem" or
> "max_parallel_workers"?

It depends on a lot of things unknown (to us).

Francisco Olarte.
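The second refinement above (drop the index and the on-conflict, and
deduplicate in the client) can be sketched in a few lines of Python.
It works only because the rows arrive sorted by id, so all duplicates of
a given id are adjacent; the names are again invented for the example:

```python
# Client-side dedup while streaming rows ordered by id: keep the last id
# inserted, and insert a row only when its id is strictly greater, so the
# target needs no supporting index and no ON CONFLICT clause.

def dedup_insert(rows_sorted_by_id):
    target = []
    last_id = -1                   # some id less than any other id
    for id_, payload in rows_sorted_by_id:
        if id_ > last_id:          # first occurrence of this id: insert
            target.append((id_, payload))
        last_id = id_              # repeats (id == last_id) are skipped
    return target

src = sorted([(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c"), (3, "c")])
print(dedup_insert(src))           # -> [(1, 'a'), (2, 'b'), (3, 'c')]
```

This trades the index maintenance cost on the target for one sorted read of
the source, which is the point of the whole approach.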