Lok:

On Sat, 3 Feb 2024 at 21:44, Lok P <loknath...@gmail.com> wrote:

> On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <fola...@peoplecall.com> 
> wrote:

> However , as we have ~5billion rows in the base table and out of that , we 
> were expecting almost half i.e. ~2billion would be duplicates. And you said, 
> doing the inserts using the "order by CTID Offset" approach must cause one 
> full sequential scan of the whole table for loading each chunk/10M of rows 
> and that would take a long time I believe.

I did not say "must cause". In fact, I said I believe it would not.

What I am going to say (now) is: test it. Make 1k, 10k, 100k and 1M
row tables in a scratch database, and EXPLAIN and test your approaches
there, with and without an index, etc. Not all of them are needed, but
testing 100k and 1M in 1k batches could reveal any quadratic behaviour
you are missing, and EXPLAIN would show unexpected sorts or scans.
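To make the quadratic risk concrete, here is a small Python sketch (an
in-memory cost model, not real Postgres) comparing how many rows an
OFFSET-style chunked scan must touch versus a keyset (id > last) scan;
the chunk sizes are just illustrative:

```python
# Cost model of chunked reads over a table of n_rows, in chunks of
# `chunk` rows. "Cost" = rows the server must walk through to emit
# every chunk. This is arithmetic only, not a database benchmark.

def offset_scan_cost(n_rows, chunk):
    # OFFSET k LIMIT chunk must walk past the first k rows each time,
    # so later chunks get progressively more expensive (quadratic total).
    return sum(min(n_rows, off + chunk) for off in range(0, n_rows, chunk))

def keyset_scan_cost(n_rows, chunk):
    # WHERE id > last_id ORDER BY id LIMIT chunk lands directly on the
    # next chunk via the index, so each chunk costs only its own rows.
    return sum(min(chunk, n_rows - off) for off in range(0, n_rows, chunk))

if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, offset_scan_cost(n, 1_000), keyset_scan_cost(n, 1_000))
```

At 100k rows in 1k batches the offset variant already touches ~5M rows
versus 100k for keyset, which is exactly the kind of gap a scratch-table
test would surface.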

> I am still trying to understand the other approach which you suggested. Not 
> able to understand "you can select where index_col > last order by index_col 
> limit 10M," .
> However, to get the max ID value of the last 10M loaded rows in target, do 
> you say that having an PK index created on that target table column(ID) will 
> help, and we can save the max (ID) value subsequently in another table to 
> fetch and keep loading from the source table (as ID>Max_ID stored in temp 
> table)?

I am a programmer by trade. When faced with problems like these,
unless I find a trivial solution, I tend to write auxiliary programs,
as it is much easier for me to write a thousand lines of Perl (or
Python/C/C++/Java), or a couple of hundred lines of plpgsql if I want
to avoid round trips, than to debug complicated SQL-only workflows.

For your kind of problem I would write a program to read the rows and
insert them. As an example, let's say you have a non-unique index on ID
and are going to use the on-conflict-do-nothing route (I believe this
would be slow, due to needing an index on the target table to support
it, but let's assume it is OK). To do that I might just loop, starting
with last_id = -1 (or some id lower than any real one), selecting a
chunk of rows with id >= last_id ordered by id, and inserting them.
After doing that I might notice that, if the sort order is right, I do
not need the index at all: drop the index and the ON CONFLICT clause
and, for every row, insert only if id > last_id, then store
last_id = id.
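That loop can be sketched in Python. This is a pure in-memory stand-in
for the SELECT/INSERT round trips (the function name, the list-based
"table" and the chunking are all made up for illustration), but the
last_id bookkeeping is the point:

```python
def copy_distinct(source_rows, chunk_size):
    """Walk (id, payload) rows in id order in chunks, inserting each id
    once. Stands in for: SELECT ... WHERE id >= last_id ORDER BY id
    LIMIT chunk_size, then INSERT the not-yet-seen ids."""
    target = []
    last_id = float("-inf")          # lower than any real id
    rows = sorted(source_rows)       # the index would give this order for free
    pos = 0
    while pos < len(rows):
        chunk = rows[pos:pos + chunk_size]   # one SELECT's worth
        pos += chunk_size
        for id_, payload in chunk:
            if id_ > last_id:        # replaces ON CONFLICT DO NOTHING
                target.append((id_, payload))
                last_id = id_
    return target
```

Because the rows arrive sorted by id, the duplicates are always
adjacent, so a single comparison against last_id is enough and no
target-side index is needed.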

Anyway, not knowing the real table characteristics and current usage
patterns, I cannot recommend anything concrete.

> Would it be better to do it in one shot only , but by setting a higher value 
> of some parameters like "maintenance_work_mem" or "max_parallel_workers"?

It depends on a lot of unknown ( to us ) things.

Francisco Olarte.
