I copy/pasted your question into ChatGPT, and it gave me 10 specific suggestions. Have you tried those?
On Sat, Feb 3, 2024 at 10:55 AM Lok P <loknath...@gmail.com> wrote:

> Apologies, one correction: the query is as below, i.e. the filter will
> be on ctid, which I believe is the equivalent of rowid in Oracle, so we
> will not need the index on the ID column.
>
> But it still runs long, so is there any other way to make the duplicate
> removal faster?
>
> Also wondering: could the index creation, which took ~2.5hrs+, have been
> made faster in any way by allowing it more DB resources through some
> session-level parameter setting?
>
> create table TAB1_New
> as
> SELECT * from TAB1 A
> where CTID in
>   (select min(CTID) from TAB1
>    group by ID having count(ID)>=1);
>
> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath...@gmail.com> wrote:
>
>> Hello All,
>> A non-partitioned table has ~4.8 billion rows, a data size of ~1.4TB, a
>> row size of ~313 bytes, and ~127 columns. Approximately 1 billion+
>> duplicate rows have been inserted into it, and we want to remove the
>> duplicate data and then create a PK/unique constraint back so that
>> duplicate values cannot occur in the future. We are struggling to do this.
>>
>> Teammates suggested doing it with the CTAS method, i.e. create a new
>> table with the unique record set and then drop the main table. Something
>> like:
>>
>> create table TAB1_New
>> as
>> SELECT * from TAB1 A
>> where ID in
>>   (select min(ID) from TAB1
>>    group by ID having count(ID)>=1);
>>
>> But for the above to run faster, they suggested creating an index on
>> the column used for the duplicate check, i.e. the ID column. Creating
>> that index itself took ~2hrs+, and its size is now ~116GB:
>>
>> Create index idx1 on TAB1(ID);
>>
>> We then ran the SELECT subquery of the CTAS statement to check that it
>> returns the correct unique record count. It ran for 2.5 hrs before we
>> killed it. Below is its plan:
>>
>> explain
>> select min(ID) from TAB1 A group by ID having count(ID)>=1;
>>
>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>   Group Key: ID
>>   Filter: (count(ID) >= 1)
>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
>>       rows=4883397120 width=14)
>>
>> I want to understand whether this can be done faster in any way. I am
>> also worried that recreating the PK constraint/index after deleting the
>> duplicates will also run forever. Is there any way to make these heavy
>> operations faster on Postgres by providing more database resources
>> through some parameter setup, like a parallel hint etc.? We have the
>> pg_hint_plan extension installed, but the parallel hint is not enforced
>> when we add it to the query.
>>
>> In Oracle we have parallel hints, direct-path read/write for faster
>> read/write operations, parallel index scans, etc. Is anything similar
>> available in Aurora Postgres to give these batch operations more
>> horsepower and speed them up? And how can we monitor the progress of a
>> running query?
>>
>> Just to note: it's a db.r7g.8xlarge Aurora Postgres instance (32 vCPUs,
>> 256GB RAM), PG version 15.4.
>>
>> Regards
>> Lok

--
Todd Lewis
tle...@brickabode.com
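
For reference on the thread's first question: the ID-based subquery quoted
above cannot remove duplicates, because "select min(ID) ... group by ID"
returns every distinct ID and "having count(ID)>=1" is always true, so the
outer WHERE keeps all rows. The ctid-based correction is the workable form.
A minimal sketch of two alternatives, using the thread's TAB1/ID names
(min(ctid) works on the thread's PG 15.4, where it did run, just slowly):

  -- Keep one physical row per ID via min(ctid), as in the corrected query;
  -- the always-true HAVING clause is dropped.
  create table TAB1_New as
  select t.*
  from TAB1 t
  where t.ctid in (select min(ctid) from TAB1 group by ID);

  -- Alternative (pick one of the two): DISTINCT ON keeps the first row per
  -- ID; which duplicate survives is arbitrary unless more ORDER BY columns
  -- are added after ID.
  create table TAB1_New as
  select distinct on (ID) *
  from TAB1
  order by ID;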
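
On the question of session-level settings for the index build: in stock
PostgreSQL 15 the two knobs that matter most for a btree build are
maintenance_work_mem and max_parallel_maintenance_workers. A sketch with
assumed values for a 256GB-RAM instance (Aurora may cap or manage some
parameters, so treat the numbers as illustrative, and the index/constraint
names below are made up):

  -- Session-level settings that speed up a btree build.
  set maintenance_work_mem = '8GB';          -- sort memory, shared among the
                                             -- leader and parallel workers
  set max_parallel_maintenance_workers = 8;  -- also bounded by
                                             -- max_parallel_workers

  -- Since TAB1_New is freshly built and not yet serving traffic, a plain
  -- (non-CONCURRENTLY) build is the faster option. Attaching the finished
  -- index as the PK keeps the heavy work out of the ALTER TABLE step.
  create unique index tab1_new_id_uidx on TAB1_New (ID);
  alter table TAB1_New
    add constraint tab1_new_pk primary key using index tab1_new_id_uidx;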
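
On the pg_hint_plan/parallel question: stock PostgreSQL (11 and later) can
already use a parallel plan for the SELECT part of CREATE TABLE AS; only the
tuple writes stay in the leader process. Whether the planner actually picks
a parallel plan depends on costing, which these session settings influence.
Again a sketch with assumed values, not a recommendation:

  set max_parallel_workers_per_gather = 8;  -- workers per Gather node
  set parallel_setup_cost = 0;              -- make parallel plans cheap to
  set parallel_tuple_cost = 0;              -- consider
  set work_mem = '1GB';                     -- per-node sort/hash memory

  -- Check the plan before running; look for Gather / Gather Merge nodes.
  explain
  create table TAB1_New as
  select distinct on (ID) * from TAB1 order by ID;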
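
On monitoring a running query: PostgreSQL 15 ships command-specific progress
views (pg_stat_progress_create_index, pg_stat_progress_vacuum,
pg_stat_progress_copy, ...), but there is no percent-done counter for an
arbitrary SELECT or CTAS; for those, pg_stat_activity plus wait events is
the practical tool. Both queries below use only stock catalog views:

  -- Progress of a running CREATE INDEX (one row per in-flight build).
  select pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
  from pg_stat_progress_create_index;

  -- What each backend is doing, how long it has run, what it waits on.
  select pid, state, wait_event_type, wait_event,
         now() - query_start as runtime,
         left(query, 80) as query
  from pg_stat_activity
  where state <> 'idle'
  order by runtime desc nulls last;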