On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson <ronljohnso...@gmail.com> wrote:
> On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath...@gmail.com> wrote:
>
>> Apology. One correction: the query is like below, i.e. the filter will be
>> on CTID, which I believe is the equivalent of ROWID in Oracle, so we will
>> not need the index on the ID column then.
>>
>> But it still runs long, so is there any other way to make the duplicate
>> removal faster?
>>
>> Also wondering about the index creation which took ~2.5hrs+ -- could that
>> have been made faster in any way by allowing more DB resources through
>> some session-level parameter setting?
>>
>> create table TAB1_New
>> as
>> SELECT * from TAB1 A
>> where CTID in
>> (select min(CTID) from TAB1
>> group by ID having count(ID)>=1 );
>>
>> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath...@gmail.com> wrote:
>>
>>> Hello All,
>>> A non-partitioned table has ~4.8 billion rows, a data size of ~1.4TB, a
>>> row size of ~313 bytes, and ~127 columns. Approximately ~1 billion+
>>> duplicate rows have been inserted into it, and we want to remove the
>>> duplicate data and then recreate a PK/unique constraint so as not to get
>>> duplicate values in the future. We are struggling to do this.
>>>
>>> Teammates suggested doing this using the CTAS method, i.e. create a new
>>> table with the unique record set and then drop the main table. Something
>>> like below:
>>>
>>> create table TAB1_New
>>> as
>>> SELECT * from TAB1 A
>>> where ID in
>>> (select min(ID) from TAB1
>>> group by ID having count(ID)>=1 );
>>>
>>> But for the above to work faster, they suggested creating an index on
>>> the column on which the duplicate check will be performed, i.e. the ID
>>> column. Creating the index itself took ~2hrs+, and the index size now
>>> shows as ~116GB.
>>>
>>> Create index idx1 on TAB1(ID)
>>>
>>> We then ran the SELECT subquery part of the CTAS statement to see if it
>>> gives the correct unique record count. It ran for 2.5 hrs and then we
>>> killed it. Below is the plan for it.
>>>
>>> explain
>>> select min(ID) from TAB1 A group by ID having count(ID)>=1
>>>
>>> GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
>>>   Group Key: ID
>>>   Filter: (count(ID) >= 1)
>>>   -> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17
>>> rows=4883397120 width=14)
>>>
>>> I want to understand whether this can be done faster in any way. I am
>>> also worried that recreating the PK constraint/index after deleting the
>>> duplicates is going to run forever as well. Is there any way we can make
>>> these heavy operations faster on Postgres by giving the database more
>>> resources through some parameter setup, like a parallel hint etc.? We
>>> have the pg_hint_plan extension installed, but we are not seeing the
>>> parallel hint enforced when adding it to the query.
>>>
>>> In Oracle we have parallel hints, direct path read/write for faster
>>> read/write operations, parallel index scans, etc. Is anything similar
>>> available in Aurora Postgres to provide more horsepower and speed up
>>> these batch operations? And how can we monitor the progress of a
>>> running query?
>>> Just to note: it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU,
>>> 256GB RAM, PG version 15.4.
>>>
> Aurora is not PostgreSQL, so configurations might not work. Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately? Also, what's your work_mem
> and maintenance_work_mem?
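For a one-shot dedup at this scale, a common alternative to the min(CTID)
subquery quoted above is DISTINCT ON, which keeps one row per key in a
single pass and gives the planner a chance to use a parallel sequential
scan instead of the 4.8-billion-row index-only scan shown in the plan. A
minimal sketch, reusing TAB1/ID from the thread and assuming any one row
per ID is acceptable to keep; the session settings are illustrative, not
tuned recommendations:

SET max_parallel_workers_per_gather = 16;  -- illustrative: allow a wide parallel scan
SET work_mem = '2GB';                      -- illustrative: per sort node, per worker

-- Sketch only: assumes rows sharing an ID are full duplicates, so keeping
-- whichever row sorts first per ID is fine (the min(CTID) variant instead
-- pins the physically first row).
CREATE TABLE tab1_new AS
SELECT DISTINCT ON (id) *
FROM tab1
ORDER BY id;   -- DISTINCT ON requires the ORDER BY to lead with id

-- Recreate the constraint afterwards so new duplicates are rejected
-- (constraint name is arbitrary):
ALTER TABLE tab1_new ADD CONSTRAINT tab1_new_pk PRIMARY KEY (id);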
Thank you.

Below are the values of the default parameters in this instance:

SHOW max_worker_processes; - 128
show max_parallel_workers_per_gather; - 4
show max_parallel_workers; - 32
show max_parallel_maintenance_workers; - 2
show maintenance_work_mem; - 4155MB
show work_mem; - 8MB
show shared_buffers; - 22029684

When I ran the CTAS query and the index creation, I did not have a very
clear idea of how these parameters relate to each other or help each of
those operations, but I set a few of them as below before triggering those
in the same session:

set max_parallel_workers_per_gather=16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16 GB';

The instance has ~256 GB of memory in total, so how should I adjust/bump
these values when running heavy SELECT queries doing a large sequential
scan, OR a large index creation, OR any SELECT query with heavy
sorting/"order by" operations, OR heavy JOINs?

I have not analyzed the table manually, but seeing the autovacuum and
autoanalyze columns getting populated in pg_stat_user_tables, I assumed it
must be happening automatically. By the way, if we run "analyze tab1" on
this 1.5TB table, will that run long, and will any of the above parameters
help expedite that ANALYZE too, if I run it manually?

Regards,
Lok
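On the "how can we monitor progress of a running query" and ANALYZE
questions above: stock PostgreSQL 15 exposes progress views for index
builds and ANALYZE, which should also be available in Aurora's
Postgres-compatible 15.4. Note too that ANALYZE reads only a bounded
sample (roughly 300 x default_statistics_target rows), so even on a 1.5TB
table it is normally far cheaper than the index build was. A minimal
sketch against the standard catalog views:

-- A running CREATE INDEX, with its phase and blocks/tuples processed:
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;

-- A running manual ANALYZE, with sample blocks scanned so far:
SELECT pid, phase, sample_blks_scanned, sample_blks_total
FROM pg_stat_progress_analyze;

-- Any long-running statement: runtime and current wait event:
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;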