On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson <ronljohnso...@gmail.com> wrote:

> On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath...@gmail.com> wrote:
>
>> Apologies, one correction: the query is like below, i.e. the filter will
>> be on ctid, which I believe is the equivalent of ROWID in Oracle, so we
>> will not need the index on the ID column.
>>
>> But it still runs for a long time, so is there any other way to make the
>> duplicate removal faster?
>>
>> Also wondering: could the index creation, which took ~2.5hrs+, have been
>> made faster by giving it more DB resources through some session-level
>> parameter settings, e.g. something like the sketch below?
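>>
>> For example, would session settings like the following, set before the
>> CREATE INDEX, have helped? Just a sketch; the values are for illustration
>> only:
>>
>> set maintenance_work_mem = '8GB';          -- sort memory for the index build
>> set max_parallel_maintenance_workers = 8;  -- allow a parallel btree build
>> create index idx1 on TAB1 (ID);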
>>
>> create table TAB1_New
>> as
>> SELECT  * from TAB1 A
>> where CTID in
>>       (select min(CTID) from TAB1
>>       group by ID having count(ID)>=1 );
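>>
>> One alternative we could try instead of the IN subquery (a sketch,
>> assuming it is fine to keep one arbitrary row per ID):
>>
>> create table TAB1_New
>> as
>> SELECT DISTINCT ON (ID) *
>> from TAB1
>> order by ID;
>>
>> Though this still has to scan and sort the whole table.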
>>
>>
>> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath...@gmail.com> wrote:
>>
>>> Hello All,
>>> A non-partitioned table has ~4.8 billion rows, ~1.4TB of data, a row
>>> size of ~313 bytes, and ~127 columns. Approximately 1 billion+ duplicate
>>> rows have been inserted into it, and we want to remove the duplicate
>>> data and then recreate a PK/unique constraint so that duplicate values
>>> cannot occur in the future. We are struggling to do this.
>>>
>>> Teammates suggested doing this with the CTAS method, i.e. create a new
>>> table with the unique record set and then drop the main table. Something
>>> like below:
>>>
>>> create table TAB1_New
>>> as
>>> SELECT  * from TAB1 A
>>> where ID in
>>>       (select min(ID) from TAB1
>>>       group by ID having count(ID)>=1 );
>>>
>>> But for the above to run faster, they suggested creating an index on
>>> the column used for the duplicate check, i.e. the ID column. Creating
>>> the index itself took ~2hrs+, and the index size now shows as ~116GB.
>>>
>>>
>>> Create index idx1 on TAB1(ID);
>>> We then ran the SELECT subquery part of the CTAS statement to check
>>> whether it gives the correct unique record count. It ran for 2.5 hrs and
>>> then we killed it. Below is the plan for it.
>>>
>>> explain
>>> select min(ID) from TAB1 A group by ID having count(ID)>=1
>>>
>>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>>   Group Key: ID
>>>   Filter: (count(ID) >= 1)
>>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
>>> rows=4883397120 width=14)
>>>
>>> I want to understand whether this can be done faster in any way. I am
>>> also worried that recreating the PK constraint/index after deleting the
>>> duplicates is going to run forever. Is there any way to make these heavy
>>> operations faster on Postgres by giving them more database resources
>>> through some parameter setup, like a parallel hint etc.? We have the
>>> pg_hint_plan extension installed, but the parallel hint does not seem to
>>> be enforced when we add it to the query.
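>>>
>>> The form we tried was along these lines (assuming we have the
>>> pg_hint_plan syntax right; the hint names the table alias):
>>>
>>> /*+ Parallel(a 16 hard) */
>>> explain select min(ID) from TAB1 A group by ID having count(ID)>=1;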
>>>
>>> In Oracle we have parallel hints, direct-path read/write for faster
>>> read/write operations, parallel index scans, etc. Is anything similar
>>> available in Aurora Postgres to provide more horsepower and speed up
>>> batch operations? And how can we monitor the progress of a running
>>> query?
>>> Just to note: it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU,
>>> 256GB RAM, PG version 15.4.
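>>>
>>> For monitoring, is polling the statistics and progress views the right
>>> approach? Something like this is what we had in mind (a sketch):
>>>
>>> select pid, state, wait_event_type, wait_event, query
>>> from pg_stat_activity where state = 'active';
>>>
>>> select * from pg_stat_progress_create_index;  -- while an index build runs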
>>>
>>
> Aurora is not PostgreSQL, so configurations might not work.  Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately?  Also, what's your work_mem
> and maintenance_work_mem?
>

Thank you.

Below are the values of the default parameters in this instance:

show max_worker_processes;             -- 128
show max_parallel_workers_per_gather;  -- 4
show max_parallel_workers;             -- 32
show max_parallel_maintenance_workers; -- 2
show maintenance_work_mem;             -- 4155MB
show work_mem;                         -- 8MB
show shared_buffers;                   -- 22029684

When I ran the CTAS queries and the index creation, I did not have a very
clear idea of how these parameters relate to each other or how each would
help those operations, but I set a few of them as below, in the same
session, before triggering the operations.

set max_parallel_workers_per_gather = 16;
set max_parallel_maintenance_workers = 16;
set maintenance_work_mem = '16GB';

The instance has ~256 GB of total memory, so how should I adjust/bump these
values when running heavy SELECT queries that do large sequential scans, a
large index creation, SELECT queries with heavy sorting/"order by"
operations, or heavy JOINs?

I have not analyzed the table manually, but seeing the autovacuum and
autoanalyze columns (last_autovacuum, last_autoanalyze) populated in
pg_stat_user_tables, I assumed it is being done automatically.

By the way, if we run "ANALYZE tab1" on this ~1.5TB table, will it run for
a long time, and will any of the above parameters help expedite the ANALYZE
operation if I run it manually?
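
If so, I was thinking of something like the following (a sketch), watching
the progress view from another session while it runs:

analyze verbose tab1;
-- from another session:
select * from pg_stat_progress_analyze;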

Regards
Lok
