Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

2024-09-02 Thread Shaheed Haque
Hi Muhammad,

On Mon, 2 Sep 2024, 07:08 Muhammad Ikram wrote:

> Hi Shaheed,
>
> Maybe these considerations could help you or give any hint to the problem ?
>
>
> Check whether wal_receiver_timeout being set to 0 could be causing
> issues, such as network problems not being detected quickly enough.
> Consider re-evaluating this setting if you see connection issues.
>
> If you notice that some data is missing on the subscriber, you could
> increase max_slot_wal_keep_size on the publisher so that WAL segments are
> not deleted until they have been applied on the subscriber.
>
> Do you have the flexibility to increase max_worker_processes,
> max_logical_replication_workers, work_mem and maintenance_work_mem on the
> subscriber (in case the bottleneck is on the subscriber side)?
>
> If there's significant lag, consider whether it might be more efficient to
> drop the subscription and re-initialize it from scratch using a new base
> backup, depending on the data volume and how long it might take for the
> existing replication to catch up.
>

Thanks for the kind hints, I'll certainly look into those.

My main interest, however, was the "visibility" question, i.e. getting an
understanding of the gap between the two ends of a replication slot,
ideally in human terms (e.g. tables x records).

I understand the difficulty of producing a meaningful metric that spans
two (or more) systems, but let's be honest: trying to diagnose which knobs
to tweak (whether in the application, PG, the OS or the network) is
basically black magic when all we really have is a pair of opaque LSNs.
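
To be concrete, the closest thing I know of to a built-in answer is a
byte-oriented view on the publisher, something like this sketch (standard
views, but it still says nothing about tables or rows):

-- Per-slot byte gap, plus the time-based lag columns where the slot
-- has an active walsender.
SELECT s.slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      s.confirmed_flush_lsn)) AS bytes_behind,
       r.write_lag, r.flush_lag, r.replay_lag
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
WHERE s.slot_type = 'logical';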




>
>  Regards,
> Muhammad Ikram
>
>
> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque wrote:
>
>> Since nobody more knowledgeable has replied...
>>
>> I'm very interested in this area and still surprised that there is no
>> official/convenient/standard way to approach this (see
>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>> ).
>>
>> Based partly on that thread, I ended up with a script that connects to
>> both ends of the replication, and basically loops while comparing the
>> counts in each table.
>>
>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz wrote:
>>
>>> I've got two Postgres 13 databases on AWS RDS.
>>>
>>>- One is a master, the other a slave using logical replication.
>>>- Replication has fallen behind by about 350 GB.
>>>- The slave was maxed out in terms of CPU for the past four days
>>>because of some jobs that were ongoing so I'm not sure what logical
>>>replication was able to replicate during that time.
>>>- I killed those jobs and now CPU on the master and slave are both
>>>low.
>>>- I look at the subscriber via `select * from pg_stat_subscription;`
>>>and see that latest_end_lsn is advancing albeit very slowly.
>>>- The publisher says write/flush/replay lags are all 13 minutes
>>>behind but it's been like that for most of the day.
>>>- I see no errors in the logs on either the publisher or subscriber
>>>outside of some simple SQL errors that users have been making.
>>>- CloudWatch reports low CPU utilization, low I/O, and low network.
>>>
>>>
>>>
>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>> to 0 because I had replication issues, and that helped things. I
>>> wish I had *some* visibility here to get any kind of confidence that
>>> it's going to pull through, but other than these lsn values and database
>>> logs, I'm not sure what to check.
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> mj
>>>
>>
>
> --
> Muhammad Ikram
>
>


Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

2024-09-02 Thread Muhammad Ikram
Hi Shaheed,
I think you must have already analyzed the output of queries on
pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. One
query I could find is:

SELECT pg_size_pretty(pg_wal_lsn_diff('<lsn_1>', '<lsn_2>'));
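
For example, to see the subscriber's positions per subscription, a sketch
against the standard pg_stat_subscription view:

-- Last WAL location received, and last location reported as applied.
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;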

As a side note, if you want to see what has been applied on the subscriber
vs. what exists on the publisher, here is something from my previous
experience: we used to have a data-validation tool for checking tables/rows
across publisher and subscriber. We also used pg_dump in another tool that
was meant for making copies of schemas.

Regards,
Muhammad Ikram





Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

2024-09-02 Thread Shaheed Haque
Hi Muhammad,

On Mon, 2 Sep 2024, 09:45 Muhammad Ikram wrote:

> Hi Shaheed,
> I think you must have already analyzed the output of queries on
> pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc.
> One query I could find is:
>
> SELECT pg_size_pretty(pg_wal_lsn_diff('<lsn_1>', '<lsn_2>'));
>

Yes. My point is that it is hard to go from byte numbers to table entries.

> As a side note, if you want to see what has been applied on the subscriber
> vs. what exists on the publisher, here is something from my previous
> experience: we used to have a data-validation tool for checking tables/rows
> across publisher and subscriber.
>

Ack. That's pretty much what I had to build.

> We also used pg_dump in another tool that was meant for making copies of
> schemas.
>

I'm somewhat fortunate to have a simple use case where all I am doing is a
copy of the "old" deployment to a "new" deployment. When the two ends are
in close sync, I can freeze traffic to the old deployment, pause for any
final catch-up, and then run a Django migration on the new one before
switching it on (thereby minimising the downtime for the app).

What I found by just looking at LSNs was that the two ends were close but
NOT the same. Once I built the tool, I was able to see which tables were
still in play, and saw that some previously overlooked background timers
were expiring, causing the activity.

Net result: the LSNs can tell you if you are not in sync, but not the
reason why. (Again, I understand that row counts worked for me, but might
not work for others).
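
For the record, the kernel of what I built is roughly this psql sketch
(run it against both ends and diff the output; exact counts are slow on
big tables, so it only really works on quiesced or modest-sized systems):

-- Generate and run one count(*) per user table via psql's \gexec.
SELECT format('SELECT %L AS table_name, count(*) FROM %I.%I',
              schemaname || '.' || relname, schemaname, relname)
FROM pg_stat_user_tables
ORDER BY schemaname, relname \gexec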

Thanks for your kind help and pointers!



Re: Partitioning and unique key

2024-09-02 Thread Laurenz Albe
On Sun, 2024-09-01 at 01:32 +0530, veem v wrote:
> Due to Postgres limitations we are unable to have this unique constraint
> or primary key on the transaction_id column alone; we have to include
> transaction_timestamp with it as a composite key. So I want to understand
> from experts whether there is any possible way to satisfy both
> partitioning on the transaction_timestamp column and a unique key or PK
> on transaction_id only.

No, you cannot have both.

Usually the solution is to *not* create a primary key on the partitioned table
and instead create a primary key on each partition.

That won't guarantee global uniqueness (and there is no way to do that), but it
goes a long way by ensuring that the column is unique within each partition.
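
For illustration, a sketch of that arrangement (made-up names):

-- No primary key on the partitioned table itself.
CREATE TABLE tx (
    transaction_id        bigint NOT NULL,
    transaction_timestamp timestamptz NOT NULL
) PARTITION BY RANGE (transaction_timestamp);

CREATE TABLE tx_2024_09 PARTITION OF tx
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');

-- Allowed because the parent has no primary key; enforces uniqueness
-- of transaction_id within this partition only.
ALTER TABLE tx_2024_09 ADD PRIMARY KEY (transaction_id);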

Yours,
Laurenz Albe




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:58 PM Peter Geoghegan wrote:
> On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov wrote:
> > If it helps, without creating index on id column, the numbers will be
> > much closer:
>
> Yes, avoiding all index vacuuming seems useful. It makes the test case
> cleaner, since we don't have to think about the variability from the
> TIDStore work (and from index vacuuming more generally).

It just occurred to me that earlier versions don't have the
HEAP_PAGE_PRUNE_MARK_UNUSED_NOW optimization added by commit
c120550edb. Postgres 17 does have that optimization, though, so it
should easily be able to write far fewer WAL records than earlier
versions. And yet your revised no-indexes test case seems to show that
Postgres 17 is doing slightly worse by that measure (and by others).

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Heikki Linnakangas

On 03/09/2024 00:11, Heikki Linnakangas wrote:
> On 03/09/2024 00:01, Peter Geoghegan wrote:
>> On Mon, Sep 2, 2024 at 4:58 PM Heikki Linnakangas wrote:
>>> Do you have any non-default settings? "select name,
>>> current_setting(name), source  from pg_settings where setting <>
>>> boot_val;" would show that.
>>
>> What about page checksums?
>>
>> One simple explanation is that we're writing extra FPIs to set hint
>> bits. But that explanation only works if you assume that page-level
>> checksums are in use (or that wal_log_hints is turned on).
>
> Hmm, yeah, that might be it. With checksums enabled, I see ~120k WAL
> records, vs ~90k without checksums. But there's no difference between
> v16 and master.

Looking at the pg_waldump output from this test:

...
rmgr: XLOG        len (rec/tot):     49/  8209, tx:          0, lsn:
0/FE052AA8, prev 0/FE0528A8, desc: FPI_FOR_HINT , blkref #0: rel
1663/5/16396 blk 73 FPW

rmgr: Heap2       len (rec/tot):    507/   507, tx:          0, lsn:
0/FE054AD8, prev 0/FE052AA8, desc: PRUNE snapshotConflictHorizon: 754, 
nredirected: 0, ndead: 226, nunused: 0, redirected: [], dead: [1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 
121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 
137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 
153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 
185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 
201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 
217, 218, 219, 220, 221, 222, 223, 224, 225, 226], unused: [], blkref #0: rel 
1663/5/16396 blk 73
rmgr: XLOG        len (rec/tot):     49/  8209, tx:          0, lsn:
0/FE054CD8, prev 0/FE054AD8, desc: FPI_FOR_HINT , blkref #0: rel 1663/5/16396
blk 74 FPW
rmgr: Heap2       len (rec/tot):    507/   507, tx:          0, lsn:
0/FE056D08, prev 0/FE054CD8, desc: PRUNE snapshotConflictHorizon: 754, 
nredirected: 0, ndead: 226, nunused: 0, redirected: [], dead: [1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 
121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 
137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 
153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 
185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 
201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 
217, 218, 219, 220, 221, 222, 223, 224, 225, 226], unused: [], blkref #0: rel 
1663/5/16396 blk 74
...


The pattern of WAL records with checksums enabled is silly: for each
page, we first write an FPI record, and immediately after that a PRUNE
record that removes all the tuples on it, leaving the page empty.


This is the same with v16 and v17, but we certainly left money on the 
table by not folding that FPI into the VACUUM/PRUNE record.


--
Heikki Linnakangas
Neon (https://neon.tech)



Re: Partitioning and unique key

2024-09-02 Thread veem v
On Mon, 2 Sept 2024 at 19:13, Laurenz Albe wrote:

> [...]

Thank you so much.

So it means that in our case, with the existing table-level PK on columns
(txn_id, txn_timestamp), we should drop that and create a unique index at
each partition level, and likewise map the foreign keys to the parent
table's partitions. In that case, can we follow it as best practice to not
have primary keys defined at the table level at all, but at the partition
level only, or is there any downside to that too? Please suggest.

Also, what I see is that it will make the data-load query fail: it uses
"insert on conflict", which requires a primary key on both columns at the
table level. And the partition maintenance job, which uses the partman
extension, relies on a template table that in turn uses table-level
properties for creating new partitions, so the new partitions will not
have these unique indexes created, because the unique-index property is at
the partition level, not the table level. Can you share your thoughts on
these?


Re: PG17 optimizations to vacuum

2024-09-02 Thread Melanie Plageman
On Sun, Sep 1, 2024 at 6:00 PM Peter Geoghegan wrote:
>
> On Sun, Sep 1, 2024 at 5:44 PM Pavel Luzanov wrote:
> > I see a perfectly working TID-store optimization.
> > With reduced maintenance_work_mem it used only one 'vacuuming indexes'
> > phase instead of 21 in v16.
> > But I also expected to see a reduction in the number of WAL records
> > and the total size of the WAL. Instead, WAL numbers have significantly
> > degraded.
> >
> > What am I doing wrong?

I'll investigate more tomorrow, but based on my initial investigation,
there appears to be some interaction related to how much of the
relation is in shared buffers after creating the table and updating
it. If you set shared_buffers sufficiently high and prewarm the table
after the update, master has fewer WAL records reported by vacuum
verbose.

- Melanie




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 1:29 PM Melanie Plageman wrote:
> I'll investigate more tomorrow, but based on my initial investigation,
> there appears to be some interaction related to how much of the
> relation is in shared buffers after creating the table and updating
> it. If you set shared_buffers sufficiently high and prewarm the table
> after the update, master has fewer WAL records reported by vacuum
> verbose.

Fewer of what specific kind of WAL record?

All of the details about useful work done by VACUUM were identical
across versions. It was only the details related to WAL, buffers, and
CPU time that changed.

Perhaps I'm not thinking of something obvious. Maybe it's extra
VISIBILITY records? But I'd expect the number of VISIBILITY records to
match the number of pages frozen, given these particulars. VACUUM
VERBOSE at least shows that that hasn't changed.

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Melanie Plageman
On Mon, Sep 2, 2024 at 1:47 PM Peter Geoghegan wrote:
>
> On Mon, Sep 2, 2024 at 1:29 PM Melanie Plageman wrote:
> > I'll investigate more tomorrow, but based on my initial investigation,
> > there appears to be some interaction related to how much of the
> > relation is in shared buffers after creating the table and updating
> > it. If you set shared_buffers sufficiently high and prewarm the table
> > after the update, master has fewer WAL records reported by vacuum
> > verbose.
>
> Fewer of what specific kind of WAL record?

I would have expected to see no freeze records (since they no longer
exist) and the same number of prune records. However, the overall
number of records that I get for 16 and master is pretty similar. For
some reason I stopped being able to reproduce Pavel's case. I'll work
more on it tomorrow.

This is roughly what I get for records written by vacuum. Note that I
prefixed VACUUM with BTREE on master to indicate that those records are
from index vacuuming. By default, the desc routine for records emitted by
index vacuuming prints just VACUUM -- perhaps it would be better to prefix
it.

Note that these add up to almost the same thing. I don't know yet why the
number of PRUNE_VACUUM_SCAN records is different from PRUNE on 16.
PRUNE_VACUUM_SCAN and PRUNE + FREEZE_PAGE on 16 are similar. So, there
must be pages that don't have items being pruned but are being frozen.
I'll need to investigate further.

master
--
 PRUNE_ON_ACCESS      |     6
 PRUNE_VACUUM_SCAN    | 30974
 PRUNE_VACUUM_CLEANUP | 14162
 BTREE_VACUUM         | 19127


16
--
 PRUNE       | 15504
 FREEZE_PAGE | 13257
 VACUUM      | 34527


- Melanie




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 3:23 PM Melanie Plageman wrote:
> This is roughly what I get for records written by vacuum. Note that I
> prefixed VACUUM with BTREE on master to indicate that those records are
> from index vacuuming. By default, the desc routine for records emitted by
> index vacuuming prints just VACUUM -- perhaps it would be better to
> prefix it.
>
> Note that these add up to almost the same thing. I don't know yet why the
> number of PRUNE_VACUUM_SCAN records is different from PRUNE on 16.

That is indeed surprising, given that Pavel's VACUUM VERBOSE output
indicates that the number of heap tuples deleted is identical across
versions. The output also strongly suggests that the same heap pages
are pruned on both versions, since the "index scan needed: " line is
also identical across versions.

Might it be that the extra PRUNE_VACUUM_SCAN records originated in
pages that only contained existing LP_UNUSED items when scanned by
VACUUM?

-- 
Peter Geoghegan




Re: Partitioning and unique key

2024-09-02 Thread Laurenz Albe
On Mon, 2024-09-02 at 21:39 +0530, veem v wrote:
> [...]
>
> So it means that in our case, with the existing table-level PK on columns
> (txn_id, txn_timestamp), we should drop that and create a unique index at
> each partition level, and likewise map the foreign keys to the parent
> table's partitions. In that case, can we follow it as best practice to not
> have primary keys defined at the table level at all, but at the partition
> level only, or is there any downside to that too? Please suggest.

You can keep the primary key defined on both columns if it is good enough
for you. But it will give you lower guarantees of uniqueness: with that
primary key, there could be two rows with a different timestamp, but the
same "txn_id", and these two rows could be in the same partition...

Also, if you need a foreign key pointing *to* the partitioned table, you
cannot do without a primary key.  But I recommend that you do *not* define
such foreign keys: they will make it more difficult to detach a partition.

If you partition two tables in the same way, you can use foreign keys
between the partitions instead of foreign keys between the partitioned
tables.  Such foreign keys won't be a problem.
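
For illustration (made-up names, both tables partitioned identically by
month), a sketch:

-- FK between matching partitions; assumes tx_2024_09 has a unique
-- constraint (e.g. a per-partition primary key) on transaction_id.
ALTER TABLE orders_2024_09
    ADD CONSTRAINT orders_2024_09_tx_fk
    FOREIGN KEY (transaction_id) REFERENCES tx_2024_09 (transaction_id);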

> Also, what I see is that it will make the data-load query fail: it uses
> "insert on conflict", which requires a primary key on both columns at the
> table level.

Yes, that is true.  A disadvantage of not having a unique constraint on the 
partitioned table.

> And the partition maintenance job, which uses the partman extension,
> relies on a template table that in turn uses table-level properties for
> creating new partitions, so the new partitions will not have these unique
> indexes created, because the unique-index property is at the partition
> level, not the table level. Can you share your thoughts on these?

Don't use partman.  Or if you do, create the primary key yourself, after
partman has created the partition.
I wouldn't let the limitations of a tool govern my design choices.

Yours,
Laurenz Albe




Re: PG17 optimizations to vacuum

2024-09-02 Thread Pavel Luzanov

On 02.09.2024 22:23, Melanie Plageman wrote:
> For some reason I stopped being able to reproduce Pavel's case.

I repeated the test on another computer, but compared master with v15.
The results are the same. The test can be simplified as follows:

CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
CREATE INDEX t_id ON t(id);
VACUUM FREEZE t;
UPDATE t SET id = id + 1;
VACUUM FREEZE VERBOSE t;

My results (only line with WAL info from the last VACUUM command).

master:
WAL usage: 119583 records, 37231 full page images, 272631468 bytes

v15:
WAL usage: 96565 records, 47647 full page images, 217144602 bytes


If it helps, without creating index on id column, the numbers will be
much closer:

master:
WAL usage: 78502 records, 22090 full page images, 196215494 bytes

v15:
WAL usage: 77437 records, 30872 full page images, 152080268 bytes

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com


Re: PG17 optimizations to vacuum

2024-09-02 Thread Heikki Linnakangas

On 02/09/2024 23:35, Pavel Luzanov wrote:
> On 02.09.2024 22:23, Melanie Plageman wrote:
>> For some reason I stopped being able to reproduce Pavel's case.

I also cannot reproduce this.

> I repeated the test on another computer, but compared master with v15.
> The results are the same. The test can be simplified as follows:
>
> CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
> INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
> CREATE INDEX t_id ON t(id);
> VACUUM FREEZE t;
> UPDATE t SET id = id + 1;
> VACUUM FREEZE VERBOSE t;
>
> My results (only line with WAL info from the last VACUUM command).
>
> master:
> WAL usage: 119583 records, 37231 full page images, 272631468 bytes
>
> v15:
> WAL usage: 96565 records, 47647 full page images, 217144602 bytes

Can you dump the stats with pg_waldump please. Something like:

CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
CREATE INDEX t_id ON t(id);
SET maintenance_work_mem = '1MB';
UPDATE t SET id = id + 1;
select pg_current_wal_insert_lsn();   -- <<< PRINT WAL POS BEFORE VACUUM
VACUUM FREEZE VERBOSE t;

And then:

pg_waldump -p data/pg_wal/ -s 1/F4474498 --stats=record

where "1/F4474498" is the position printed by the "SELECT 
pg_current_wal_insert_lsn()" above.



Do you have any non-default settings? "select name, 
current_setting(name), source  from pg_settings where setting <> 
boot_val;" would show that.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:35 PM Pavel Luzanov wrote:
> If it helps, without creating index on id column, the numbers will be
> much closer:

Yes, avoiding all index vacuuming seems useful. It makes the test case
cleaner, since we don't have to think about the variability from the
TIDStore work (and from index vacuuming more generally).

> master:
> WAL usage: 78502 records, 22090 full page images, 196215494 bytes
>
> v15:
> WAL usage: 77437 records, 30872 full page images, 152080268 bytes

To be clear, I wouldn't be all that surprised if some issue with
alignment padding and/or the new WAL format made master write more
bytes of WAL total than on earlier versions. Small inefficiencies like
that were always a possibility. That's not the problem that I'm
principally concerned about right now.

I'm most concerned about the simple fact that we're writing more
individual WAL records than on previous versions, despite the fact
that VACUUM does exactly the same amount of useful pruning and
freezing work across versions. How could a patch that merged together
the previous PRUNE and VACUUM WAL record types into a single record
type actually cause an increase in the number of WAL records written?
I'd have thought that that was simply impossible.

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Peter Geoghegan
On Mon, Sep 2, 2024 at 4:58 PM Heikki Linnakangas wrote:
> Do you have any non-default settings? "select name,
> current_setting(name), source  from pg_settings where setting <>
> boot_val;" would show that.

What about page checksums?

One simple explanation is that we're writing extra FPIs to set hint
bits. But that explanation only works if you assume that page-level
checksums are in use (or that wal_log_hints is turned on).
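
(For reference, a quick way to check on the test instance:

SHOW data_checksums;   -- 'on' if the cluster was initdb'd with checksums
SHOW wal_log_hints;    -- 'on' also WAL-logs hint-bit-only page changes

Either being 'on' makes the first hint-bit-only change to a page after a
checkpoint emit an FPI.)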

-- 
Peter Geoghegan




Re: PG17 optimizations to vacuum

2024-09-02 Thread Heikki Linnakangas

On 03/09/2024 00:01, Peter Geoghegan wrote:
> On Mon, Sep 2, 2024 at 4:58 PM Heikki Linnakangas wrote:
>> Do you have any non-default settings? "select name,
>> current_setting(name), source  from pg_settings where setting <>
>> boot_val;" would show that.
>
> What about page checksums?
>
> One simple explanation is that we're writing extra FPIs to set hint
> bits. But that explanation only works if you assume that page-level
> checksums are in use (or that wal_log_hints is turned on).

Hmm, yeah, that might be it. With checksums enabled, I see ~120k WAL
records, vs ~90k without checksums. But there's no difference between
v16 and master.


Pavel, did you test v17 with checksums enabled and v16 with checksums 
disabled, by any chance?


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Partitioning and unique key

2024-09-02 Thread veem v
On Tue, 3 Sept 2024 at 01:14, Laurenz Albe wrote:

> [...]
Thank you so much.
As you rightly said, "they will make it more difficult to detach a
partition": we are really seeing longer times when detaching parent table
partitions; it sometimes runs forever. So do you mean it's because we have
the primary key defined at the table level, or because we have the FK
defined at the table level (for multiple child tables which are also
partitioned)?

We were thinking it's because we have the FK defined at the table level,
so we were planning to move the FK to the partition level. But as you just
pointed out, will even keeping the PK at the table level also make the
detach partition slow? I understand that, for detaching partitions, it may
be scanning the whole child table because of the FK defined at the table
level, but I am unable to understand how the table-level PK impacts
detaching a partition from the parent here.

My understanding is that a PK can only be created at the table level, not
at the partition level; at the partition level we can only have a "unique
index" defined. Correct me if my understanding is wrong.


Re: Partitioning and unique key

2024-09-02 Thread Laurenz Albe
On Tue, 2024-09-03 at 10:39 +0530, veem v wrote:
> As you rightly said, "they will make it more difficult to detach a
> partition": we are really seeing longer times when detaching parent table
> partitions; it sometimes runs forever. So do you mean it's because we have
> the primary key defined at the table level, or because we have the FK
> defined at the table level (for multiple child tables which are also
> partitioned)?

I'd say it is because of the foreign key.

If you have a foreign key that points to a partitioned table, and you detach
a partition, PostgreSQL has to verify that that won't violate the constraint,
so it has to scan the tables, which will take time if the partitions are large.
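
For example (made-up names), the operation where that verification bites:

-- With a foreign key pointing to tx, this must scan the referencing
-- table(s) to verify the constraint still holds, hence the slowness.
ALTER TABLE tx DETACH PARTITION tx_2024_09;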

> We were thinking it's because we have the FK defined at the table level,
> so we were planning to move the FK to the partition level.

Good move.

> But as you just pointed out, will even keeping the PK at the table level
> also make the detach partition slow? I understand that, for detaching
> partitions, it may be scanning the whole child table because of the FK
> defined at the table level, but I am unable to understand how the
> table-level PK impacts detaching a partition from the parent here.

No, a primary key on the partitioned table won't be a problem for performance.

My concern was: if what you really want is for "id" to be unique, how does
a primary key on (id, some_timestamp) benefit you?

> My understanding is that a PK can only be created at the table level, not
> at the partition level; at the partition level we can only have a "unique
> index" defined. Correct me if my understanding is wrong.

No, you can define a primary key on the partition.  That is, if you have no
primary key on the partitioned table.  A primary key on the partitioned table
is a primary key on each partition, and a table can only have a single primary
key, so adding another primary key on the partition would cause an error.
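
A sketch of that error, with made-up names:

CREATE TABLE p (
    id bigint,
    ts timestamptz,
    PRIMARY KEY (id, ts)          -- must include the partition key
) PARTITION BY RANGE (ts);

CREATE TABLE p_2024 PARTITION OF p
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- The partition already has the inherited primary key, so:
ALTER TABLE p_2024 ADD PRIMARY KEY (id);
-- ERROR:  multiple primary keys for table "p_2024" are not allowed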

Yours,
Laurenz Albe