Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

Alexey Kondratov Fri, 19 Jun 2020 07:21:16 -0700

On 2020-06-19 03:59, Michael Paquier wrote:

On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
If the WAL segment for the specified restart_lsn (STOP_LSN of the backup) exists, then the function will create a physical replication slot and will
keep all the WAL segments required by the replica to catch up with the
primary. Otherwise, it returns error, which means that the required WAL segments have been already utilised, so we do need to take a new backup.
Without passing this newly added parameter
pg_create_physical_replication_slot() works as before.
What do you think about this?
I think that this was discussed in the past (perhaps one of the
threads related to WAL advancing actually?),

I have searched through the archives a bit and found one thread related to slot advancing [1]. It was dedicated to the problem of advancing slots which do not reserve WAL yet, if I get it correctly. Although it is somewhat related to the topic, it was a slightly different issue, IMO.


and this stuff is full of
holes when it comes to think about error handling with checkpoints
running in parallel, potentially doing recycling of segments you would
expect to be around based on your input value for restart_lsn *while*
pg_create_physical_replication_slot() is still running and
manipulating the on-disk slot information. I suspect that this also
breaks a couple of assumptions behind concurrent calls of the minimum
LSN calculated across slots when a caller sees fit to recompute the
thresholds (WAL senders mainly here, depending on the replication
activity).

These are the right concerns, but all of them should apply to pg_create_physical_replication_slot() + immediately_reserve == true in the same way, shouldn't they? I think so, since in that case we are doing a pretty similar thing: trying to reserve some WAL segment that may be concurrently deleted.

And this is exactly the reason why ReplicationSlotReserveWal() does it in several steps in a loop:

1. Creates a slot with some restart_lsn.
2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the WAL segment with this restart_lsn.
3. Checks that the required WAL segment is still there.
4. Repeats if this attempt to prevent WAL removal has failed.
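
To make the four steps above concrete, here is a minimal toy model in C. It is not PostgreSQL source: the types and functions (ToyWal, ToySlot, toy_checkpoint_recycle, toy_reserve_wal) and the 16MB segment size are assumptions made up for illustration, and the "racing checkpoint" is simulated inline between steps 1 and 2.

```c
#include <stdint.h>

/* Toy model only; every name here is hypothetical, not a PostgreSQL API. */
typedef uint64_t Lsn;

#define SEGMENT_SIZE ((Lsn) 16 * 1024 * 1024)	/* assumed segment size */

typedef struct
{
	Lsn		insert_lsn;			/* current WAL insert position */
	Lsn		oldest_kept_seg;	/* oldest segment number still on disk */
	Lsn		required_lsn;		/* oldest LSN required by slots, 0 = none */
} ToyWal;

typedef struct
{
	Lsn		restart_lsn;
} ToySlot;

/* Stand-in for a checkpoint: recycle segments below upto_seg, but never
 * past the segment holding required_lsn (if any is published). */
static void
toy_checkpoint_recycle(ToyWal *wal, Lsn upto_seg)
{
	Lsn		limit = upto_seg;

	if (wal->required_lsn != 0 && wal->required_lsn / SEGMENT_SIZE < limit)
		limit = wal->required_lsn / SEGMENT_SIZE;
	if (limit > wal->oldest_kept_seg)
		wal->oldest_kept_seg = limit;
}

/* Reserve WAL for a slot; returns the number of attempts it took. */
static int
toy_reserve_wal(ToyWal *wal, ToySlot *slot, int racing_checkpoints)
{
	int		attempts = 0;

	for (;;)
	{
		attempts++;

		/* Step 1: give the slot some recent restart_lsn. */
		slot->restart_lsn = wal->insert_lsn;

		/* Simulated race: a checkpoint runs before step 2 protects us. */
		if (racing_checkpoints > 0)
		{
			racing_checkpoints--;
			wal->insert_lsn += SEGMENT_SIZE;
			toy_checkpoint_recycle(wal, wal->insert_lsn / SEGMENT_SIZE);
		}

		/* Step 2: publish the requirement so later recycling stops here. */
		wal->required_lsn = slot->restart_lsn;

		/* Step 3: is the segment holding restart_lsn still on disk? */
		if (slot->restart_lsn / SEGMENT_SIZE >= wal->oldest_kept_seg)
			return attempts;

		/* Step 4: lost the race, retry with a newer restart_lsn. */
	}
}
```

In this model the retry always works because any recent restart_lsn is acceptable; once step 2 has published a requirement, later recycling cannot move past it.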

I guess that the only difference in the proposed scenario is that we do not have a chance for step 4, since we need one specific restart_lsn, not just any recent restart_lsn, i.e. in this case we have to:

1. Create a slot with the restart_lsn specified by the user.
2. Do ReplicationSlotsComputeRequiredLSN() to prevent WAL removal.
3. Check that the required WAL segment is still there and report an ERROR to the user if it is not.
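
The three steps above can be sketched the same way. Again, this is a toy model and not the code from the patch: ToyWalState and toy_reserve_wal_at are hypothetical names, the 16MB segment size is assumed, and clearing the reservation on failure is my assumption about how the error path would clean up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; every name here is hypothetical, not a PostgreSQL API. */
typedef uint64_t Lsn;

#define SEGMENT_SIZE ((Lsn) 16 * 1024 * 1024)	/* assumed segment size */

typedef struct
{
	Lsn		oldest_kept_seg;	/* oldest segment number still on disk */
	Lsn		required_lsn;		/* oldest LSN required by slots, 0 = none */
} ToyWalState;

/*
 * Try to reserve WAL starting at a caller-supplied LSN.  Returns true on
 * success; false stands in for reporting an ERROR to the user.
 */
static bool
toy_reserve_wal_at(ToyWalState *wal, Lsn requested_lsn, Lsn *slot_restart_lsn)
{
	/* Step 1: the slot gets exactly the LSN the user asked for. */
	*slot_restart_lsn = requested_lsn;

	/* Step 2: publish the requirement so later recycling stops here. */
	wal->required_lsn = requested_lsn;

	/* Step 3: check the segment; there is no retry with another LSN. */
	if (requested_lsn / SEGMENT_SIZE < wal->oldest_kept_seg)
	{
		/* Assumed cleanup: drop the reservation before failing. */
		*slot_restart_lsn = 0;
		wal->required_lsn = 0;
		return false;
	}

	return true;
}
```

The ordering is the point: the existence check in step 3 happens only after the requirement is published in step 2, so a segment that passes the check should not be recycled afterwards.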

I have eyeballed the attached patch and it looks like it does exactly the same, so issues with concurrent deletion are not obvious to me. Or, there should be the same issues for pg_create_physical_replication_slot() + immediately_reserve == true with the current master implementation.

[1] https://www.postgresql.org/message-id/flat/20180626071305.GH31353%40paquier.xyz



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
