Hi Everyone,

I am working on the initial schema sync for logical replication. Currently, users have to create the schema manually on the subscriber side. The aim of this feature is to add an option to CREATE SUBSCRIPTION so that the schema sync can happen automatically. I am sharing the design doc below, but there are some corner cases where the design does not work. Please share your opinion on whether the design can be improved so that we can get rid of these corner cases. This design is loosely based on Pglogical.
DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)

SQL Changes:-
CREATE SUBSCRIPTION subscription_name
CONNECTION 'conninfo'
PUBLICATION publication_name [, ...]
[ WITH ( subscription_parameter [= value] [, ... ] ) ]
sync_initial_schema (enum) will be added to subscription_parameter.
It can have 3 values:-
TABLES, ALL, NONE (default)
With ALL, everything will be synced, including global objects.

Restrictions :- sync_initial_schema=ALL can only be used for publications created with FOR ALL TABLES.
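For example, assuming pub1 was created with FOR ALL TABLES, a subscription using the proposed option could look like this (names are placeholders):

CREATE SUBSCRIPTION sub1
CONNECTION 'host=publisher dbname=postgres'
PUBLICATION pub1
WITH (sync_initial_schema = 'ALL');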

Design:-

Publisher :-
The publisher has to implement `SHOW CREATE TABLE_NAME`; this table definition will be used by the subscriber to create the exact schema of the table on the subscriber side. One alternative is to do it on the subscriber side itself: we could create a server-side function similar to psql's describeOneTableDetails and call it from the subscriber. We also need to maintain the same ownership as on the publisher.
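A rough sketch of the proposed command (the command and its output format are part of this proposal, not existing PostgreSQL syntax; table and role names are made up):

postgres=# SHOW CREATE TABLE public.t1;
CREATE TABLE public.t1 (a integer PRIMARY KEY, b text);
ALTER TABLE public.t1 OWNER TO some_owner;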

The publisher should also have publication of DDL commands turned on.

Subscriber :-

1. In CreateSubscription(), when we create the replication slot (walrcv_create_slot()), we should use CRS_EXPORT_SNAPSHOT so that we can use this snapshot later in pg_dump.
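For reference, CRS_EXPORT_SNAPSHOT maps to the EXPORT_SNAPSHOT option of the walsender command, whose result set includes a snapshot name we can later hand to pg_dump (slot name is a placeholder):

CREATE_REPLICATION_SLOT "sub1_slot" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- the result set contains a snapshot_name such as '00000003-00000002-1',
-- which is what we pass to pg_dump via --snapshot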

2. Now we can call pg_dump with the above snapshot from CreateSubscription. This is inside the opts.connect && opts.create_slot if statement. If we fail in this step we have to drop the replication slot and create a new one, because we need a snapshot and creating a replication slot is the way to get one. The reason for running pg_dump with the above snapshot is that we don't want to execute the DDLs present in the WAL twice. With the above snapshot we get the state of the database as it was at the replication slot's origin, and any changes after the snapshot will be in the WAL.

We will save the pg_dump output into a file (custom archive format), so the pg_dump invocation will be similar to:

pg_dump 'conninfo' --schema-only --snapshot=xyz -Fc --file=initSchema

If sync_initial_schema=TABLES we don't have to call pg_dump/restore at all; the TableSync process will take care of it.

3. If we have to sync global objects we also need to call pg_dumpall --globals-only. But pg_dumpall does not support the --snapshot option, so if the user creates a new global object between the creation of the replication slot and the pg_dumpall run, that global object will be created 2 times on the subscriber (once by the restore and once via the replicated DDL), which will error out the applier process.
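The invocation would look something like this (output file name is a placeholder):

pg_dumpall -d 'conninfo' --globals-only --file=initGlobals.sql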

4. walrcv_disconnect should be called only after pg_dump has finished, otherwise the snapshot will no longer be valid.

5. Users with only the REPLICATION role cannot run pg_dump, so the replication user has to be a superuser. This is a major problem.

postgres=# create role s4 WITH LOGIN REPLICATION;
CREATE ROLE

$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR:  permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE

6. The pg_subscription_rel column srsubstate will have one more state, SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled we will set the table state to 'c'.

The above 6 steps will be done even if the subscription is not enabled, as long as connect is true.
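For illustration, once pg_restore has created a table but it is not yet synced, pg_subscription_rel would show the proposed new state (hypothetical output):

postgres=# SELECT srrelid::regclass, srsubstate FROM pg_subscription_rel;
 srrelid | srsubstate
---------+------------
 t1      | c
(1 row)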

7. The leader applier process should check if the initSchema file exists; if it does, it should call pg_restore. We are not using the pre-data and post-data sections separately as is done in Pglogical, because post-data works on tables that already have data, but we will fill the data into the tables at a later stage. pg_restore can be called like this:

pg_restore -d 'conninfo' -1 initSchema

The -1 option will execute every command inside one transaction; if there is any error, everything will be rolled back.
pg_restore should be called quite early in the applier process code, before any tablesync process can be created.
Instead of checking if the file exists, maybe the pg_subscription table can be extended with a column SyncInitialSchema, and the applier process would check SyncInitialSchema == SYNC_PENDING, as sketched below.
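A rough illustration of that check (the column name and state value are hypothetical, mirroring the naming above):

postgres=# SELECT subname, subsyncinitialschema FROM pg_subscription;
 subname | subsyncinitialschema
---------+----------------------
 sub1    | p
(1 row)

Here 'p' would stand for SYNC_PENDING; the applier would run pg_restore only when it sees this value.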

8. The TableSync process should check the state of the table; if it is SUBREL_STATE_CREATE it should get the latest definition from the publisher and recreate the table. (We have to recreate the table even if there are no changes.) Then it should go into the usual copy-table mode, as sketched below.
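A minimal sketch of that sequence, assuming the proposed SHOW CREATE TABLE supplies the definition (table and column names are made up):

DROP TABLE IF EXISTS public.t1 CASCADE;
-- definition fetched fresh from the publisher at tablesync start
CREATE TABLE public.t1 (a integer PRIMARY KEY, b text);
-- then the normal initial data copy, as tablesync already does
COPY public.t1 FROM STDIN;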

It might seem that TableSync is doing duplicate work already done by pg_restore. We are doing it this way because of concurrent DDLs and the REFRESH PUBLICATION command.

Concurrent DDL :-
A user can execute a DDL command on table t1 at the same time the subscriber is trying to sync it. Pictorial representation: https://imgur.com/a/ivrIEv8 [1]

The tablesync process makes its own connection to the publisher, so it sees a table state that can be in the future with respect to what the applier has replayed, which can introduce conflicts.
For example:-

CASE 1:- { The publisher removed column b from table t1 while the subscriber was doing pg_restore (or at any point in the concurrent DDL window described in picture [1]). When the tableSync process starts its transaction on the publisher, it will request the data of table t1 including column b, which no longer exists on the publisher. } That is why the tableSync process asks for the latest definition.

Even if we delay the tableSync worker till all the DDL related to table t1 has been applied by the applier process, we can still have a window where the publisher issues a DDL command just before tableSync starts its transaction, making the tableSync and publisher table definitions incompatible (thanks to Masahiko for pointing out this race condition).
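A simplified timeline of CASE 1 (condensed from [1]):

t0  subscriber: creates replication slot, exports snapshot
t1  subscriber: pg_restore creates t1 (a int, b text)
t2  publisher:  ALTER TABLE t1 DROP COLUMN b
t3  subscriber: tableSync starts, requests data for t1 including column b -> fails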

The applier process will skip all DDL/DML related to table t1, and tableSync will apply those in the catchup phase.
There is one open issue, though: what will happen to views or functions which depend on the table? I think they should wait till the table state is > SUBREL_STATE_CREATE (meaning we have the latest schema definition from the publisher).
There might be corner cases to this approach, or maybe a better way to handle concurrent DDL. One simple solution might be to disallow DDLs on the publisher till all the schema is synced and all tables have state >= SUBREL_STATE_DATASYNC (we can have the CASE 1 issue even with DDL replication, so we have to wait till all the tables have table state > SUBREL_STATE_DATASYNC). That might be a big window for big databases.


Refresh publication :-
In REFRESH PUBLICATION the subscriber does not create a new replication slot, hence we can't run pg_dump with a snapshot that starts from the origin (maybe this is not an issue at all). In this case it makes more sense for the tableSync worker to do the schema sync.


If the community is happy with the above design, I can start working on a prototype.

Credits :- This design is inspired by Pglogical. Also thanks to Zane, Masahiko, and Amit for reviewing earlier designs.

Regards
Sachin Kumar
Amazon Web Services
