Hi Everyone,

I am working on initial schema sync for logical replication. Currently, users have to manually create the schema on the subscriber side. The aim of this feature is to add an option to CREATE SUBSCRIPTION so that the schema sync can be automatic. I am sharing the design doc below, but there are some corner cases where the design does not work. Please share your opinion on whether the design can be improved so that we can get rid of the corner cases. This design is loosely based on pglogical. DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)
SQL Changes:-

CREATE SUBSCRIPTION subscription_name
    CONNECTION 'conninfo'
    PUBLICATION publication_name [, ...]
    [ WITH ( subscription_parameter [= value] [, ... ] ) ]

sync_initial_schema (enum) will be added to subscription_parameter. It can have 3 values: TABLES, ALL, NONE (default). With ALL everything will be synced, including global objects.

Restrictions:- sync_initial_schema=ALL can only be used for publications created with FOR ALL TABLES.

Design:-

Publisher:-
The publisher has to implement `SHOW CREATE TABLE_NAME`; this table definition will be used by the subscriber to create the exact schema of the table on the subscriber side. An alternative would be to do it on the subscriber side itself: we could create a function similar to describeOneTableDetails and call it from the subscriber. We also need to maintain the same ownership as on the publisher. The publisher should also have publication of DDL commands turned on.

Subscriber:-

1. In CreateSubscription(), when we create the replication slot (walrcv_create_slot()), we should use CRS_EXPORT_SNAPSHOT so that we can use this snapshot later in pg_dump.

2. Now we can call pg_dump with the above snapshot from CreateSubscription. This happens inside the opts.connect && opts.create_slot if statement. If we fail in this step, we have to drop the replication slot and create a new one, because we need a snapshot and creating a replication slot is the way to get one. The reason for running pg_dump with the above snapshot is that we don't want to execute the DDLs in the WAL twice: with the above snapshot we get the state of the database as of the replication slot's origin, and any changes after the snapshot will be in the WAL. We will save the pg_dump output into a file (custom archive format), so the pg_dump invocation will be similar to

pg_dump --connection_string --schema-only --snapshot=xyz -Fc --file initSchema

If sync_initial_schema=TABLES, we don't have to call pg_dump/pg_restore at all; the tablesync process will take care of it.
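Put together, steps 1 and 2 might look like the following sketch. Everything here is illustrative: the snapshot name stands in for whatever walrcv_create_slot(CRS_EXPORT_SNAPSHOT) would export, the connection string is hypothetical, and the command is echoed rather than run so the sketch is self-contained.

```shell
# Hypothetical exported snapshot name; in reality this comes from the
# replication slot created with CRS_EXPORT_SNAPSHOT.
SNAPSHOT="00000003-0000001B-1"
CONN="host=publisher dbname=postgres"   # placeholder connection string

# Step 2: schema-only dump pinned to the slot's snapshot, custom archive
# format, saved to a file for the apply worker to restore later.
DUMP_CMD="pg_dump \"$CONN\" --schema-only --snapshot=$SNAPSHOT -Fc --file initSchema"
echo "$DUMP_CMD"
```

The key point the sketch captures is that the dump is tied to the slot's exported snapshot, so everything after that point arrives via the WAL instead.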
3. If we have to sync global objects, we need to call pg_dumpall --globals-only as well. But pg_dumpall does not support the --snapshot option, so if a user creates a new global object between the creation of the replication slot and the run of pg_dumpall, that global object will be created twice on the subscriber, which will make the apply process error out.

4. walrcv_disconnect should be called only after pg_dump has finished, otherwise the snapshot will no longer be valid.

5. Users with only the REPLICATION role cannot run pg_dump, so the replication user has to be a superuser. This is a major problem:

postgres=# create role s4 WITH LOGIN Replication;
CREATE ROLE

$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR: permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE

6. The pg_subscription_rel column srsubstate will get one more state, SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled, we will set the table state to 'c'.

The above 6 steps will be done even if the subscription is not enabled, as long as connect is true.

7. The leader apply process should check whether the initSchema file exists; if it does, it should call pg_restore. We are not using the --pre-data and --post-data sections as pglogical does, because post-data assumes the table already has data, but we will fill the data into the tables at a later stage. pg_restore can be called like this:

pg_restore --connection_string -1 initSchema

The -1 option executes every command inside one transaction; if there is any error, everything is rolled back. pg_restore should be called quite early in the apply process code, before any tablesync process can be created. Instead of checking whether the file exists, maybe the pg_subscription catalog can be extended with a SyncInitialSchema column, and the apply process will check SyncInitialSchema == SYNC_PENDING.
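The single-transaction restore in step 7 could be sketched as below. The connection string and archive name are placeholders, and the command is only echoed so the sketch stands on its own.

```shell
# Step 7 sketch: the leader apply worker restores the archived schema in
# one transaction, so any failure rolls the whole restore back.
RESTORE_CMD="pg_restore -d \"host=subscriber dbname=postgres\" -1 initSchema"
echo "$RESTORE_CMD"   # -1: wrap all restore commands in a single transaction
```

Running the restore under -1 is what makes a failed CREATE SUBSCRIPTION cleanup-free on the schema side: either the whole dump applies or none of it does.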
8. The tablesync process should check the state of the table; if it is SUBREL_STATE_CREATE, it should get the latest definition from the publisher and recreate the table. (We have to recreate the table even if there are no changes.) Then it should go into the usual table copy mode. It might seem that tablesync is duplicating work already done by pg_restore; we do it this way because of concurrent DDLs and the REFRESH PUBLICATION command.

Concurrent DDL:-

A user can execute a DDL command on table t1 at the same time the subscriber is trying to sync it. Pictorial representation: https://imgur.com/a/ivrIEv8 [1]

The tablesync process makes its own connection to the publisher, and so it sees a table state that can be in the future with respect to what the apply process has applied, which can introduce conflicts. For example:

CASE 1:- The publisher removed column b from table t1 while the subscriber was doing pg_restore (or at any point in the concurrent DDL window described in picture [1]). When the tablesync process starts its transaction on the publisher, it will request data of table t1 including column b, which no longer exists on the publisher.

That is why the tablesync process asks for the latest definition. Even if we delay the tablesync worker until all the DDL related to table t1 has been applied by the apply process, there is still a window in which the publisher issues a DDL command just before tablesync starts its transaction, making the tablesync and publisher table definitions incompatible (thanks to Masahiko for pointing out this race condition).

The apply process will skip all DDLs/DMLs related to table t1, and tablesync will apply those in the catchup phase. There is one open issue, though: what happens to views or functions which depend on the table? I think they should wait till the table state is > SUBREL_STATE_CREATE (meaning we have the latest schema definition from the publisher).
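The recreate-before-copy decision described in step 8 can be summarized in a toy sketch. The 'c' state is the new SUBREL_STATE_CREATE proposed above; 'i' and 'd' are the existing SUBREL_STATE_INIT and SUBREL_STATE_DATASYNC states. The actions are just descriptive strings, not real worker code.

```shell
# Toy model of the proposed tablesync decision on srsubstate.
tablesync_action() {
  case "$1" in
    c)   echo "fetch latest definition from publisher, recreate table, then copy" ;;
    i|d) echo "copy data as today" ;;
    *)   echo "catchup / ready path" ;;
  esac
}

tablesync_action c
```

The point is that only the new 'c' state triggers the re-fetch of the definition; once the table is past SUBREL_STATE_CREATE, the flow is the existing one.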
There might be corner cases to this approach, or maybe a better way to handle concurrent DDL. One simple solution might be to disallow DDLs on the publisher till all the schema is synced and all tables have state >= SUBREL_STATE_DATASYNC (we can have the CASE 1 issue even with DDL replication, so we have to wait till all the tables have table state > SUBREL_STATE_DATASYNC), but that might be a big window for big databases.

Refresh publication:-

On REFRESH PUBLICATION the subscriber does not create a new replication slot, hence we can't run pg_dump with a snapshot which starts from the slot's origin (maybe this is not an issue at all). In this case it makes more sense for the tablesync worker to do the schema sync.

If the community is happy with the above design, I can start working on a prototype.

Credits:- This design is inspired by pglogical. Also thanks to Zane, Masahiko, and Amit for reviewing earlier designs.

Regards
Sachin Kumar
Amazon Web Services