Initial Schema Sync for Logical Replication

2023-03-15 Thread Kumar, Sachin
Hi Everyone,

I am working on the initial schema sync for logical replication. Currently, users have
to create the schema manually on the subscriber side. The aim of this feature is to add
an option to CREATE SUBSCRIPTION so that schema sync can be automatic. I am sharing the
design below, but there are some corner cases where the design does not work. Please
share your opinion on whether the design can be improved so that we can get rid of those
corner cases. This design is loosely based on Pglogical.
DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)

SQL Changes:-
CREATE SUBSCRIPTION subscription_name
CONNECTION 'conninfo'
PUBLICATION publication_name [, ...]
[ WITH ( subscription_parameter [= value] [, ... ] ) ]
sync_initial_schema (enum) will be added as a new subscription_parameter.
It can have three values:
TABLES, ALL, NONE (default)
With ALL, everything will be synced, including global objects.

Restrictions:- sync_initial_schema=ALL can only be used with a publication created
FOR ALL TABLES.
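
For illustration, the proposed option would be used roughly like this (sync_initial_schema
is not an existing parameter; connection and object names are made up):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pubhost dbname=postgres user=repuser'
    PUBLICATION mypub
    WITH (sync_initial_schema = 'TABLES');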

Design:-

Publisher :-
The publisher has to implement `SHOW CREATE TABLE_NAME`; this table definition will be
used by the subscriber to create the exact schema of the table on the subscriber. One
alternative is to do it on the subscriber side itself: we could create a function similar
to describeOneTableDetails and call it from the subscriber. We also need to maintain the
same ownership as on the publisher.

The publisher should also have publication of DDL commands turned on.

Subscriber :-

1. In CreateSubscription(), when we create the replication slot (walrcv_create_slot()),
we should use CRS_EXPORT_SNAPSHOT so that we can use this snapshot later in pg_dump.
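
For reference, a rough sketch of the walsender command that exporting the snapshot
corresponds to (the exact option spelling varies across server versions; older releases
use the EXPORT_SNAPSHOT keyword, and the slot name is illustrative). The snapshot name it
returns is what pg_dump would consume via --snapshot:

CREATE_REPLICATION_SLOT mysub_slot LOGICAL pgoutput (SNAPSHOT 'export');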

2. Now we can call pg_dump with the above snapshot from CreateSubscription(). This is
done inside the opts.connect && opts.create_slot if statement. If we fail in this step,
we have to drop the replication slot and create a new one, because we need a snapshot
and creating a replication slot is the way to get one. The reason for running pg_dump
with the above snapshot is that we don't want to execute the DDLs present in the WAL a
second time. With the above snapshot we get the state of the database as of the
replication slot's origin, and any changes after the snapshot will be in the WAL.

We will save the pg_dump output into a file (custom archive format), so the pg_dump
invocation will be similar to:
pg_dump 'conninfo' --schema-only --snapshot=xyz -Fc --file=initSchema

If sync_initial_schema=TABLES we don't have to call pg_dump/pg_restore at all; the
TableSync process will take care of it.

3. If we have to sync global objects, we also need to call pg_dumpall --globals-only.
But pg_dumpall does not support the --snapshot option, so if a user creates a new global
object between creation of the replication slot and running pg_dumpall, that global
object will be created twice on the subscriber, which will error out the applier
process.
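
As an illustration of that window (the role name is invented): a global object created on
the publisher after the slot's snapshot but before pg_dumpall runs is emitted by the dump
and also arrives via DDL replication, so the subscriber ends up executing it twice:

CREATE ROLE analytics;
ERROR:  role "analytics" already exists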

4. walrcv_disconnect() should be called only after pg_dump is finished; otherwise the
snapshot will not be valid.

5. Users with only the REPLICATION role cannot run pg_dump, so the replication user has
to be a superuser. This is a major problem.
postgres=# create role s4 WITH LOGIN Replication;
CREATE ROLE
$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR:  permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE

6. The pg_subscription_rel column srsubstate will get one more state,
SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled, we will set the table state
to 'c'.
The above 6 steps will be done even if the subscription is not enabled, as long as
connect is true.
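
A minimal sketch of how that state would show up in the catalog (the 'c' state is part of
this proposal, not an existing value):

SELECT srrelid::regclass AS rel, srsubstate
FROM pg_subscription_rel
WHERE srsubstate = 'c';   -- tables whose schema still has to be created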

7. The leader applier process should check if the initSchema dump file exists, and if
so, it should call pg_restore. We are not using the --pre-data and --post-data sections
as Pglogical does, because the post-data section assumes the tables already have data,
but we will fill the tables with data at a later stage. pg_restore can be called like
this:

pg_restore -d 'conninfo' -1 file_name
The -1 option executes every command inside one transaction; if there is any error,
everything will be rolled back.
pg_restore should be called quite early in the applier process code, before any
tablesync process can be created.
Instead of checking whether the file exists, maybe the pg_subscription table can be
extended with a column SyncInitialSchema, and the applier process will check
SyncInitialSchema == SYNC_PENDING.

8. The TableSync process should check the state of the table; if it is
SUBREL_STATE_CREATE, it should get the latest definition from the publisher and recreate
the table. (We have to recreate the table even if there are no changes.) Then it should
go into copy-table mode as usual.
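
A conceptual sketch of what the tablesync worker would run for a table in the proposed
'c' state (table and column names are illustrative; the real definition comes from the
publisher):

DROP TABLE IF EXISTS public.t1;
CREATE TABLE public.t1 (c1 int, c2 int, c3 int);
-- the existing copy phase then follows, e.g. COPY public.t1 FROM STDIN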

It might seem that TableSync is doing duplicate work already done by pg_restore. We are
doing it this way because of concurrent DDLs and refresh publication co

RE: Initial Schema Sync for Logical Replication

2023-03-16 Thread Kumar, Sachin
Hi Peter,

> Hi,
> 
> I have a couple of questions.
> 
> Q1.
> 
> What happens if the subscriber already has some tables present? For
> example, I did not see the post saying anything like "Only if the table does
> not already exist then it will be created".
> 
My assumption was that if the subscriber is doing initial schema sync, it does not have
any conflicting database objects.
> On the contrary, the post seemed to say SUBREL_STATE_CREATE 'c' would
> *always* be set when this subscriber mode is enabled. And then it seemed
> to say the table would *always* get re-created by the tablesync in this new
> mode.
Right
> 
> Won't this cause problems
> - if the user wanted a slightly different subscriber-side table? (eg some 
> extra
> columns on the subscriber-side table)
> - if there was some pre-existing table data on the subscriber-side table that
> you now are about to re-create and clobber?
> 
> Or does the idea intend that the CREATE TABLE DDL that will be executed is
> like "CREATE TABLE ... IF NOT EXISTS"?
> 
pg_dump does not support --if-not-exists, but I think it can be added so that we get a
dump with IF NOT EXISTS.
On the subscriber side we get the table OID list; for an already existing table we can
change table_state = SUBREL_STATE_INIT so that it won't be recreated.
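
If such an option were added, the emitted DDL would simply look like this (table and
columns invented for illustration):

CREATE TABLE IF NOT EXISTS public.t1 (
    c1 integer,
    c2 integer
);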
> ~~~
> 
> Q2.
> 
> The post says. "DDL replication is required for this feature". And "It should
> also have turned on publication of DDL commands."
> 
> It wasn't entirely clear to me why those must be a requirement. Is that just 
> to
> make implementation easier?
DDL replication is needed to handle concurrent DDL, so that we don’t have to worry about
schema changes happening while the subscriber is initializing.
If we can block the publisher so that it does not run DDLs, or if the subscriber can
simply error out when it sees conflicting table information, then we don’t need to use
DDL replication.
Regards
Sachin
> 
> Sure, I see that the idea might have some (or maybe a lot?) of common
> internal code with the table DDL replication work, but OTOH an auto-create
> feature for subscriber tables seems like it might be a useful feature to have
> regardless of the value of the publication 'ddl' parameter.
> 
> --
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.


RE: Initial Schema Sync for Logical Replication

2023-03-20 Thread Kumar, Sachin
Hi Amit,

> From: Amit Kapila 
> > > Hi,
> > >
> > > I have a couple of questions.
> > >
> > > Q1.
> > >
> > > What happens if the subscriber already has some tables present? For
> > > example, I did not see the post saying anything like "Only if the
> > > table does not already exist then it will be created".
> > >
> > My assumption was the if subscriber is doing initial schema sync , It
> > does not have any conflicting database objects.
> >
> 
> Can't we simply error out in such a case with "obj already exists"?
> This would be similar to how we deal with conflicting rows with unique/primary
> keys.
Right, this is the default behaviour. We will run pg_restore with --single-transaction,
so if we get an error while executing a CREATE TABLE, the whole pg_restore will fail and
the user will be notified.
Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-03-20 Thread Kumar, Sachin
Hi Alvaro,

> From: Alvaro Herrera 
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> On 2023-Mar-15, Kumar, Sachin wrote:
> 
> > 1. In  CreateSubscription()  when we create replication
> > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> can use this snapshot later in the pg_dump.
> >
> > 2.  Now we can call pg_dump with above snapshot from CreateSubscription.
> 
> Overall I'm not on board with the idea that logical replication would depend 
> on
> pg_dump; that seems like it could run into all sorts of trouble (what if 
> calling
> external binaries requires additional security setup?  what about pg_hba
> connection requirements? what about max_connections in tight
> circumstances?).
> what if calling external binaries requires additional security setup
I am not sure what kind of security restriction would apply in this case; maybe the
pg_dump binary could be changed?
> what about pg_hba connection requirements?
We will use the same connection string that the subscriber process uses to connect to
the publisher.
>what about max_connections in tight circumstances?
Right, that might be an issue, but I don’t think it will be a big one. We will create
the dump of the database in the CreateSubscription() function itself, so before the
tableSync process even starts, if we have already reached max_connections while calling
pg_dump, tableSync won't be successful either.
> It would be much better, I think, to handle this internally in the publisher 
> instead:
> similar to how DDL sync would work, except it'd somehow generate the CREATE
> statements from the existing tables instead of waiting for DDL events to 
> occur.  I
> grant that this does require writing a bunch of new code for each object 
> type, a
> lot of which would duplicate the pg_dump logic, but it would probably be a lot
> more robust.
Agreed, but we might have a lot of code duplication; essentially almost all of the
pg_dump code would need to be duplicated, which might cause issues when modifying/adding
new DDLs.
I am not sure, but maybe it's possible to move the code pg_dump depends on to the
common/ folder to avoid duplication.

Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-03-22 Thread Kumar, Sachin
> From: Amit Kapila 
> Sent: Wednesday, March 22, 2023 5:16 AM
> To: Masahiko Sawada 
> Cc: Euler Taveira ; Kumar, Sachin
> ; Alvaro Herrera ; pgsql-
> hack...@lists.postgresql.org; Jonathan S. Katz 
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> 
> CAUTION: This email originated from outside of the organization. Do not click
> links or open attachments unless you can confirm the sender and know the
> content is safe.
> 
> 
> 
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada
>  wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila 
> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira 
> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these
> > > > objects. Another issue is related to different version. Let's say
> > > > the publisher has a version ahead of the subscriber version, a new
> > > > table syntax can easily break your logical replication setup. IMO
> > > > pg_dump doesn't seem like a good solution for initial synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the
> > > > required DDL commands for the specific (set of) tables. This can
> > > > work around the issues from the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from
> > > the newer version are as follows: (a) there could be more columns in
> > > the newer version of the system catalog and then Select * type of
> > > stuff won't work because the client won't have knowledge of
> > > additional columns. (b) the newer version could have new features
> > > (represented by say new columns in existing catalogs or new
> > > catalogs) that the older version of pg_dump has no knowledge of and
> > > will fail to get that data and hence an inconsistent dump. The
> > > subscriber will easily be not in sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version
> > > of functionality similar to pg_dump for selected objects? I guess we
> > > will face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
> 
> As per my understanding, it is mostly due to the reason that it can work
> today. Today, during an off-list discussion with Jonathan on this point, he
> pointed me to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc[1]. Also, please note that this applies not
> only to initial sync but also to schema sync during replication. I don't 
> think it
> would be feasible to keep such cross-version compatibility for DDL
> replication.
> 
> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is 
> to
> primarily write a POC patch to see what difficulties we may face. The other
> options that we could try out are (a) try to duplicate parts of pg_dump code
> in some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code (probably
> as a library or some other way) for the required functionality. There could be
> more possibilities that we may not have thought of yet. But the main point is
> that for approaches other than using pg_dump, we should consider ways to
> avoid duplicity of various parts of its code. Due to this, I think before 
> ruling
> out using pg_dump, we should be clear about its risks and limitations.
> 
> Thoughts?
There is one more thing that needs to be considered: even if we use pg_dump/pg_restore,
we still need a way to get the CREATE TABLE statement for individual tables if we want
to support concurrent DDLs on the publisher.
>8. TableSync process should check the state of table , if it is 
>SUBREL_STATE_CREATE it should
>get the latest definition from the publisher and recreate the table. (We have 
>to recreate
>the table even if there are no changes). Then it should go into copy table 
>mode as usual.
Unless there is a different way to support concurrent DDLs, or we go for blocking the
publisher till the initial sync is completed.
Regards
Sachin
> 
> [1] - https://dev.mysql.com/doc/refman/8.0/en/replication-
> compatibility.html
> [2] - https://www.postgresql.org/message-
> id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-
> uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com
> 
> --
> With Regards,
> Amit Kapila.


RE: Initial Schema Sync for Logical Replication

2023-03-23 Thread Kumar, Sachin
> From: Amit Kapila 
> IIUC, this is possible only if tablesync process uses a snapshot different 
> than the
> snapshot we have used to perform the initial schema sync, otherwise, this
> shouldn't be a problem. Let me try to explain my understanding with an example
> (the LSNs used are just explain the
> problem):
> 
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
> 
> Now, say before starting tablesync worker, apply process performs initial
> schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> tablesync process to allow the initial copy of data in t1.
> Here, if the table sync process tries to establish a new snapshot, it may get 
> data
> till LSN 130 and when it will try to copy the same in subscriber it will 
> fail. Is my
> understanding correct about the problem you described?
Right
> If so, can't we allow
> tablesync process to use the same exported snapshot as we used for the initial
> schema sync and won't that solve the problem you described?
I think we won't be able to use the same snapshot because the transaction will have been
committed.
In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot()
until walrcv_disconnect() is called. (I am not sure about this part; maybe
walrcv_disconnect() commits internally?)
So somehow we need to keep this snapshot alive even after the transaction is committed
(or delay committing the transaction; but we can have CREATE SUBSCRIPTION with
ENABLED=FALSE, so there can be a restart before tableSync is able to use the same
snapshot).
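
For context, this is the standard mechanism (not specific to this patch) that makes the
lifetime issue visible: an exported snapshot can be adopted by another session only
while the exporting transaction is still open; the snapshot name below is a placeholder:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
-- queries here (e.g. the table copy) see the exporter's view of the database
COMMIT;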
> > Refresh publication :-
> >
> > In refresh publication, subscriber does create a new replication slot
Typo-> subscriber does not
> > hence , we can’t run
> >
> > pg_dump with a snapshot which starts from origin(maybe this is not an
> > issue at all). In this case
> >
> > it makes more sense for tableSync worker to do schema sync.
> >
> 
> Can you please explain this problem with some examples?
I think we can have the same issue as you mentioned:
a new table t1 is added to the publication, and the user does a refresh publication.
pg_dump/pg_restore restores the table definition, but before tableSync can start,
steps 2 to 5 happen on the publisher.
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
And table sync errors out.
There can be one more issue, since we took the pg_dump without a snapshot (with respect
to the replication slot). (I am not 100 percent sure about this.)
Let's imagine the applier process is lagging behind the publisher.
Events on the publisher:
1. alter t1 drop column c; LSN 100 <-- applier process tries to execute this DDL
2. alter t1 drop column d; LSN 110
3. insert into t1 values(..); LSN 120 <-- (refresh publication called)
pg_dump/restore restores this version
The applier process executing 1 will fail because t1 does not have column c.
Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-03-24 Thread Kumar, Sachin
> From: Amit Kapila 
> > I think we won't be able to use same snapshot because the transaction will
> > be committed.
> > In CreateSubscription() we can use the transaction snapshot from
> > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure
> > about this part maybe walrcv_disconnect() calls the commits internally ?).
> > So somehow we need to keep this snapshot alive, even after transaction
> > is committed(or delay committing the transaction , but we can have
> > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > before tableSync is able to use the same snapshot.)
> >
> 
> Can we think of getting the table data as well along with schema via
> pg_dump? Won't then both schema and initial data will correspond to the
> same snapshot?

Right, that will work. Thanks!

> > I think we can have same issues as you mentioned New table t1 is added
> > to the publication , User does a refresh publication.
> > pg_dump / pg_restore restores the table definition. But before
> > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > 5. Insert t1 (3, 3, 3); --LSN 130
> > And table sync errors out
> > There can be one more issue , since we took the pg_dump without
> snapshot (wrt to replication slot).
> >
> 
> To avoid both the problems mentioned for Refresh Publication, we can do
> one of the following: (a) create a new slot along with a snapshot for this
> operation and drop it afterward; or (b) using the existing slot, establish a
> new snapshot using a technique proposed in email [1].
> 

Thanks, I think option (b) will be perfect, since we don’t have to create a new 
slot.

Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-03-24 Thread Kumar, Sachin

> I am not sure if it's feasible to support the use case of replicating DDL to an old
> subscriber.
>

+1
 
> First, I think the current publisher doesn't know the version number of
> client(subscriber) so we need to check the feasibility of same. Also, having
> client's version number checks doesn't seem to be a good idea.
> 
> Besides, I thought about the problems that will happen if we try to support
> replicating New PG to older PG. The following examples assume that we
> support the DDL replication in the mentioned PG.
> 
> 1) Assume we want to replicate from a newer PG to a older PG where
> partition
>table has not been introduced. I think even if the publisher is aware of
>that, it doesn't have a good way to transform the partition related
> command,
>maybe one could say we can transform that to inherit table, but I feel that
>introduces too much complexity.
> 
> 2) Another example is generated column. To replicate the newer PG which
> has
>this feature to a older PG without this. I am concerned that is there a way
>to transform this without causing inconsistent behavior.
> 
> Even if we decide to simply skip sending such unsupported commands or
> skip applying them, then it's likely that the following dml replication will
> cause data inconsistency.
> 
> So, it seems we cannot completely support this use case, there would be
> some limitations. Personally, I am not sure if it's worth introducing
> complexity to support it partially.
> 

+1

Regards
Sachin




RE: Initial Schema Sync for Logical Replication

2023-03-29 Thread Kumar, Sachin
> > > > From: Amit Kapila 
> > > > > I think we won't be able to use same snapshot because the
> > > > > transaction will be committed.
> > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am
> > > > > not sure about this part maybe walrcv_disconnect() calls the commits
> internally ?).
> > > > > So somehow we need to keep this snapshot alive, even after
> > > > > transaction is committed(or delay committing the transaction ,
> > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we
> > > > > can have a restart before tableSync is able to use the same
> > > > > snapshot.)
> > > > >
> > > >
> > > > Can we think of getting the table data as well along with schema
> > > > via pg_dump? Won't then both schema and initial data will
> > > > correspond to the same snapshot?
> > >
> > > Right , that will work, Thanks!
> >
> > While it works, we cannot get the initial data in parallel, no?
> >

I was thinking each TableSync process would call pg_dump --table; this way, if we have N
tableSync processes, we can have N pg_dump --table=table_name invocations running in
parallel.
In fact we can use --schema-only to get the schema and then let COPY take care of data
syncing. We will use the same snapshot for pg_dump as well as for the COPY of the table.

Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-03-29 Thread Kumar, Sachin
> From: Masahiko Sawada 
> > 
> > One related idea is that currently, we fetch the table list
> > corresponding to publications in subscription and create the entries
> > for those in pg_subscription_rel during Create Subscription, can we
> > think of postponing that work till after the initial schema sync? We
> > seem to be already storing publications list in pg_subscription, so it
> > appears possible if we somehow remember the value of copy_data. If
> > this is feasible then I think that may give us the flexibility to
> > perform the initial sync at a later point by the background worker.

Maybe we need to add a column to pg_subscription to store the copy_data state?

> 
> It sounds possible. With this idea, we will be able to have the apply worker
> restore the table schemas (and create pg_subscription_rel
> entries) as the first thing. Another point we might need to consider is that 
> the
> initial schema sync (i.e. creating tables) and creating pg_subscription_rel 
> entries
> need to be done in the same transaction.
> Otherwise, we could end up committing either one change. I think it depends on
> how we restore the schema data.

I think we have to add one more column to pg_subscription to track the initial sync
state {OFF, SCHEMA_DUMPED, SCHEMA_RESTORED, COMPLETED} (COMPLETED indicating that
pg_subscription_rel has been filled). I don’t think we will be able to do the pg_restore
and the pg_subscription_rel updates in a single transaction, but we can use
initial_sync_state to resume from where we left off after an abrupt crash/shutdown.
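
A hypothetical sketch of how the apply worker could use such a column to resume after a
crash (the column name and state values are invented for this proposal):

SELECT subname, subinitsyncstate
FROM pg_subscription
WHERE subinitsyncstate <> 'completed';   -- subscriptions whose initial sync must resume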

Regards
Sachin





RE: Initial Schema Sync for Logical Replication

2023-04-02 Thread Kumar, Sachin


> -Original Message-
> From: Masahiko Sawada 
> > > I was thinking each TableSync process will call pg_dump --table,
> > > This way if we have N tableSync process, we can have N pg_dump --
> table=table_name called in parallel.
> > > In fact we can use --schema-only to get schema and then let COPY
> > > take care of data syncing . We will use same snapshot for pg_dump as well
> as COPY table.
> >
> > How can we postpone creating the pg_subscription_rel entries until the
> > tablesync worker starts and does the schema sync? I think that since
> > pg_subscription_rel entry needs the table OID, we need either to do
> > the schema sync before creating the entry (i.e, during CREATE
> > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > apply worker needs the information of tables to sync in order to
> > launch the tablesync workers, but it needs to create the table schema
> > to get that information.
> 
> For the above reason, I think that step 6 of the initial proposal won't work.
> 
> If we can have the tablesync worker create an entry of pg_subscription_rel 
> after
> creating the table, it may give us the flexibility to perform the initial 
> sync. One
> idea is that we add a relname field to pg_subscription_rel so that we can 
> create
> entries with relname instead of OID if the table is not created yet. Once the
> table is created, we clear the relname field and set the OID of the table 
> instead.
> It's not an ideal solution but we might make it simpler later.
> 
> Assuming that it's feasible, I'm considering another approach for the initial 
> sync
> in order to address the concurrent DDLs.
> 
> The basic idea is to somewhat follow how pg_dump/restore to dump/restore
> the database data. We divide the synchronization phase (including both schema
> and data) up into three phases: pre-data, table-data, post-data. These mostly
> follow the --section option of pg_dump.
> 
> 1. The backend process performing CREATE SUBSCRIPTION creates the
> subscription but doesn't create pg_subscription_rel entries yet.
> 
> 2. Before starting the streaming, the apply worker fetches the table list 
> from the
> publisher, create pg_subscription_rel entries for them, and dumps+restores
> database objects that tables could depend on but don't depend on tables such 
> as
> TYPE, OPERATOR, ACCESS METHOD etc (i.e.
> pre-data).

We will not have the slot's starting snapshot, so somehow we have to get a new snapshot
and skip all the WAL between the start of the slot and the snapshot creation LSN?

> 
> 3. The apply worker launches the tablesync workers for tables that need to be
> synchronized.
> 
> There might be DDLs executed on the publisher for tables before the tablesync
> worker starts. But the apply worker needs to apply DDLs for pre-data database
> objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database
> objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data).
> 
> 4. The tablesync worker creates its replication slot, dumps+restores the table
> schema, update the pg_subscription_rel, and perform COPY.
> 
> These operations should be done in the same transaction.

pg_restore won't be rollbackable, so we need to maintain states in pg_subscription_rel.

> 
> 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps
> constraints) of the table and creates them (which possibly takes a long time).
> Then it starts to catch up, same as today. The apply worker needs to wait for 
> the
> tablesync worker to catch up.

I don’t think we can have a CATCHUP stage. We can have a DDL on the publisher which adds
a new column (and this DDL will be executed by the applier later). Then we get an
INSERT, and because we have the old definition of the table, the insert will fail.

> 
> We need to repeat these steps until we complete the initial data copy and 
> create
> indexes for all tables, IOW until all pg_subscription_rel status becomes 
> READY.
> 
> 6. If the apply worker confirms all tables are READY, it starts another sync
> worker who is responsible for the post-data database objects such as TRIGGER,
> RULE, POLICY etc (i.e. post-data).
> 
> While the sync worker is starting up or working, the apply worker applies
> changes for pre-data database objects as well as READY tables.
We might have an issue if we have a DDL like
CREATE TABLE table_name AS SELECT * FROM materialized_view.
> 
> 7. Similar to the tablesync worker, this sync worker creates its replication 
> slot
> and sets the returned LSN somewhere, say pg_subscription.
> 
> 8. The sync worker dumps and restores these objects. Which could take a time
> since it would need to create FK constraints. Then it starts to catch up if 
> the
> apply worker is ahead. The apply worker waits for the sync worker to  catch 
> up.
> 
> 9. Once the sync worker catches up, the apply worker starts applying changes
> for all database objects.
> 
> IIUC with this approach, we can resolve the concurrent DDL problem Sachin
> mentioned, and indexe

RE: Initial Schema Sync for Logical Replication

2023-04-05 Thread Kumar, Sachin
> From: Masahiko Sawada 
> > >
> > > 3. The apply worker launches the tablesync workers for tables that
> > > need to be synchronized.
> > >
> > > There might be DDLs executed on the publisher for tables before the
> > > tablesync worker starts. But the apply worker needs to apply DDLs
> > > for pre-data database objects. OTOH, it can ignore DDLs for
> > > not-synced-yet tables and other database objects such as INDEX,
> TRIGGER, RULE, etc (i.e. post-data).
> > >
> > > 4. The tablesync worker creates its replication slot, dumps+restores
> > > the table schema, update the pg_subscription_rel, and perform COPY.
> > >
> > > These operations should be done in the same transaction.
> >
> > pg_restore wont be rollbackable, So we need to maintain states in
> pg_subscription_rel.
> 
> Yes. But I think it depends on how we restore them. For example, if we have
> the tablesync worker somethow restore the table using a new SQL function
> returning the table schema as we discussed or executing the dump file via
> SPI, we can do that in the same transaction.

okay

> 
> >
> > >
> > > 5. After finishing COPY, the tablesync worker dumps indexes (and
> > > perhaps
> > > constraints) of the table and creates them (which possibly takes a long
> time).
> > > Then it starts to catch up, same as today. The apply worker needs to
> > > wait for the tablesync worker to catch up.
> >
> > I don’t think we can have CATCHUP stage. We can have a DDL on
> > publisher which can add a new column (And this DDL will be executed by
> > applier later). Then we get a INSERT  because we have old definition of
> table, insert will fail.
> 
> All DMLs and DDLs associated with the table being synchronized are applied
> by the tablesync worker until it catches up with the apply worker.

Right, sorry, I forgot that in the above case, if the definition on the publisher
changes, we will also have corresponding DDLs.

> 
> >
> > >
> > > We need to repeat these steps until we complete the initial data
> > > copy and create indexes for all tables, IOW until all pg_subscription_rel
> status becomes READY.
> > >
> > > 6. If the apply worker confirms all tables are READY, it starts
> > > another sync worker who is responsible for the post-data database
> > > objects such as TRIGGER, RULE, POLICY etc (i.e. post-data).
> > >
> > > While the sync worker is starting up or working, the apply worker
> > > applies changes for pre-data database objects as well as READY tables.
> > We might have some issue if we have create table like Create
> > table_name as select * from materialized_view.
> 
> Could you elaborate on the scenario where we could have an issue with such
> DDL?

Since the publisher's materialized view has not been created on the subscriber yet, a
DDL which creates a table from that materialized view will fail. I am not sure how the
DDL patch is handling CREATE TABLE AS statements; if such a statement is rewritten to
become a normal CREATE TABLE, then we won't have any issues.
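
A small illustration of the ordering problem (object names invented): replaying this
publisher DDL on the subscriber fails while the materialized view, a post-data object in
this design, does not yet exist there:

CREATE TABLE archive_t AS SELECT * FROM some_matview;
ERROR:  relation "some_matview" does not exist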

> 
> > >
> > > 7. Similar to the tablesync worker, this sync worker creates its
> > > replication slot and sets the returned LSN somewhere, say
> pg_subscription.
> > >
> > > 8. The sync worker dumps and restores these objects. Which could
> > > take a time since it would need to create FK constraints. Then it
> > > starts to catch up if the apply worker is ahead. The apply worker waits 
> > > for
> the sync worker to  catch up.
> > >
> > > 9. Once the sync worker catches up, the apply worker starts applying
> > > changes for all database objects.
> > >
> > > IIUC with this approach, we can resolve the concurrent DDL problem
> > > Sachin mentioned, and indexes (and constraints) are created after the
> initial data copy.
> > >
> > > The procedures are still very complex and not fully considered yet
> > > but I hope there are some useful things at least for discussion.
> > >
> > > Probably we can start with supporting only tables. In this case, we
> > > would not need the post-data phase (i.e. step 6-9). It seems to me
> > > that we need to have the subscription state somewhere (maybe
> > > pg_subscription) so that the apply worker figure out the next step.
> > > Since we need to dump and restore different objects on different
> > > timings, we probably cannot directly use pg_dump/pg_restore. I've
> > > not considered how the concurrent REFRESH PUBLICATION works.
> >
> > I think above prototype will work and will have least amount of side
> > effects, but It might be too complex to implement and I am not sure about
> corner cases.
> >
> > I was thinking of other ways of doing Initial Sync , which are less
> > complex but each with separate set of bottlenecks
> >
> > On Publisher Side:-
> > 1) Locking the publisher:- Easiest one to implement, applier process
> > will get Access Shared lock on the all the published tables. (We don't
> > have to worry newly created concurrent table) As tableSync will finish
> > syncing the table, it will release table lock, So we will release table 
> > locks

RE: Initial Schema Sync for Logical Replication

2023-04-18 Thread Kumar, Sachin
> From: Masahiko Sawada 
> > > While writing a PoC patch, I found some difficulties in this idea.
> > > First, I tried to add schemaname+relname to pg_subscription_rel but
> > > I could not define the primary key of pg_subscription_rel. The
> > > primary key on (srsubid, srrelid) doesn't work since srrelid could be 
> > > NULL.
> > > Similarly, the primary key on (srsubid, srrelid, schemaname,
> > > relname) also doesn't work.
> > >
> >
> > Can we think of having a separate catalog table say
> > pg_subscription_remote_rel for this? You can have srsubid,
> > remote_schema_name, remote_rel_name, etc. We may need some other
> state
> > to be maintained during the initial schema sync where this table can
> > be used. Basically, this can be used to maintain the state till the
> > initial schema sync is complete because we can create a relation entry
> > in pg_subscritption_rel only after the initial schema sync is
> > complete.
> 
> It might not be ideal but I guess it works. But I think we need to modify the 
> name
> of replication slot for initial sync as it currently includes OID of the 
> table:
> 
> void
> ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> char *syncslotname, Size szslot) {
> snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
>  relid, GetSystemIdentifier()); }
> 
> If we use both schema name and table name, it's possible that slot names are
> duplicated if schema and/or table names are long. Another idea is to use the
> hash value of schema+table names, but it cannot completely eliminate that
> possibility, and probably would make investigation and debugging hard in case
> of any failure. Probably we can use the OID of each entry in
> pg_subscription_remote_rel instead, but I'm not sure it's a good idea, mainly
> because we will end up using twice as many OIDs as before.

Maybe we can create a serial primary key for the pg_subscription_remote_rel table and
use that key for naming the replication slot?
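
A conceptual sketch of the suggested catalog (all names are illustrative; this catalog
does not exist); the generated key, rather than a table OID, would then feed the sync
slot name:

CREATE TABLE pg_subscription_remote_rel (
    srremoteid          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    srsubid             oid  NOT NULL,
    remote_schema_name  name NOT NULL,
    remote_rel_name     name NOT NULL
);
-- the slot name would then be built as pg_<suboid>_sync_<srremoteid>_<sysid>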

Regards
Sachin


RE: Initial Schema Sync for Logical Replication

2023-04-20 Thread Kumar, Sachin
I am working on a prototype with the above idea, and will send it for review by
Sunday/Monday.

Regards
Sachin


Re: Initial Schema Sync for Logical Replication

2023-04-20 Thread Kumar, Sachin
I am working on a prototype with the above discussed idea; I think I will send it for
initial review by Monday.

Regards
Sachin



RE: Initial Schema Sync for Logical Replication

2023-10-20 Thread Kumar, Sachin

> From: vignesh C 
> Sent: Thursday, October 19, 2023 10:41 AM
> Can you rebase the patch and post the complete set of required changes for
> the concurrent DDL, I will have a look at them.

Sure, I will try to send the complete rebased patch within a week.

Regards
Sachin



Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Kumar, Sachin




Hi everyone, I want to continue this thread. I have rebased the patch to the latest
master and fixed an issue when pg_restore prints to a file.

$ pg_restore dump_small.custom --restore-blob-batch-size=2 --file=a
--
-- End BLOB restore batch
--
COMMIT;

> On 09/11/2023, 17:05, "Jacob Champion" wrote:
> To clarify, I agree that pg_dump should contain the core fix. What I'm
> questioning is the addition of --dump-options to make use of that fix
> from pg_upgrade, since it also lets the user do "exciting" new things
> like --exclude-schema and --include-foreign-data and so on. I don't
> think we should let them do that without a good reason.

The earlier idea was to not expose these options to users and instead use [1]:
   --restore-jobs=NUM             --jobs parameter passed to pg_restore
   --restore-blob-batch-size=NUM  number of blobs restored in one xact
But this was later expanded to use --dump-options and --restore-options [2].
With --restore-options the user can pass --exclude-schema, so maybe we can go back
to [1].

[1] 
https://www.postgresql.org/message-id/a1e200e6-adde-2561-422b-a166ec084e3b%40wi3ck.info
[2] 
https://www.postgresql.org/message-id/8d8d3961-8e8b-3dbe-f911-6f418c5fb1d3%40wi3ck.info

Regards
Sachin
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2023-11-13 Thread Kumar, Sachin
> On 09/11/2023, 18:41, "Tom Lane" wrote:
> Um ... you didn't attach the patch?

Sorry, patch attached.

Regards
Sachin




pg_upgrade_improvements_v6.diff
Description: pg_upgrade_improvements_v6.diff


Re: pg_upgrade failing for 200+ million Large Objects

2023-12-04 Thread Kumar, Sachin


> "Tom Lane" mailto:t...@sss.pgh.pa.us>> wrote:

> FWIW, I agree with Jacob's concern about it being a bad idea to let
> users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore.
> I think we'd regret going there, because it'd hugely expand the set
> of cases pg_upgrade has to deal with.

> Also, pg_upgrade is often invoked indirectly via scripts, so I do
> not especially buy the idea that we're going to get useful control
> input from some human somewhere. I think we'd be better off to
> assume that pg_upgrade is on its own to manage the process, so that
> if we need to switch strategies based on object count or whatever,
> we should put in a heuristic to choose the strategy automatically.
> It might not be perfect, but that will give better results for the
> pretty large fraction of users who are not going to mess with
> weird little switches.


I have updated the patch to use a heuristic: during pg_upgrade we count large objects
per database. During pg_restore execution, if the database's large-object count is
greater than LARGE_OBJECTS_THRESOLD (1k), we will use --restore-blob-batch-size.
I also modified the pg_upgrade --jobs behavior if we have large objects
(> LARGE_OBJECTS_THRESOLD):

+  /*  Restore all the dbs where LARGE_OBJECTS_THRESOLD is not breached */
+  restore_dbs(stats, true);
+  /* reap all children */
+  while (reap_child(true) == true)
+ ;
+  /*  Restore rest of the dbs one by one with pg_restore --jobs = user_opts.jobs */
+  restore_dbs(stats, false);
   /* reap all children */
   while (reap_child(true) == true)
  ;

Regards
Sachin











pg_upgrade_improvements_v7.diff
Description: pg_upgrade_improvements_v7.diff


Re: pg_upgrade failing for 200+ million Large Objects

2023-12-07 Thread Kumar, Sachin

> I have updated the patch to use heuristic, During pg_upgrade we count
> Large objects per database. During pg_restore execution if db large_objects
> count is greater than LARGE_OBJECTS_THRESOLD (1k) we will use 
> --restore-blob-batch-size.


I think both SECTION_DATA and SECTION_POST_DATA can be parallelized by pg_restore, so
instead of counting large objects for the heuristic, we can count SECTION_DATA +
SECTION_POST_DATA entries.

Regards
Sachin



Re: pg_upgrade failing for 200+ million Large Objects

2024-01-02 Thread Kumar, Sachin
> On 11/12/2023, 01:43, "Tom Lane" wrote:

> I had initially supposed that in a parallel restore we could
> have child workers also commit after every N TOC items, but was
> soon disabused of that idea. After a worker processes a TOC
> item, any dependent items (such as index builds) might get
> dispatched to some other worker, which had better be able to
> see the results of the first worker's step. So at least in
> this implementation, we disable the multi-command-per-COMMIT
> behavior during the parallel part of the restore. Maybe that
> could be improved in future, but it seems like it'd add a
> lot more complexity, and it wouldn't make life any better for
> pg_upgrade (which doesn't use parallel pg_restore, and seems
> unlikely to want to in future).

I was not able to find the email thread which details why we are not using parallel
pg_restore for pg_upgrade. IMHO most customers will have a single large database, and
not using parallel restore will make pg_upgrade slow.

I am attaching a patch which enables parallel pg_restore for the DATA and POST-DATA
parts of the dump. It will push down the --jobs value to pg_restore and will restore
the databases sequentially.

Benchmarks

{5 million LOs 1 large DB}
Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  17.51s user 65.80s system 35% cpu 3:56.64 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  17.51s user 65.85s system 34% cpu 3:58.39 total

HEAD
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  53.95s user 82.44s system 41% cpu 5:25.23 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  54.94s user 81.26s system 41% cpu 5:24.86 total

Fix with --jobs propagation to pg_restore {on top of v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  29.12s user 69.85s system 275% cpu 35.930 total


Although parallel restore does show a small regression in the ideal case of
pg_upgrade --jobs.


Multiple DBs {4 DBs each having 2 million LOs}

Fix with --jobs scheduling
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  31.80s user 109.52s system 120% cpu 1:57.35 total

Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir  30.88s user 110.05s system 135% cpu 1:43.97 total


Regards
Sachin



v9-005-parallel_pg_restore.patch
Description: v9-005-parallel_pg_restore.patch


RE: Initial Schema Sync for Logical Replication

2023-08-31 Thread Kumar, Sachin
Hi everyone, based on an internal discussion with Masahiko, I have implemented
concurrent DDL support for the initial schema sync.

Concurrent Patch workflow

1. When the TableSync worker creates a replication slot, it will save the slot LSN
into pg_subscription_rel with the SUBREL_SYNC_SCHEMA_DATA_SYNC state, and it will
wait for its state to become SUBREL_STATE_DATASYNC.

2. The applier process will apply DDLs up to the tablesync LSN, and then it will
change the pg_subscription_rel state to SUBREL_STATE_DATASYNC.

3. TableSync will continue applying pending DMLs/DDLs until it catches up.

This patch needs DDL replication to apply concurrent DDLs; I have cherry-picked this
DDL patch [0].

Issues
1) Needs testing for concurrent DDLs; I am not sure how to make the tablesync process
wait so that concurrent DDLs can be issued on the publisher.
2) In my testing, a created table does not appear on the same connection on the
subscriber; I have to reconnect to see the table.
3) Maybe use different chars for SUBREL_SYNC_SCHEMA_DATA_INIT and
SUBREL_SYNC_SCHEMA_DATA_SYNC; currently they are 'x' and 'y'.
4) I need to add SUBREL_SYNC_SCHEMA_DATA_INIT and SUBREL_SYNC_SCHEMA_DATA_SYNC to
pg_subscription_rel_d.h to make it compile successfully.
5) It only implements concurrent ALTER as of now.

[0] = 
https://www.postgresql.org/message-id/OS0PR01MB57163E6487EFF7378CB8E17C9438A%40OS0PR01MB5716.jpnprd01.prod.outlook.com


ConcurrentDDL.patch
Description: ConcurrentDDL.patch


RE: Initial Schema Sync for Logical Replication

2023-07-10 Thread Kumar, Sachin


> From: Amit Kapila 
> On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada
>  wrote:
> >
> > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith 
> wrote:
> > >
> > > Hi,
> > >
> > > Below are my review comments for the PoC patch 0001.
> > >
> > > In addition,  the patch needed rebasing, and, after I rebased it
> > > locally in my private environment there were still test failures:
> > > a) The 'make check' tests fail but only in a minor way due to
> > > changes colname
> > > b) the subscription TAP test did not work at all for me -- many errors.
> >
> > Thank you for reviewing the patch.
> >
> > While updating the patch, I realized that the current approach won't
> > work well or at least has the problem with partition tables. If a
> > publication has a partitioned table with publish_via_root = false, the
> > subscriber launches tablesync workers for its partitions so that each
> > tablesync worker copies data of each partition. Similarly, if it has a
> > partition table with publish_via_root = true, the subscriber launches
> > a tablesync worker for the parent table. With the current design,
> > since the tablesync worker is responsible for both schema and data
> > synchronization for the target table, it won't be possible to
> > synchronize both the parent table's schema and partitions' schema.
> >
> 
> I think one possibility to make this design work is that when publish_via_root
> is false, then we assume that subscriber already has parent table and then
> the individual tablesync workers can sync the schema of partitions and their
> data.

Since publish_via_partition_root is false by default, users would have to create the
parent table by themselves, which I think is not a good user experience.

> And when publish_via_root is true, then the table sync worker is
> responsible to sync parent and child tables along with data. Do you think
> such a mechanism can address the partition table related cases?
> 


RE: Initial Schema Sync for Logical Replication

2023-07-07 Thread Kumar, Sachin
> From: Masahiko Sawada 
> So I've implemented a different approach; doing schema synchronization at a
> CREATE SUBSCRIPTION time. The backend executing CREATE SUBSCRIPTION
> uses pg_dump and restores the table schemas including both partitioned tables
> and their partitions regardless of publish_via_partition_root option, and then
> creates pg_subscription_rel entries for tables while respecting
> publish_via_partition_root option.
> 
> There is a window between table creations and the tablesync workers starting 
> to
> process the tables. If DDLs are executed in this window, the tablesync worker
> might fail because the table schema might have already been changed. We need
> to mention this note in the documentation. BTW, I think we will be able to get
> rid of this downside if we support DDL replication. DDLs executed in the 
> window
> are applied by the apply worker and it takes over the data copy to the 
> tablesync
> worker at a certain LSN.

I don’t think we will be able to get rid of this window even with DDL replication.
There are some issues:
1. Even with the tablesync worker taking over at a certain LSN, the publisher can make
more changes until tablesync acquires a lock on the publisher table via the table copy.
2. How will we make sure that the applier worker has caught up with all the changes
from the publisher before it starts the tableSync worker? It can lag behind the
publisher.

I think the easiest option would be to just recreate the table; this way we don’t have
to worry about complex race conditions. tablesync already creates a slot for the data
copy, and we can use the same slot for getting an up-to-date table definition. Dropping
the table won't be very expensive since there won't be any data in it. The apply worker
will skip all the DDLs/DMLs till the table is synced.

Although for partitioned tables we will be able to keep up with published table schema
changes only when publish_via_partition_root is true.

Regards
Sachin
Amazon Web Services: https://aws.amazon.com