Hi, I've previously started two threads about replication identifiers. Check http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de and http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de .
They've also been discussed in the course of another thread: http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

As the topic has garnered some heat and confusion, I thought it'd be worthwhile to start afresh with an explanation of why I think they're useful. I don't really want to discuss implementation specifics for now, but rather (details of) the concept. Once we've hashed those out, I'll adapt the existing patch to match them.

There are three primary use cases for replication identifiers:

1) The ability to monitor how far replication has progressed, in a crashsafe manner, so it can restart at the right point after errors/crashes.

2) Efficiently identify the origin of individual changes and transactions. In multimaster and some cascading scenarios it is necessary to do so to avoid sending out replayed changes again.

3) Allow replayed changes to be efficiently filtered out of logical decoding. It's currently possible to filter changes from inside the output plugin, but it's much more efficient to filter them out before decoding.

== Logical Decoding Background ==

To understand the need for 1), it's important to roughly understand how a logical decoding walsender streams changes and handles feedback from the receiving side.

A walsender performing logical decoding *continuously* sends out transactions. As long as there are new local changes (i.e. unprocessed WAL) and the network buffers aren't full, it will send changes *without* waiting for the client. Anything else would lead to horrible latency.

Because it sends data without waiting for the client to have processed it, the sender obviously can't remove resources that would be needed to stream those changes out again - the client or the network connection could crash, after all. To let the sender know when it can remove resources, the receiver regularly sends back 'feedback' messages acknowledging up to where changes have been safely received.
Whenever such a feedback message arrives, the sender can release the resources that aren't needed to decode the changes below that horizon.

When the receiver asks the server to stream changes out, it tells the sender at which LSN it should start sending changes. All *transactions* that *commit* after that LSN are sent out. Possibly again.

== Crashsafe apply ==

Based on those explanations, when building a logical replication solution on top of logical decoding, one must remember the latest *remote* LSN that has already been replayed, so that when the apply process or the whole database crashes, it is possible to ask for all changes since the last transaction that was successfully applied.

The trivial solution here is to have a table (remote_node, last_replayed_lsn) and update it for every replayed transaction. Unfortunately that doesn't perform very well, because that table quickly gets heavily bloated. It's also hard to avoid page-level contention when replaying transactions from multiple remote nodes. Additionally, these changes have to be filtered out when replicating in a cascading fashion. To do this more efficiently, there needs to be a crashsafe way to associate the latest successfully replayed remote transaction with the local commit that replayed it.

== Identify the origin of changes ==

Say you're building a replication solution that allows inserts into the same table on two nodes. Ignoring conflict resolution and similar fun, one needs to prevent the same change from being replayed over and over. In logical replication, the changes to the heap have to be WAL logged, and thus the *replay* of changes from a remote node produces WAL, which will then be decoded again.

To avoid that, it's very useful to tag individual changes/transactions with their 'origin'. I.e. mark changes that have been directly triggered by the user sending SQL as originating 'locally', and changes originating from replaying another node's changes as originating somewhere else.
If that origin is exposed to logical decoding output plugins, they can easily check whether to stream out the changes/transactions or not. It is possible to do this by adding extra columns to every table and storing the origin of each row there, but that a) permanently needs storage and b) makes things much more invasive.

== Proposed solution ==

These two fundamental problems happen to have overlapping requirements.

A rather efficient solution for 1) is to attach the 'origin node' and the remote commit LSN to every local commit record produced by replay. That allows for a shared-memory "table" (remote_node, local_lsn, remote_lsn). During replay that table is kept in sync with transaction commits. If it is updated within the transaction commit's critical section, it's guaranteed to be correct, even though transactions can abort due to constraint violations and such. When the cluster crashes, the table can be rebuilt during crash recovery by updating its values whenever a commit record is read. The primary complexity here is that the identification of the 'origin' node should be as small as possible, to keep the WAL volume down.

Similarly, the problem of identifying the origin of changes during decoding can be solved nicely by adding the origin node to every change/transaction. At first it might seem sufficient to do so at the transaction level, but for cascading scenarios it's very useful to be able to have changes from different source transactions combined into a larger one. Again, the primary problem here is how to efficiently identify origin nodes.

== Replication Identifiers ==

The above explains the need for node identifiers that are as small as possible. Two years back I'd suggested that we rely on the user to manually assign 16bit ids to individual nodes. Not very surprisingly, that was shot down, because a) 16bit numbers are not descriptive and b) a per-node identifier is problematic because it prohibits replication inside the same cluster.
So, what I've proposed since is to have two different forms of identifiers: a long one, that's as descriptive as $replication_solution wants, and a short one (16bit in my case) that's *only meaningful within one node*. The long, user-facing identifier is locally mapped to the short one.

In the first version I proposed, these long identifiers had a fixed form, including the system identifier, timeline id, database id, and a freeform name. That wasn't well received, and I agree that it's too narrow. I think it should be freeform text of arbitrary length.

Note that it depends on the replication solution whether these external identifiers need to be coordinated across systems or not. I think it's *good* if we don't propose a solution for that - different replication solutions will have different requirements.

What I've previously suggested (and what works well in BDR) is to add the internal id to the XLogRecord struct. There are 2 free bytes of padding that can be used for that purpose.

There's an example of how this can be used from SQL at http://archives.postgresql.org/message-id/20131114172632.ge7...@alap2.anarazel.de That version is built on top of commit timestamps, but that only shows because pg_replication_identifier_setup_tx_origin() allows setting the source transaction's timestamp.

With that (far too long) explanation, is it clearer what I think replication identifiers are for? What are your thoughts?

Greetings,

Andres Freund

-- 
Andres Freund	                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers