On 19/08/16 09:34, Konstantin Knizhnik wrote:

We are using logical decoding in multimaster, and we have run into a problem
where inconsistent transactions are sent to the replica.
Briefly, multimaster uses logical decoding in this way:
1. Each multimaster node is connected to every other node through a logical
decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario, each replication channel is used to replicate only
those transactions which originated at the source node.
We use the origin mechanism to skip "foreign" transactions (see the sketch
below).
When an offline cluster node rejoins the multimaster, we need to recover this
node to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we use only one
replication channel to receive all transactions (both our own and foreign
ones). Only in this case can we guarantee a consistent order of applying
transactions at the recovered node.
After recovery ends, we need to recreate the replication slots with all other
cluster nodes (because we have already replayed transactions from those
nodes).
To restart logical decoding we first drop the existing slot, then create a
new one, and then start logical replication from WAL position 0/0 (invalid
LSN).
In this case decoding should restart from the last consistent point.
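
For reference, the filtering in point 2 relies on the standard
filter_by_origin_cb callback of the logical decoding output plugin. A minimal
sketch of what such a callback looks like (illustrative names only, not our
actual multimaster code):

#include "postgres.h"
#include "fmgr.h"
#include "replication/logical.h"
#include "replication/origin.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/*
 * Returning true tells the decoder to skip changes coming from that origin,
 * so on the normal per-pair channels only locally originated transactions
 * (origin_id == InvalidRepOriginId) are streamed.  During recovery the
 * channel does not filter, so foreign transactions are forwarded as well.
 */
static bool
mm_filter_by_origin_cb(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
    return origin_id != InvalidRepOriginId;
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->filter_by_origin_cb = mm_filter_by_origin_cb;
    /* begin_cb, change_cb, commit_cb, ... omitted */
}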


I don't think this will work correctly; there will be a gap between the point where the new slot starts to decode and the drop of the old one, because the new slot first needs to build a snapshot.

Do I understand correctly that you are not using replication origins?

The problem is that for some reason the consistent point is not so
consistent, and we get partially decoded transactions.
I.e. a transaction body consists of two UPDATEs, but the reorder buffer
extracts only one (the last) update and sends this truncated transaction to
the destination, causing a consistency violation at the replica.  I started
investigating the logical decoding code and found several things which I do
not understand.

Never seen this happen. Do you have more details about what exactly is happening?


Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica, and the replica confirms that
flush_lsn=300.
If we now want to restart logical decoding, we cannot start at a position
lower than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *    The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *    from the slot's confirmed_flush; otherwise, start from the specified
 *    location (but move it forwards to confirmed_flush if it's older than
 *    that, see below).
 *

else if (start_lsn < slot->data.confirmed_flush)
{
    /*
     * It might seem like we should error out in this case, but it's
     * pretty common for a client to acknowledge a LSN it doesn't have to
     * do anything for, and thus didn't store persistently, because the
     * xlog records didn't result in anything relevant for logical
     * decoding. Clients have to be able to do that to support synchronous
     * replication.
     */

So does it mean that we have no chance to restore T1?
What is worse, if there are valid T1 transaction records with lsn >= 300,
then we can partially decode T1 and send this truncated T1' to the replica.
Have I missed something here?

The decoding starts from the restart_lsn of the slot; start_lsn is only used for skipping transactions that committed before it.
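
Roughly (paraphrasing snapbuild.c and decode.c from memory rather than
quoting them, so take this as a sketch): the transaction is always
re-assembled starting from restart_lsn, and only at commit time is it either
replayed to the output plugin or forgotten.

/* Paraphrased sketch, not a verbatim quote of the sources. */

/*
 * snapbuild.c: skip a transaction whose commit lies before the point the
 * client asked to stream from (possibly forwarded to confirmed_flush, as in
 * the excerpt you quoted).
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr < builder->start_decoding_at;
}

/*
 * decode.c, DecodeCommit(): everything was decoded from restart_lsn on; the
 * decision whether to send it happens only here (simplified, other
 * conditions omitted).
 */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr))
{
    ReorderBufferForget(ctx->reorder, xid, buf->origptr);
    return;
}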

Is there any alternative way to "seek" the slot to the proper position
without actually fetching data from it or recreating the slot?

You can seek forward just fine; just specify the start position in the START_REPLICATION command.
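
A rough libpq sketch of what I mean (slot name, database and LSN are made up,
error handling trimmed):

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
    /* "replication=database" opens a logical walsender connection */
    PGconn     *conn = PQconnectdb("dbname=postgres replication=database");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /*
     * Keep the existing slot and simply ask to stream from an explicit LSN
     * instead of 0/0; the server will still forward the position to
     * confirmed_flush if what you ask for is older than that.
     */
    res = PQexec(conn, "START_REPLICATION SLOT \"mm_slot\" LOGICAL 0/12D4E30");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
        fprintf(stderr, "START_REPLICATION failed: %s", PQerrorMessage(conn));

    /* ... consume the stream with PQgetCopyData() and send feedback ... */

    PQclear(res);
    PQfinish(conn);
    return 0;
}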

Is there any mechanism in the xlog which can enforce consistent decoding of a
transaction (so that no transaction records are missed)?
Maybe I missed something, but I didn't find any "record_number" or anything
else which can identify the first record of a transaction.

As I mentioned above, what you probably want to do is use replication origins. When you use those, you get the origin info when decoding a transaction, which you can then send downstream so that the downstream can update its idea of where it is for that origin. This is especially useful for the transaction forwarding you are doing (see the BDR and/or pglogical code for examples of that).
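
To make that concrete, the origin info is already present on the
ReorderBufferTXN your output plugin gets, so forwarding it is cheap. A very
rough plugin-side sketch (names are illustrative, not taken from BDR or
pglogical); the downstream would then advance its own replication origin
accordingly, e.g. with pg_replication_origin_advance():

#include "postgres.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/origin.h"
#include "replication/reorderbuffer.h"

/*
 * Illustrative begin callback: pass the origin of the decoded transaction
 * along with the BEGIN message.  origin_id/origin_lsn are filled in by the
 * decoder when the upstream transaction carried replication origin info;
 * locally originated transactions have InvalidRepOriginId.
 */
static void
mm_begin_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "BEGIN origin_id=%u origin_lsn=%X/%X",
                     txn->origin_id,
                     (uint32) (txn->origin_lsn >> 32),
                     (uint32) txn->origin_lsn);
    OutputPluginWrite(ctx, true);
}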

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

