Hi all,

*My top-level query is*: I'm using logical replication under pg 9.6 to do a
kind of change data capture and I'm seeing occasional extended periods of
significant lag. I'm not sure what conceptual model I'm missing in order to
understand why this happens.

*The details:*

I'm running Postgres 9.6.19 from the postgres debian apt repos & the
wal2json extension.

I have a custom client application which essentially executes
pg_logical_slot_get_changes() for some manually-created logical replication
slot on a loop.

I'm monitoring replication lag defined as pg_current_xlog_location() -
confirmed_flush_lsn for that slot.

What I'm observing is - very occasionally - an extended period (hours long)
wherein:

* The normal database write load continues or slightly increases
* calls to pg_logical_slot_get_changes() return no rows and
confirmed_flush_lsn doesn't move
* the duration of a call to pg_logical_slot_get_changes() rises linearly
over time

I understand from the docs and research that this is usually caused by a
long-running write transaction, but I notice I'm still confused.

* I'm not 100% sure - I'm still confirming - but I'm fairly confident that
I don't have any egregiously long write transactions (at least on that
scale of hours). Are there any other common scenarios that can result in a
similar 'blockage'? e.g some categories of long read-only transactions, or
advisory locks, or other kinds of database activity like a vacuum?

* Conversely, from experimenting, it seems as if not all long-running write
transactions cause pg_logical_slot_get_changes() to be unable to advance.
In fact, I'm not able so far to produce a minimal set of simple queries
which show that behaviour.

Given the following sequence of queries I see changes emitted:

-- session 1
begin;
insert into foo(bar,baz) values (1, 1);

-- session 2
begin;
insert into foo(bar,baz) values (2,2);
commit;

-- session 3
 select data from
pg_logical_slot_get_changes('example-slot', NULL, NULL, 'format-version',
'2');

Session 3 is able to return the row from session 2 despite session 1's
ongoing transaction starting first and not yet committing. Can you help me
understand (or better yet point me to a resource which explains) the
underlying logic defining how logical decoding does in fact get blocked by
in-flight transactions?

Thanks,
Patrick

Reply via email to