Hi all, *My top-level query is*: I'm using logical replication under pg 9.6 to do a kind of change data capture and I'm seeing occasional extended periods of significant lag. I'm not sure what conceptual model I'm missing in order to understand why this happens.
*The details:* I'm running Postgres 9.6.19 from the postgres debian apt repos & the wal2json extension. I have a custom client application which essentially executes pg_logical_slot_get_changes() for some manually-created logical replication slot on a loop. I'm monitoring replication lag defined as pg_current_xlog_location() - confirmed_flush_lsn for that slot. What I'm observing is - very occasionally - an extended period (hours long) wherein: * The normal database write load continues or slightly increases * calls to pg_logical_slot_get_changes() return no rows and confirmed_flush_lsn doesn't move * the duration of a call to pg_logical_slot_get_changes() rises linearly over time I understand from the docs and research that this is usually caused by a long-running write transaction, but I notice I'm still confused. * I'm not 100% sure - I'm still confirming - but I'm fairly confident that I don't have any egregiously long write transactions (at least on that scale of hours). Are there any other common scenarios that can result in a similar 'blockage'? e.g some categories of long read-only transactions, or advisory locks, or other kinds of database activity like a vacuum? * Conversely, from experimenting, it seems as if not all long-running write transactions cause pg_logical_slot_get_changes() to be unable to advance. In fact, I'm not able so far to produce a minimal set of simple queries which show that behaviour. Given the following sequence of queries I see changes emitted: -- session 1 begin; insert into foo(bar,baz) values (1, 1); -- session 2 begin; insert into foo(bar,baz) values (2,2); commit; -- session 3 select data from pg_logical_slot_get_changes('example-slot', NULL, NULL, 'format-version', '2'); Session 3 is able to return the row from session 2 despite session 1's ongoing transaction starting first and not yet committing. Can you help me understand (or better yet point me to a resource which explains) the underlying logic defining how logical decoding does in fact get blocked by in-flight transactions? Thanks, Patrick