On 9/24/21 8:09 AM, Amit Kapila wrote:
> On Thu, Sep 23, 2021 at 6:03 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>>
>> 13) turning update into insert
>>
>> I agree with Ajin Cherian [4] that looking at just the old or new row
>> for updates is not the right solution, because each option will "break"
>> the replica in some case. So I think the goal of "keeping the replica
>> in sync" is the right perspective, and converting the update to an
>> insert/delete if needed seems appropriate.
>>
>> This seems somewhat similar to what pglogical does, because that may
>> also convert updates (although only to inserts, IIRC) when handling
>> replication conflicts. The difference is that pglogical does all this
>> on the subscriber, while this makes the decision on the publisher.
>>
>> I wonder if this might have some negative consequences, or whether
>> "moving" this downstream would be useful for other purposes in the
>> future (e.g. it might be reused for handling other conflicts).
>
> Apart from additional traffic, I am not sure how we will handle all
> the conditions on the subscriber; say the new row doesn't match, how
> will the subscriber know about this unless we pass the row_filter or
> some additional information along with the tuple? Previously, I did
> some research and shared in one of the emails above that IBM's
> InfoSphere Data Replication [1] performs filtering in this way, which
> also suggests that we won't be off track here.
>

I'm certainly not suggesting that what we're doing is wrong. Given the
design of built-in logical replication it makes sense to do it this way;
I was just thinking aloud about what we might want to do in the future
(e.g. pglogical uses this to deal with conflicts between multiple
sources, and so on).

>> 15) pgoutput_row_filter initializing filter
>>
>> I'm not sure I understand why the filter initialization gets moved out
>> of get_rel_sync_entry. Presumably, most of what replication does is
>> replicate rows, so I see little point in not initializing this along
>> with the rest of the rel_sync_entry.
>
> Sorry, IIRC, this was suggested by me, and I thought it best to do any
> expensive computation the first time it is required. I have shared a
> few cases, like in [2], where it would lead to additional cost without
> any gain. Unless I am missing something, I don't see any downside to
> doing it in a delayed fashion.
>

Not sure, but the arguments presented there seem a bit wonky ...

Yes, the work would be wasted if we discard the cached data without
using it (it might happen for truncate, I'm not sure). But how likely is
it that such operations happen *in isolation*? I'd bet the workload is
almost never just a stream of truncates - there are always some
operations in between that would actually use this.

Similarly for the errors - IIRC hitting an error means the replication
restarts, which is orders of magnitude more expensive than anything we
can save by this delayed evaluation.

I'd keep it simple, for the sake of simplicity of the whole patch.
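
Just so we're arguing about the same thing, here is a toy sketch of the
two variants (eager in get_rel_sync_entry vs. delayed in
pgoutput_row_filter); the struct and function names are illustrative,
not taken from the patch:

#include <stdbool.h>
#include <stddef.h>

typedef struct ExprState ExprState;   /* stands in for the compiled filter */

typedef struct ToyRelSyncEntry
{
    bool        filter_valid;         /* has the filter been built yet? */
    ExprState  *filter;               /* compiled row filter, or NULL */
} ToyRelSyncEntry;

extern ExprState *build_row_filter(void);   /* the "expensive" step */
extern bool eval_row_filter(ExprState *filter, const void *tup);

/* Eager: pay the cost when the cache entry is (re)built, even if no row
 * for this relation ever gets decoded. */
static void
init_filter_eager(ToyRelSyncEntry *entry)
{
    entry->filter = build_row_filter();
    entry->filter_valid = true;
}

/* Delayed: pay the cost only when the first row actually has to be
 * filtered. */
static bool
row_filter_delayed(ToyRelSyncEntry *entry, const void *tup)
{
    if (!entry->filter_valid)
    {
        entry->filter = build_row_filter();
        entry->filter_valid = true;
    }
    return eval_row_filter(entry->filter, tup);
}

The delayed variant only wins if the entry gets built and then
invalidated (or an error aborts decoding) before any row is filtered;
otherwise both variants do exactly the same work, just at different
times.
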
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company