On 11/18/23 19:12, Andres Freund wrote:
> Hi,
>
> On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
>>> I guess it's not really feasible to just increase the lock level here
>>> though :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly
>>> using AEL would perhaps lead to new deadlocks and such? But it also
>>> seems quite wrong.
>>>
>>
>> If this really is about the lock being too weak, then I don't see why
>> it would be wrong?
>
> Sorry, that was badly formulated. The wrong bit is the use of
> ShareUpdateExclusiveLock.
>

Ah, you meant the current lock mode seems wrong, not that changing the
locks seems wrong. Yeah, true.

>
>> If it's required for correctness, it's not really wrong, IMO. Sure,
>> stronger locks are not great ...
>>
>> I'm not sure about the risk of deadlocks. If you do
>>
>>    ALTER PUBLICATION ... ADD TABLE
>>
>> it's not holding many other locks. It essentially gets just a lock on
>> the pg_publication catalog, and then one on the publication row.
>> That's it.
>>
>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>> we're making it conflict with RowExclusive. Which is just DML, and I
>> think we need to do that.
>
> From what I can tell it needs to be an AccessExclusiveLock. Completely
> independent of logical decoding. The way the cache stays coherent is
> catalog modifications conflicting with anything that builds cache
> entries. We have a few cases where we do use lower-level locks, but for
> those we have explicit analysis for why that's OK (see e.g.
> reloptions.c), or we block until nobody could have an old view of the
> catalog (the various CONCURRENTLY operations).
>

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (I still don't understand why ...). But you're
probably right that there may be other cases where SRE would not be
sufficient - I certainly can't prove it'd be safe.
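For concreteness, the SRE variant would presumably be a one-line change,
assuming the relevant spot is where OpenTableList() in publicationcmds.c
opens the tables being added to the publication (a sketch of the idea
only, untested, not a reviewed patch):

    -    rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
    +    rel = table_openrv(t->relation, ShareRowExclusiveLock);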
>
>> So maybe that's fine? For me, a detected deadlock is better than
>> silently missing some of the data.
>
> That certainly is true.
>
>
>>> We could brute force this in the logical decoding infrastructure, by
>>> distributing invalidations from catalog-modifying transactions to all
>>> concurrent in-progress transactions (like already done for the
>>> historic catalog snapshot, cf. SnapBuildDistributeNewCatalogSnapshot()).
>>> But I think that'd be a fairly significant increase in overhead.
>>>
>>
>> I have no idea what the overhead would be - perhaps not too bad,
>> considering catalog changes are not too common (though I'm sure there
>> are extreme cases). And maybe we could even restrict this to
>> "interesting" catalogs only, or something like that? (However, I hate
>> those weird differences in behavior - they can easily lead to bugs.)
>>
>> But it feels more like a band-aid than actually fixing the issue.
>
> Agreed.
>

... and it would not fix the other places outside logical decoding. (See
the PS for a rough sketch of what the distribution might look like.)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
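PS: To make the "distribute invalidations" option above more concrete,
here is a rough, untested sketch modeled on
SnapBuildDistributeNewCatalogSnapshot() - the function name and placement
are made up, and it sidesteps the overhead question entirely:

    /*
     * Sketch only: queue the invalidation messages of a catalog-modifying
     * transaction into all other transactions the reorder buffer knows
     * about, mirroring how SnapBuildDistributeNewCatalogSnapshot()
     * distributes the new historic snapshot.
     */
    static void
    SnapBuildDistributeInvalidations(SnapBuild *builder, XLogRecPtr lsn,
                                     TransactionId xid, Size nmsgs,
                                     SharedInvalidationMessage *msgs)
    {
        dlist_iter  txn_i;

        dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn;

            txn = dlist_container(ReorderBufferTXN, node, txn_i.cur);

            /* skip the transaction the invalidations came from */
            if (txn->xid == xid)
                continue;

            /*
             * Queued invalidations are executed as the transaction's
             * changes are replayed, so its historic catalog lookups stop
             * using stale cache entries.
             */
            ReorderBufferAddInvalidations(builder->reorder, txn->xid,
                                          lsn, nmsgs, msgs);
        }
    }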