Re: [I] [Bug] Inconsistent markDeletePosition replication for geo-replicated shared subscriptions with delayed messages [pulsar]

via GitHub Thu, 05 Jun 2025 01:20:03 -0700


lhotari commented on issue #24380:
URL: https://github.com/apache/pulsar/issues/24380#issuecomment-2943211477


   > The only time it eventually advances is after the primary cluster stops 
receiving new messages to that topic — and then, after a delay, the 
markDeletePosition is finally updated in the backup cluster.
   
   It seems that all suitable snapshots have been pushed out of the 
ReplicatedSubscriptionSnapshotCache (max size configurable with 
`replicatedSubscriptionsSnapshotMaxCachedPerSubscription`, default `10`) before 
the mark delete position advances. That's what PR 
https://github.com/apache/pulsar/pull/24300 attempts to address.
   
   These are the broker level configuration values for replicated subscriptions:
   
https://github.com/apache/pulsar/blob/a1a2b363cfaa1bbc38933a742484a70a0a56e761/conf/broker.conf#L672-L679
   
   The implementation level details of replicated subscription snapshots are 
described in [PIP-33: Replicated 
subscriptions](https://github.com/apache/pulsar/wiki/PIP-33%3A-Replicated-subscriptions)
 and there's a sequence diagram here: 
https://gist.github.com/lhotari/96fda511a70d7de93744d868b4472b92.
   
   What most likely happens here is that snapshots are created every 1 
second by default and the snapshot cache keeps only the latest 10 of them. 
When the mark delete position advances on the primary cluster, no suitable 
snapshot (one taken at or before the new mark delete position) is left in the 
cache, so the mark delete position state isn't updated to the backup cluster.
   Eventually, when consumers catch up on the primary cluster, the mark delete 
position reaches a snapshot that is still in the cache, which then allows 
updating the mark delete position state to the backup cluster.
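   
   The eviction behavior described above can be sketched with a toy model. This 
is not the broker code: `SnapshotCache`, `simulate`, the integer positions and 
the "one snapshot per 100 entries" rate are assumptions made up for 
illustration. Only the "keep the latest 10, drop the oldest" rule comes from 
the description above.

```python
class SnapshotCache:
    """Toy model: a snapshot is represented only by its local position (an int)."""

    def __init__(self, max_cached=10):
        self.max_cached = max_cached
        self.positions = []  # snapshot positions, ascending

    def add_snapshot(self, position):
        # New snapshots always have the highest position; the oldest is evicted.
        self.positions.append(position)
        while len(self.positions) > self.max_cached:
            self.positions.pop(0)

    def advanced_mark_delete(self, mark_delete_position):
        # Return the newest cached snapshot at or before the mark delete
        # position (consuming it and all older ones), or None if there is none.
        found = None
        while self.positions and self.positions[0] <= mark_delete_position:
            found = self.positions.pop(0)
        return found


def simulate(lag, snapshots=50, entries_per_snapshot=100, max_cached=10):
    """Take `snapshots` snapshots while the mark delete position trails the
    head of the topic by `lag` entries. Returns the cache and the list of
    snapshot positions that could have been replicated."""
    cache = SnapshotCache(max_cached)
    replicated = []
    for i in range(1, snapshots + 1):
        head = i * entries_per_snapshot
        cache.add_snapshot(head)
        hit = cache.advanced_mark_delete(head - lag)
        if hit is not None:
            replicated.append(hit)
    return cache, replicated
```

   In this model, `simulate(lag=500)` finds a snapshot on most iterations because 
the consumer is only 5 snapshots behind. `simulate(lag=2000)` never finds one 
while publishing continues, since the consumer trails by 20 snapshots and the 
cache holds 10. A match only appears once the mark delete position catches up 
to the head.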
   
   PR #24300 by @liudezhi2098 attempts to address the above issue by having a 
better way of evicting snapshots from the cache. Instead of pushing out the 
oldest snapshots, it tries to retain snapshots so that a suitable one is still 
in the cache when consumers make progress. 
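   
   To illustrate the general idea of "don't always evict the oldest", here is a 
minimal sketch of one possible policy. It is explicitly not the algorithm 
from PR #24300; the class name, the integer positions and the "drop the 
interior snapshot that leaves the smallest gap" rule are all assumptions 
chosen for illustration (and it assumes `max_cached >= 2`).

```python
class SpreadSnapshotCache:
    """Hypothetical policy: keep the oldest and newest snapshots and thin out
    the middle, so old positions stay reachable for slow consumers."""

    def __init__(self, max_cached=10):
        self.max_cached = max_cached
        self.positions = []  # snapshot positions, ascending

    def add_snapshot(self, position):
        self.positions.append(position)
        p = self.positions
        if len(p) > self.max_cached and len(p) >= 3:
            # Evict the interior snapshot whose neighbours are closest together,
            # i.e. the one whose removal loses the least coverage.
            victim = min(range(1, len(p) - 1), key=lambda i: p[i + 1] - p[i - 1])
            del p[victim]

    def advanced_mark_delete(self, mark_delete_position):
        # Newest cached snapshot at or before the mark delete position, or None.
        found = None
        while self.positions and self.positions[0] <= mark_delete_position:
            found = self.positions.pop(0)
        return found


def replicated_positions(cache, lag, snapshots=50, entries_per_snapshot=100):
    """Feed snapshots into `cache` while the mark delete position trails the
    head by `lag` entries; return the snapshot positions that were matched."""
    hits = []
    for i in range(1, snapshots + 1):
        head = i * entries_per_snapshot
        cache.add_snapshot(head)
        hit = cache.advanced_mark_delete(head - lag)
        if hit is not None:
            hits.append(hit)
    return hits
```

   With a lag of 2000 entries (20 snapshots at the assumed rate), this cache 
still matches a snapshot once the mark delete position reaches the oldest 
retained one, at the cost of coarser granularity between updates.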
   
   There have also been reports of problems with replicated subscriptions where 
the mark delete position doesn't get updated when producing has paused and only 
gets updated after new messages have been published and consumed. That's a 
different issue, most likely caused by the current logic where a new snapshot 
is skipped if the most recent snapshot was completed after the most recent 
message was published. The condition is not correct since it should be based on 
the starting time of the most recent snapshot that was completed successfully. 
(In addition, the fix is more complicated since it would need to ignore marker 
messages when checking when the most recent message was published. Without 
that, it would end up in a loop that doesn't stop.)
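
   The difference between the two conditions can be shown with a small sketch. 
The function and parameter names are hypothetical and the timestamps are 
arbitrary numbers; this only restates the reasoning above and is not the 
broker implementation.

```python
def skip_snapshot_current(last_publish_time, last_snapshot_completed_time):
    """Condition as described above: skip a new snapshot if the most recent
    snapshot completed after the most recent message was published."""
    return last_snapshot_completed_time > last_publish_time


def skip_snapshot_proposed(last_data_publish_time, last_snapshot_start_time):
    """Suggested condition: skip only if the most recent successfully completed
    snapshot *started* after the most recent message was published.
    `last_data_publish_time` is assumed to ignore marker messages; otherwise
    every snapshot's own markers would trigger yet another snapshot."""
    return last_snapshot_start_time > last_data_publish_time


# Timeline (arbitrary units): a snapshot starts at t=10, a message is
# published at t=11, and the snapshot completes at t=12.
snapshot_start, publish, snapshot_completed = 10, 11, 12

# Current condition skips the next snapshot even though the message at t=11
# was published after the last snapshot started and may not be covered by it.
current = skip_snapshot_current(publish, snapshot_completed)

# Proposed condition takes another snapshot, so the message gets covered.
proposed = skip_snapshot_proposed(publish, snapshot_start)
```

   In this timeline `current` is true and `proposed` is false: with the current 
condition no further snapshot is taken until new messages arrive, which 
matches the reported symptom.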


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
