lhotari commented on issue #24380: URL: https://github.com/apache/pulsar/issues/24380#issuecomment-2943211477
> The only time it eventually advances is after the primary cluster stops receiving new messages to that topic, and then, after a delay, the markDeletePosition is finally updated in the backup cluster.

It seems that all suitable snapshots have been pushed out of the ReplicatedSubscriptionSnapshotCache (max size configurable with `replicatedSubscriptionsSnapshotMaxCachedPerSubscription`, default `10`) before the mark delete position advances. That's what PR https://github.com/apache/pulsar/pull/24300 attempts to address.

These are the broker level configuration values for replicated subscriptions: https://github.com/apache/pulsar/blob/a1a2b363cfaa1bbc38933a742484a70a0a56e761/conf/broker.conf#L672-L679

The implementation level details of replicated subscription snapshots are described in [PIP-33: Replicated subscriptions](https://github.com/apache/pulsar/wiki/PIP-33%3A-Replicated-subscriptions) and there's a sequence diagram here: https://gist.github.com/lhotari/96fda511a70d7de93744d868b4472b92.

What most likely happens here is that snapshots are created every 1 second by default and the snapshot cache keeps only the latest 10 of them. When the mark delete position advances on the primary cluster, it doesn't find a suitable snapshot in the cache, and as a result the mark delete position state isn't updated to the backup cluster. Eventually, when consumers catch up on the primary cluster, the mark delete position reaches a position that a cached snapshot covers, which then allows updating the mark delete position state to the backup cluster.

PR #24300 by @liudezhi2098 attempts to address the above issue by having a better way of evicting snapshots from the cache. Instead of pushing out the oldest snapshots, it tries to keep snapshots in the cache so that when consumers make progress, there would be a suitable snapshot in the cache.

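
To illustrate the eviction problem described above, here is a minimal sketch in Java. It is not the broker's actual `ReplicatedSubscriptionSnapshotCache` code: the class `SnapshotCacheSketch`, the use of plain `long` values as positions, and the publish rate and consumer lag in `main` are all assumptions made for illustration. The sketch only models a cache that holds the latest N snapshots and drops the oldest one when it is full.

```java
import java.util.TreeSet;

// Hypothetical, simplified model of a per-subscription snapshot cache.
// Positions are plain longs here; the real broker uses ledger/entry positions.
class SnapshotCacheSketch {
    private final TreeSet<Long> snapshots = new TreeSet<>();
    private final int maxSnapshots;

    SnapshotCacheSketch(int maxSnapshots) {
        this.maxSnapshots = maxSnapshots;
    }

    // Add a snapshot taken at the given position, evicting the oldest when full.
    void addNewSnapshot(long position) {
        snapshots.add(position);
        while (snapshots.size() > maxSnapshots) {
            snapshots.pollFirst();
        }
    }

    // Return the highest cached snapshot position that is <= markDeletePosition,
    // removing it and all older snapshots; return null if none qualifies.
    Long advancedMarkDeletePosition(long markDeletePosition) {
        Long best = null;
        while (!snapshots.isEmpty() && snapshots.first() <= markDeletePosition) {
            best = snapshots.pollFirst();
        }
        return best;
    }

    public static void main(String[] args) {
        SnapshotCacheSketch cache = new SnapshotCacheSketch(10); // assumed default of 10
        final long ratePerSecond = 100; // assumed publish rate
        final long lagSeconds = 20;     // assumed consumer lag, longer than the cache window
        long published = 0;
        int usableWhileLagging = 0;

        for (int second = 1; second <= 60; second++) {
            published += ratePerSecond;
            cache.addNewSnapshot(published); // one snapshot per second
            long markDelete = Math.max(0, published - lagSeconds * ratePerSecond);
            if (cache.advancedMarkDeletePosition(markDelete) != null) {
                usableWhileLagging++;
            }
        }
        System.out.println("usable snapshots while lagging: " + usableWhileLagging);

        // Producing stops and the consumer catches up to the last published position.
        System.out.println("snapshot found after catch-up: "
                + cache.advancedMarkDeletePosition(published));
    }
}
```

In this model, as long as the consumer lags by more than the window covered by the cached snapshots, every lookup comes back empty, and a snapshot is only found once the consumer has caught up. That matches the behavior reported in the issue, under the assumptions stated above.
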
There have also been reports of problems with replicated subscriptions where the mark delete position doesn't get updated when producing has paused and only gets updated after new messages have been published and consumed. That's a different issue and most likely caused by the current logic where the snapshot is omitted if the most recent snapshot was completed after the most recent message was published. The condition is not correct since it should be based on the starting time of the most recent snapshot that was completed successfully. (In addition, the fix is more complicated since it would need to ignore marker messages when checking when the most recent message was published. Without that it would end up in a loop that doesn't stop.)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
