tarmacmonsterg opened a new issue, #24380: URL: https://github.com/apache/pulsar/issues/24380
### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### User environment

Pulsar: 4.0.4 official docker image
Deployed on K8S

### Issue Description

We have several topics in our Pulsar deployment. For some topics (cache-related), we have geo-replication disabled. Others work as expected: the subscription cursor is replicated to the backup cluster.

However, we are seeing inconsistent behavior with topics used for delayed messages and shared subscriptions. These topics have geo-replication enabled and use individual acknowledgments.

According to the documentation, individual acknowledgments themselves are not replicated across clusters. However, the markDeletePosition should be replicated.

In our tests, we noticed that the markDeletePosition in the backup cluster does not move predictably. In some cases, it remains unchanged for a long time. The only time it eventually advances is after the primary cluster stops receiving new messages to that topic; then, after a delay, the markDeletePosition is finally updated in the backup cluster.
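To make the terms above concrete, here is a simplified model of how individual acknowledgments relate to markDeletePosition. This is only a sketch, not broker code: it assumes entries are plain integers in a single ledger, and the function names `advance_mark_delete` and `individually_deleted` are hypothetical. It does not explain the lag on the backup cluster; it only illustrates why a single unacknowledged delayed message holds markDeletePosition back while later acknowledgments pile up as individually deleted ranges.

```python
from typing import List, Set, Tuple


def advance_mark_delete(mark_delete: int, acked: Set[int]) -> int:
    """Move mark_delete forward while the very next entry is acknowledged."""
    while mark_delete + 1 in acked:
        mark_delete += 1
    return mark_delete


def individually_deleted(mark_delete: int, acked: Set[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges of acked entries past mark_delete."""
    ranges: List[Tuple[int, int]] = []
    for entry in sorted(e for e in acked if e > mark_delete):
        if ranges and entry == ranges[-1][1] + 1:
            # Entry directly follows the previous range, so extend it.
            ranges[-1] = (ranges[-1][0], entry)
        else:
            ranges.append((entry, entry))
    return ranges


# Entries 0..6 exist; entry 2 is a delayed message that is not yet deliverable,
# so the consumer has acknowledged everything except entry 2.
acked = {0, 1, 3, 4, 5, 6}
mark_delete = advance_mark_delete(-1, acked)
pending_ranges = individually_deleted(mark_delete, acked)

# Once the delayed entry 2 is delivered and acknowledged, the gap closes.
acked_after = acked | {2}
mark_delete_after = advance_mark_delete(mark_delete, acked_after)
pending_ranges_after = individually_deleted(mark_delete_after, acked_after)
```

In this model, `mark_delete` stays at 1 with `pending_ranges` equal to `[(3, 6)]` while entry 2 is outstanding. After entry 2 is acknowledged, `mark_delete_after` jumps to 6 and `pending_ranges_after` becomes empty.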
First check of stats-internal

main cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "2382448:33718",
```
backup cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "50797:509",
```

Second check

main cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "2382448:40268",
```
backup cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "50797:509",
```

Third check

main cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "2382722:21942",
```
backup cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "50797:509",
```

Fourth check, after stopping the load tests and with an empty backlog on the main cluster

main cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "2382761:11155",
```
backup cluster
```
"delayed_message_10_min" : {
  "markDeletePosition" : "54807:11139",
```

I also see one difference: on the main cluster, individuallyDeletedMessages disappears after stopping the load test.

### Error messages

```text
```

### Reproducing the issue

1. Deploy two Pulsar clusters.
2. Create the relevant topics.
3. Configure geo-replication between the clusters.
4. Enable subscription replication on the client.
5. Start continuously producing delayed messages to the topic, with delivery delays of up to 10 minutes.
6. On the primary cluster, consume messages selectively (based on delivery time).

Expected behavior: The markDeletePosition should advance on both the primary and the backup clusters.

Actual behavior: The markDeletePosition advances only on the primary cluster. On the backup cluster, a backlog accumulates and markDeletePosition remains stuck for a long time.

### Additional information

Discussion started here: https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1748598297819549

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
