Dear Community,

I hope this email finds you well. I'd like to raise an issue in Apache Pulsar and discuss a solution I've proposed on GitHub. The problem concerns the handling of Chunk Messages once deduplication is enabled.

In the current version of Apache Pulsar, all chunks of a Chunk Message share the same sequence ID. With the deduplication feature enabled, this makes it impossible to send Chunk Messages, since every chunk after the first carries a sequence ID the Broker has already seen. To tackle this problem, I've proposed a solution [1] that ensures messages are not duplicated in end-to-end delivery.

While this fix addresses duplication for end-to-end messages, duplicate chunks can still be stored within topics. To address this concern, I believe we should introduce a "Chunk ID map" at the Broker level, similar to the existing "sequence ID map", so that duplicate chunks can be filtered effectively.

Implementing this raises a challenge: each producer now needs two Long values stored together (sequence ID and chunk ID). The snapshot of the sequence ID map is stored in the properties of the cursor (Map<String, Long>), so one producer cannot be associated with two Longs there. To resolve this, I've opened a second proposal [2] that introduces markDeleteProperties (Map<String, String>) to replace the current properties (Map<String, Long>) field.

I'd appreciate it if you could review both PRs and share your feedback. Thank you for your time and attention. I look forward to your opinions and recommendations.

Warm regards,
Xiangying

[1] https://github.com/apache/pulsar/pull/20948
[2] https://github.com/apache/pulsar/pull/21027
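
P.S. To make the storage constraint concrete, here is a minimal sketch of why a Map<String, String> can carry both IDs for one producer while a Map<String, Long> carries only one value per key. The class name, the "sequenceId:chunkId" encoding, and the encode/decode helpers are assumptions made up for illustration; they are not the format used in either PR.

```java
import java.util.HashMap;
import java.util.Map;

public class DedupSnapshotSketch {
    // Hypothetical encoding: pack both IDs into one String as "sequenceId:chunkId".
    static String encode(long sequenceId, long chunkId) {
        return sequenceId + ":" + chunkId;
    }

    // Reverse of encode(): returns {sequenceId, chunkId}.
    static long[] decode(String value) {
        String[] parts = value.split(":", 2);
        return new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])};
    }

    public static void main(String[] args) {
        // Current shape: one Long per producer key, so only the sequence ID fits.
        Map<String, Long> properties = new HashMap<>();
        properties.put("producer-a", 42L);

        // Proposed shape: a String value leaves room for both IDs.
        Map<String, String> markDeleteProperties = new HashMap<>();
        markDeleteProperties.put("producer-a", encode(42L, 3L));

        long[] ids = decode(markDeleteProperties.get("producer-a"));
        System.out.println(ids[0] + " " + ids[1]); // prints "42 3"
    }
}
```

Any encoding that round-trips both values would serve equally well; the point is only that the value type has to be wider than a single Long.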