> 1. SequenceID: 0, ChunkID: 0
> 2. SequenceID: 0, ChunkID: 1
> 3. SequenceID: 0, ChunkID: 0
> 4. SequenceID: 0, ChunkID: 1
> 5. SequenceID: 0, ChunkID: 2
>
> For the existing behavior, the consumer assembles messages 3,4,5 into
> the original large message. But the changes brought about by this PIP
> will cause the consumer to use messages 1,2,5 for assembly. There is
> no guarantee that the producer will split the message in the same way
> twice before and after. For example, the producer's maxMessageSize may
> be different. This may cause the consumer to receive a corrupt
> message.

Good point.

Thanks
Yubiao Feng

On Wed, Aug 23, 2023 at 12:34 PM Zike Yang <z...@apache.org> wrote:

> Hi, xiangying,
>
> Thanks for your PIP.
>
> IIUC, this may change the existing behavior and may introduce
> inconsistencies.
> Suppose that we have a large message with 3 chunks. But the producer
> crashes and resends the message after sending the chunk-1. It will
> send a total of 5 messages to the Pulsar topic:
>
> 1. SequenceID: 0, ChunkID: 0
> 2. SequenceID: 0, ChunkID: 1
> 3. SequenceID: 0, ChunkID: 0 -> This message will be dropped
> 4. SequenceID: 0, ChunkID: 1 -> Will also be dropped
> 5. SequenceID: 0, ChunkID: 2 -> The last chunk of the message
>
> For the existing behavior, the consumer assembles messages 3,4,5 into
> the original large message. But the changes brought about by this PIP
> will cause the consumer to use messages 1,2,5 for assembly. There is
> no guarantee that the producer will split the message in the same way
> twice before and after. For example, the producer's maxMessageSize may
> be different. This may cause the consumer to receive a corrupt
> message.
>
> Also, this PIP increases the complexity of handling chunks on the
> broker side. Brokers should, in general, treat the chunk as a normal
> message.
>
> I think a simpler, better approach is to only check the deduplication
> for the last chunk of the large message. The consumer only gets the
> whole message after receiving the last chunk. We don't need to check
> the deduplication for all previous chunks. Also, by doing this we only
> need bug fixes; we don't need to introduce a new PIP.
>
> BR,
> Zike Yang
>
> On Fri, Aug 18, 2023 at 7:54 PM Xiangying Meng <xiangy...@apache.org>
> wrote:
> >
> > Dear Community,
> >
> > I hope this email finds you well. I'd like to address an important
> > issue related to Apache Pulsar and discuss a solution I've proposed on
> > GitHub. The problem pertains to the handling of Chunk Messages after
> > enabling deduplication.
> >
> > In the current version of Apache Pulsar, all chunks of a Chunk Message
> > share the same sequence ID. However, enabling the deduplication
> > feature results in an inability to send Chunk Messages. To tackle this
> > problem, I've proposed a solution [1] that ensures messages are not
> > duplicated throughout end-to-end delivery. While this fix addresses
> > the duplication issue for end-to-end messages, there remains a
> > possibility of duplicate chunks within topics.
> >
> > To address this concern, I believe we should introduce a "Chunk ID
> > map" at the Broker level, similar to the existing "sequence ID map",
> > to facilitate effective filtering. However, implementing this has led
> > to a challenge: a producer requires storage for two Long values
> > simultaneously (sequence ID and chunk ID). The snapshot of the
> > sequence ID map is stored through the properties of the cursor
> > (Map<String, Long>), so in order to store two Longs (sequence ID,
> > chunk ID) for one producer, we hope to add a mark DeleteProperties
> > (Map<String, String>) to replace the properties (Map<String, Long>)
> > field. To resolve this, I've proposed an alternative proposal [2]
> > involving the introduction of a "mark DeleteProperties"
> > (Map<String, String>) to replace the current properties
> > (Map<String, Long>) field.
> >
> > I'd appreciate it if you carefully review both PRs and share your
> > valuable feedback and insights. Thank you immensely for your time and
> > attention. I eagerly anticipate your valuable opinions and
> > recommendations.
> >
> > Warm regards,
> > Xiangying
> >
> > [1] https://github.com/apache/pulsar/pull/20948
> > [2] https://github.com/apache/pulsar/pull/21027
>
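
For readers following the thread, the five-message scenario Zike describes can be sketched as a toy model. This is only an illustration under stated assumptions, not Pulsar's broker or consumer code: every name here (Chunk, split, broker_dedup_every_chunk, broker_dedup_last_chunk_only, consumer_assemble) is hypothetical, the payload and chunk sizes are made up, and the consumer is assumed to restart assembly whenever it sees chunk 0 again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    sequence_id: int
    chunk_id: int
    total_chunks: int
    payload: str


def split(message, sequence_id, max_size):
    """Split a message into chunks of at most max_size characters."""
    parts = [message[i:i + max_size] for i in range(0, len(message), max_size)]
    return [Chunk(sequence_id, i, len(parts), p) for i, p in enumerate(parts)]


def broker_dedup_every_chunk(published):
    """Hypothetical per-chunk dedup: keep a chunk only if its
    (sequence_id, chunk_id) is greater than the last one stored."""
    last_key = None
    stored = []
    for chunk in published:
        key = (chunk.sequence_id, chunk.chunk_id)
        if last_key is not None and key <= last_key:
            continue  # treated as a duplicate and dropped
        last_key = key
        stored.append(chunk)
    return stored


def broker_dedup_last_chunk_only(published):
    """Hypothetical last-chunk dedup: earlier chunks pass through as normal
    messages; only the final chunk is checked against the sequence ID."""
    last_completed = -1
    stored = []
    for chunk in published:
        if chunk.chunk_id == chunk.total_chunks - 1:
            if chunk.sequence_id <= last_completed:
                continue  # the whole message was already delivered once
            last_completed = chunk.sequence_id
        stored.append(chunk)
    return stored


def consumer_assemble(stored):
    """Simplified consumer: chunk 0 restarts the buffer for its sequence ID,
    and an out-of-order chunk discards the buffer."""
    buffers = {}
    messages = []
    for chunk in stored:
        if chunk.chunk_id == 0:
            buffers[chunk.sequence_id] = [chunk.payload]
        else:
            buffer = buffers.get(chunk.sequence_id)
            if buffer is None or len(buffer) != chunk.chunk_id:
                buffers.pop(chunk.sequence_id, None)
                continue
            buffer.append(chunk.payload)
        if len(buffers[chunk.sequence_id]) == chunk.total_chunks:
            messages.append("".join(buffers.pop(chunk.sequence_id)))
    return messages


# The producer sends chunks 0 and 1, crashes, then resends all three chunks
# with a different (made-up) maximum chunk size.
first_attempt = split("abcdefghi", sequence_id=0, max_size=4)
second_attempt = split("abcdefghi", sequence_id=0, max_size=3)
published = first_attempt[:2] + second_attempt

stored_per_chunk = broker_dedup_every_chunk(published)
stored_last_only = broker_dedup_last_chunk_only(published)

print(consumer_assemble(stored_per_chunk))  # assembled from messages 1, 2, 5
print(consumer_assemble(stored_last_only))  # assembled from messages 3, 4, 5
```

In this model the per-chunk filter leaves messages 1, 2 and 5 on the topic, so the consumer joins chunks from two different splits and gets "abcdefghghi". Checking only the last chunk keeps all five messages, and the consumer rebuilds "abcdefghi" from messages 3, 4 and 5.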