Re: [DISCUSS]PIP-295: Fixing Chunk Message Duplication Issue

Zike Yang Wed, 23 Aug 2023 23:21:23 -0700

Hi, xiangying

> it will find that the message
is out of order and rewind the cursor. Loop this operation, and
discard this message after it expires instead of assembling 3, 4, 5
into a message.


Could you point out where the implementation for this?  From my
understanding, there should not be any rewind operation for the
chunking feature. You can check more detail here:
https://streamnative.io/blog/deep-dive-into-message-chunking-in-pulsar#how-message-chunking-is-implemented

> This solution also cannot solve the out-of-order messages inside the
chunks. For example, the above five messages will still be persisted.

The consumer already handles this case. The above 5 messages will all
be persisted but the consumer will skip message 1 and 2.
For messages 3, 4, and 5. The producer can guarantee these chunks are in order.

BR,
Zike Yang

On Thu, Aug 24, 2023 at 11:48 AM Yubiao Feng
<yubiao.f...@streamnative.io.invalid> wrote:
>
> > 1. SequenceID: 0, ChunkID: 0
> > 2. SequenceID: 0, ChunkID: 1
> > 3. SequenceID: 0, ChunkID: 0
> > 4. SequenceID: 0, ChunkID: 1
> > 5. SequenceID: 0, ChunkID: 2
> > For the existing behavior, the consumer assembles
> > messages 3,4,5 into
> > the original large message. But the changes brought
> > about by this PIP
> > will cause the consumer to use messages 1,2,5 for
> > assembly. There is
> > no guarantee that the producer will split the message
> > in the same way
> > twice before and after. For example, the producer's
> > maxMessageSize may
> > be different. This may cause the consumer to
> > receive a corrupt
> > message.
>
> Good point.
>
>
> Thanks
> Yubiao Feng
>
> On Wed, Aug 23, 2023 at 12:34 PM Zike Yang <z...@apache.org> wrote:
>
> > Hi, xiangying,
> >
> > Thanks for your PIP.
> >
> > IIUC, this may change the existing behavior and may introduce
> > inconsistencies.
> > Suppose that we have a large message with 3 chunks. But the producer
> > crashes and resends the message after sending the chunk-1. It will
> > send a total of 5 messages to the Pulsar topic:
> >
> > 1. SequenceID: 0, ChunkID: 0
> > 2. SequenceID: 0, ChunkID: 1
> > 3. SequenceID: 0, ChunkID: 0   -> This message will be dropped
> > 4. SequenceID: 0, ChunkID: 1    -> Will also be dropped
> > 5. SequenceID: 0, ChunkID: 2    -> The last chunk of the message
> >
> > For the existing behavior, the consumer assembles messages 3,4,5 into
> > the original large message. But the changes brought about by this PIP
> > will cause the consumer to use messages 1,2,5 for assembly. There is
> > no guarantee that the producer will split the message in the same way
> > twice before and after. For example, the producer's maxMessageSize may
> > be different. This may cause the consumer to receive a corrupt
> > message.
> >
> > Also, this PIP increases the complexity of handling chunks on the
> > broker side. Brokers should, in general, treat the chunk as a normal
> > message.
> >
> > I think a simple better approach is to only check the deduplication
> > for the last chunk of the large message. The consumer only gets the
> > whole message after receiving the last chunk. We don't need to check
> > the deduplication for all previous chunks. Also by doing this we only
> > need bug fixes, we don't need to introduce a new PIP.
> >
> > BR,
> > Zike Yang
> >
> > On Fri, Aug 18, 2023 at 7:54 PM Xiangying Meng <xiangy...@apache.org>
> > wrote:
> > >
> > > Dear Community,
> > >
> > > I hope this email finds you well. I'd like to address an important
> > > issue related to Apache Pulsar and discuss a solution I've proposed on
> > > GitHub. The problem pertains to the handling of Chunk Messages after
> > > enabling deduplication.
> > >
> > > In the current version of Apache Pulsar, all chunks of a Chunk Message
> > > share the same sequence ID. However, enabling the depublication
> > > feature results in an inability to send Chunk Messages. To tackle this
> > > problem, I've proposed a solution [1] that ensures messages are not
> > > duplicated throughout end-to-end delivery. While this fix addresses
> > > the duplication issue for end-to-end messages, there remains a
> > > possibility of duplicate chunks within topics.
> > >
> > > To address this concern, I believe we should introduce a "Chunk ID
> > > map" at the Broker level, similar to the existing "sequence ID map",
> > > to facilitate effective filtering. However, implementing this has led
> > > to a challenge: a producer requires storage for two Long values
> > > simultaneously (sequence ID and chunk ID). Because the snapshot of the
> > > sequence ID map is stored through the properties of the cursor
> > > (Map<String, Long>), so in order to satisfy the storage of two Longs
> > > (sequence ID, chunk ID) corresponding to one producer, we hope to add
> > > a mark DeleteProperties (Map<String, Long>) String, String>) to
> > > replace the properties (Map<String, Long>) field. To resolve this,
> > > I've proposed an alternative proposal [2] involving the introduction
> > > of a "mark DeleteProperties" (Map<String, String>) to replace the
> > > current properties (Map<String, Long>) field.
> > >
> > > I'd appreciate it if you carefully review both PRs and share your
> > > valuable feedback and insights. Thank you immensely for your time and
> > > attention. I eagerly anticipate your valuable opinions and
> > > recommendations.
> > >
> > > Warm regards,
> > > Xiangying
> > >
> > > [1] https://github.com/apache/pulsar/pull/20948
> > > [2] https://github.com/apache/pulsar/pull/21027
> >

Re: [DISCUSS]PIP-295: Fixing Chunk Message Duplication Issue

Reply via email to