Re: [DISCUSSION] Redesign the MessageId interface

Yunze Xu Wed, 09 Nov 2022 18:34:10 -0800

I also changed my mind after I saw Flink's MesssageIdUtils implementation.

Now it's clear to me that:
- For application users, the APIs in the pulsar-client-api module are
what they should use.
- For Pulsar ecosystem developers, the APIs in the pulsar-client
module are interfaces


So at the moment, these MessageId implementations could already be used in some
external applications.

In conclusion,
1. The MessageId interface should not be touched
2. The public methods of the MessageId implementations should not be touched

Based on these two key points, I'm going to look into these
implementations and mark
some as deprecated but they should still work like before.

Thanks,
Yunze

On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <[email protected]> wrote:
>
> Hi,
>
> I was reading the email thread why we want to change MessageId interface:
> https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
> >> Currently we have the following 5 implementations of MessageId:
> >> These implementations are such a mess. For example, when users get a
> MessageId from `Producer#send`:
>
> I think above discussion started by stating that MessageId has multiple
> implementations eg: MessageIdImpl, BatchMessageIdImpl, ChunkMessageIdImpl,
> etc.. and as a client we receive a single MessageId with send message API.
>
> Well, I see it's expected and very well defined behavior. No matter what
> implementation client library is internally using, as a user for my topic I
> should be able to publish and consume messages by providing specific
> configurations. The moment a user has to know implementation details of a
> client library such as chunk or batch message internals, there will be a
> strong dependency created between application and server implementation
> which is a RED flag for any system. Abstraction helps user adoption by
> simplifying user API and allowing systems to enhance without worrying about
> application dependency.
> Removing abstraction of MessageId and extracting implementation information
> in getter API will prevent us to make any implementation change (eg: won't
> be able to change chunk mesasgeId behavior which we might need in future
> for Shared-subscription), introduce hacks for any enhancement by making
> sure existing internal implementation can't change (eg: if we can't change
> chunk/batch-message-Id then we might create a new wrapper and attach to
> existing messageId which will make even things worse), and even prevent us
> to change backend implementation (eg: we will be stuck with BK-Ledger
> implementation forever).
> Pulsar is used for many large scale business usecaeses with a large number
> of users and it can create a nightmare for operators and users if we remove
> abstractions and start following the practice of exposing system internals
> to user applications.
>
> Thanks,
> Rajan
>
>
> On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <[email protected]>
> wrote:
>
> > Hi Joe,
> >
> > Then what would we expect users to do with the MessageId? It should only
> > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> >
> > What about the partition index? We have a `TopicMetadata` interface that
> > returns
> > the number of partitions. If the partition is also "implementation
> > details", should we expose
> > this interface? Or should we support customizing a MessageRouter because it
> > returns the partition index?
> >
> > What about the batch index and batch size? For example, we have an
> > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > index is also "implementation details", how could users know what does
> > "batch
> > index ack" mean?
> >
> > Even for ledger id and entry id, this pair represents a logic storage
> > position like the offset
> > concept in Kafka (though each offset represents a message while each
> > entry represents
> > a batch). If you see the Message API, it also exposes many attributes.
> > IMO, for the
> > MessageIdData, only the ack_set (a long array serialized from the
> > BitSet) is the implementation
> > detail.
> >
> > The MessageId API should be flexible, not an abstract one. If not, why
> > do we still implement
> > the toString() method? We should not encourage users to print the
> > MessageId. It would
> > be easy to know what "ledger is 0, entry id is 1" means, users only
> > need to know the concepts
> > of ledger id and entry id. But it would be harder to know a tuple like
> > "0:1:-1:-1" means.
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 11:16 PM Joe F <[email protected]> wrote:
> > >
> > > >Maybe this design is to hidden some details, but if
> > > users don't know the details like ledger id and entry id, how could
> > > you know what does "0:0:-1:0" mean?
> > >
> > >  Abstractions exist for a reason. Ledgerid and entryid are
> > implementation
> > > details, and an application should not be interpreting that at all.
> > > -j
> > >
> > >
> > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <[email protected]>
> > > wrote:
> > >
> > > > I didn't look into these two methods at the moment. But I think it's
> > > > possible to
> > > > retain only the `fromByteArray`.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <[email protected]>
> > > > wrote:
> > > > >
> > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > <[email protected]> ha scritto:
> > > > > >
> > > > > > Hi Enrico,
> > > > > >
> > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > >
> > > > > > We already have the `toByteArray` method, right?
> > > > >
> > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > quickly.
> > > > >
> > > > > I am not sure if this can be in the scope of this initiative, but we
> > > > > should somehow get rid of
> > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > >
> > > > > Thanks
> > > > > Enrico
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > [email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > <[email protected]> ha scritto:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > > >
> > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > > where
> > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > represents
> > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> > the
> > > > > > > >     MessageId.
> > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > name
> > > > > > > >
> > > > > > > > These implementations are such a mess. For example, when users
> > get
> > > > a
> > > > > > > > MessageId from `Producer#send`:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > var id = producer.send("msg");
> > > > > > > > ```
> > > > > > > >
> > > > > > > > There is no getter to get some specific fields like ledger id.
> > You
> > > > can
> > > > > > > > only see a representation from `toString` method and got some
> > > > output
> > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > but
> > > > if
> > > > > > > > users don't know the details like ledger id and entry id, how
> > could
> > > > > > > > you know what does "0:0:-1:0" mean? What if
> > `MessageId#toString`'s
> > > > > > > > implementation changed? Should it be treated as a breaking
> > change?
> > > > > > > >
> > > > > > > > The original definition of the underlying MessageIdData is much
> > > > more
> > > > > > > > clear:
> > > > > > > >
> > > > > > > > ```proto
> > > > > > > > message MessageIdData {
> > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > >     required uint64 entryId  = 2;
> > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > >     optional int32 batch_size = 6;
> > > > > > > >
> > > > > > > >     // For the chunk message id, we need to specify the first
> > > > chunk message id.
> > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > > natural
> > > > > > > > to have an interface like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > interface MessageId {
> > > > > > > >     long ledgerId();
> > > > > > > >     long entryId();
> > > > > > > >     Optional<Integer> partition();
> > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > >     // ...
> > > > > > > > ```
> > > > > > >
> > > > > > > This is very good for client applications.
> > > > > > > We also need a way to represent this as a String or a byte[],
> > this
> > > > way
> > > > > > > client applications have a standard way to store
> > > > > > > message offsets into an external system (for instance when you
> > want
> > > > to
> > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > > position.
> > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > intuitive
> > > > to
> > > > > > > > write something like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > class MessageIdPosition implements
> > Comparable<MessageIdPosition> {
> > > > > > > >     private final MessageId messageId;
> > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > MessageId
> > > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > > implementations, I think we can drop them because they are a
> > part
> > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > >
> > > > > > > > Please feel free to share your points.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > >
> >

Re: [DISCUSSION] Redesign the MessageId interface

Reply via email to