Re: [DISCUSSION] Redesign the MessageId interface

PengHui Li Sun, 13 Nov 2022 19:25:06 -0800

Great discussion!
I agree with the conclusions!

IMO, we should not encourage users to touch the different implementations
of the MessageId.
Instead, we should provide a `MesssageIdUtils` in the client library(not
the interface)
with annotations `@InterfaceAudience.Public @InterfaceStability.Stable`. It
is just for
the specific purpose that users have to access the LedgerId and EntryId.


Thanks,
Penghui

On Sat, Nov 12, 2022 at 6:53 AM Michael Marshall <mmarsh...@apache.org>
wrote:

> Great discussion. I generally agree with the conclusions.
>
> I'll add two points.
>
> Several endpoints in the topics admin api require that message ids are
> consistently serialized and deserialized over HTTP. For example, the
> reset cursor call requires a message id in a correct format or
> "latest" or "earliest". I don't think this is a break in the
> abstraction, but it does imply that the serialization definitions
> (e.g. `toString()`) in each client language need to be consistent.
>
> Second, the admin api exposes the ledgerId/entryId implementation with
> the get message id endpoint, which asks users to specify a ledger id
> and an entry id instead of a serialized message id:
>
>
> https://github.com/apache/pulsar/blob/681d51d54937becc5c436cece3b731150afdf6d1/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/PersistentTopics.java#L1930-L1944
>
> The Admin Client exposes this endpoint here:
>
>
> https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1658-L1682
>
> Based on the conclusion of this thread, it seems like we should
> deprecate the affected methods and replace them with ones that take a
> message id instead.
>
> Thanks,
> Michael
>
>
>
> On Wed, Nov 9, 2022 at 8:34 PM Yunze Xu <y...@streamnative.io.invalid>
> wrote:
> >
> > I also changed my mind after I saw Flink's MesssageIdUtils
> implementation.
> >
> > Now it's clear to me that:
> > - For application users, the APIs in the pulsar-client-api module are
> > what they should use.
> > - For Pulsar ecosystem developers, the APIs in the pulsar-client
> > module are interfaces
> >
> > So at the moment, these MessageId implementations could already be used
> in some
> > external applications.
> >
> > In conclusion,
> > 1. The MessageId interface should not be touched
> > 2. The public methods of the MessageId implementations should not be
> touched
> >
> > Based on these two key points, I'm going to look into these
> > implementations and mark
> > some as deprecated but they should still work like before.
> >
> > Thanks,
> > Yunze
> >
> > On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <rdhaba...@apache.org>
> wrote:
> > >
> > > Hi,
> > >
> > > I was reading the email thread why we want to change MessageId
> interface:
> > > https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
> > > >> Currently we have the following 5 implementations of MessageId:
> > > >> These implementations are such a mess. For example, when users get a
> > > MessageId from `Producer#send`:
> > >
> > > I think above discussion started by stating that MessageId has multiple
> > > implementations eg: MessageIdImpl, BatchMessageIdImpl,
> ChunkMessageIdImpl,
> > > etc.. and as a client we receive a single MessageId with send message
> API.
> > >
> > > Well, I see it's expected and very well defined behavior. No matter
> what
> > > implementation client library is internally using, as a user for my
> topic I
> > > should be able to publish and consume messages by providing specific
> > > configurations. The moment a user has to know implementation details
> of a
> > > client library such as chunk or batch message internals, there will be
> a
> > > strong dependency created between application and server implementation
> > > which is a RED flag for any system. Abstraction helps user adoption by
> > > simplifying user API and allowing systems to enhance without worrying
> about
> > > application dependency.
> > > Removing abstraction of MessageId and extracting implementation
> information
> > > in getter API will prevent us to make any implementation change (eg:
> won't
> > > be able to change chunk mesasgeId behavior which we might need in
> future
> > > for Shared-subscription), introduce hacks for any enhancement by making
> > > sure existing internal implementation can't change (eg: if we can't
> change
> > > chunk/batch-message-Id then we might create a new wrapper and attach to
> > > existing messageId which will make even things worse), and even
> prevent us
> > > to change backend implementation (eg: we will be stuck with BK-Ledger
> > > implementation forever).
> > > Pulsar is used for many large scale business usecaeses with a large
> number
> > > of users and it can create a nightmare for operators and users if we
> remove
> > > abstractions and start following the practice of exposing system
> internals
> > > to user applications.
> > >
> > > Thanks,
> > > Rajan
> > >
> > >
> > > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <y...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > Hi Joe,
> > > >
> > > > Then what would we expect users to do with the MessageId? It should
> only
> > > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > > >
> > > > What about the partition index? We have a `TopicMetadata` interface
> that
> > > > returns
> > > > the number of partitions. If the partition is also "implementation
> > > > details", should we expose
> > > > this interface? Or should we support customizing a MessageRouter
> because it
> > > > returns the partition index?
> > > >
> > > > What about the batch index and batch size? For example, we have an
> > > > enableBatchIndexAcknowledgment method to enable batch index ACK. If
> batch
> > > > index is also "implementation details", how could users know what
> does
> > > > "batch
> > > > index ack" mean?
> > > >
> > > > Even for ledger id and entry id, this pair represents a logic storage
> > > > position like the offset
> > > > concept in Kafka (though each offset represents a message while each
> > > > entry represents
> > > > a batch). If you see the Message API, it also exposes many
> attributes.
> > > > IMO, for the
> > > > MessageIdData, only the ack_set (a long array serialized from the
> > > > BitSet) is the implementation
> > > > detail.
> > > >
> > > > The MessageId API should be flexible, not an abstract one. If not,
> why
> > > > do we still implement
> > > > the toString() method? We should not encourage users to print the
> > > > MessageId. It would
> > > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > > need to know the concepts
> > > > of ledger id and entry id. But it would be harder to know a tuple
> like
> > > > "0:1:-1:-1" means.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <joefranc...@gmail.com> wrote:
> > > > >
> > > > > >Maybe this design is to hidden some details, but if
> > > > > users don't know the details like ledger id and entry id, how could
> > > > > you know what does "0:0:-1:0" mean?
> > > > >
> > > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > > implementation
> > > > > details, and an application should not be interpreting that at all.
> > > > > -j
> > > > >
> > > > >
> > > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu
> <y...@streamnative.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > I didn't look into these two methods at the moment. But I think
> it's
> > > > > > possible to
> > > > > > retain only the `fromByteArray`.
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <
> eolive...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > > <y...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi Enrico,
> > > > > > > >
> > > > > > > > > We also need a way to represent this as a String or a
> byte[]
> > > > > > > >
> > > > > > > > We already have the `toByteArray` method, right?
> > > > > > >
> > > > > > > Yes, correct. So we are fine. I forgot about it and I answered
> too
> > > > > > quickly.
> > > > > > >
> > > > > > > I am not sure if this can be in the scope of this initiative,
> but we
> > > > > > > should somehow get rid of
> > > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > > >
> > > > > > > Thanks
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > > > > > >
> > > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > > eolive...@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > > <y...@streamnative.io.invalid> ha scritto:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Currently we have the following 5 implementations of
> MessageId:
> > > > > > > > > >
> > > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> acker),
> > > > > > where
> > > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > > represents
> > > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic
> name to
> > > > the
> > > > > > > > > >     MessageId.
> > > > > > > > > > - TopicMessageIdImpl: adds the topic name and the
> partition
> > > > name
> > > > > > > > > >
> > > > > > > > > > These implementations are such a mess. For example, when
> users
> > > > get
> > > > > > a
> > > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > var id = producer.send("msg");
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > There is no getter to get some specific fields like
> ledger id.
> > > > You
> > > > > > can
> > > > > > > > > > only see a representation from `toString` method and got
> some
> > > > > > output
> > > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some
> details,
> > > > but
> > > > > > if
> > > > > > > > > > users don't know the details like ledger id and entry
> id, how
> > > > could
> > > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > > `MessageId#toString`'s
> > > > > > > > > > implementation changed? Should it be treated as a
> breaking
> > > > change?
> > > > > > > > > >
> > > > > > > > > > The original definition of the underlying MessageIdData
> is much
> > > > > > more
> > > > > > > > > > clear:
> > > > > > > > > >
> > > > > > > > > > ```proto
> > > > > > > > > > message MessageIdData {
> > > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > > >
> > > > > > > > > >     // For the chunk message id, we need to specify the
> first
> > > > > > chunk message id.
> > > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > > }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData.
> It's more
> > > > > > natural
> > > > > > > > > > to have an interface like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > interface MessageId {
> > > > > > > > > >     long ledgerId();
> > > > > > > > > >     long entryId();
> > > > > > > > > >     Optional<Integer> partition();
> > > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > > >     // ...
> > > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > This is very good for client applications.
> > > > > > > > > We also need a way to represent this as a String or a
> byte[],
> > > > this
> > > > > > way
> > > > > > > > > client applications have a standard way to store
> > > > > > > > > message offsets into an external system (for instance when
> you
> > > > want
> > > > > > to
> > > > > > > > > user the Reader API and keep track of the position by
> yourself)
> > > > > > > > >
> > > > > > > > > Enrico
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Additionally, there are many places that use only the
> triple of
> > > > > > > > > > (ledger id, entry id, batch index) as the key to
> represent the
> > > > > > position.
> > > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > > intuitive
> > > > > > to
> > > > > > > > > > write something like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > class MessageIdPosition implements
> > > > Comparable<MessageIdPosition> {
> > > > > > > > > >     private final MessageId messageId;
> > > > > > > > > >     // TODO: compare only the triple (ledger, entry,
> batch)
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > > > MessageId
> > > > > > > > > > interface only by adding some getters. Regarding the 5
> existing
> > > > > > > > > > implementations, I think we can drop them because they
> are a
> > > > part
> > > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > > >
> > > > > > > > > > Please feel free to share your points.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Yunze
> > > > > >
> > > >
>

Re: [DISCUSSION] Redesign the MessageId interface

Reply via email to