Great discussion! I agree with the conclusions! IMO, we should not encourage users to touch the different implementations of the MessageId. Instead, we should provide a `MesssageIdUtils` in the client library(not the interface) with annotations `@InterfaceAudience.Public @InterfaceStability.Stable`. It is just for the specific purpose that users have to access the LedgerId and EntryId.
Thanks, Penghui On Sat, Nov 12, 2022 at 6:53 AM Michael Marshall <mmarsh...@apache.org> wrote: > Great discussion. I generally agree with the conclusions. > > I'll add two points. > > Several endpoints in the topics admin api require that message ids are > consistently serialized and deserialized over HTTP. For example, the > reset cursor call requires a message id in a correct format or > "latest" or "earliest". I don't think this is a break in the > abstraction, but it does imply that the serialization definitions > (e.g. `toString()`) in each client language need to be consistent. > > Second, the admin api exposes the ledgerId/entryId implementation with > the get message id endpoint, which asks users to specify a ledger id > and an entry id instead of a serialized message id: > > > https://github.com/apache/pulsar/blob/681d51d54937becc5c436cece3b731150afdf6d1/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/PersistentTopics.java#L1930-L1944 > > The Admin Client exposes this endpoint here: > > > https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1658-L1682 > > Based on the conclusion of this thread, it seems like we should > deprecate the affected methods and replace them with ones that take a > message id instead. > > Thanks, > Michael > > > > On Wed, Nov 9, 2022 at 8:34 PM Yunze Xu <y...@streamnative.io.invalid> > wrote: > > > > I also changed my mind after I saw Flink's MesssageIdUtils > implementation. > > > > Now it's clear to me that: > > - For application users, the APIs in the pulsar-client-api module are > > what they should use. > > - For Pulsar ecosystem developers, the APIs in the pulsar-client > > module are interfaces > > > > So at the moment, these MessageId implementations could already be used > in some > > external applications. > > > > In conclusion, > > 1. The MessageId interface should not be touched > > 2. The public methods of the MessageId implementations should not be > touched > > > > Based on these two key points, I'm going to look into these > > implementations and mark > > some as deprecated but they should still work like before. > > > > Thanks, > > Yunze > > > > On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <rdhaba...@apache.org> > wrote: > > > > > > Hi, > > > > > > I was reading the email thread why we want to change MessageId > interface: > > > https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd > > > >> Currently we have the following 5 implementations of MessageId: > > > >> These implementations are such a mess. For example, when users get a > > > MessageId from `Producer#send`: > > > > > > I think above discussion started by stating that MessageId has multiple > > > implementations eg: MessageIdImpl, BatchMessageIdImpl, > ChunkMessageIdImpl, > > > etc.. and as a client we receive a single MessageId with send message > API. > > > > > > Well, I see it's expected and very well defined behavior. No matter > what > > > implementation client library is internally using, as a user for my > topic I > > > should be able to publish and consume messages by providing specific > > > configurations. The moment a user has to know implementation details > of a > > > client library such as chunk or batch message internals, there will be > a > > > strong dependency created between application and server implementation > > > which is a RED flag for any system. Abstraction helps user adoption by > > > simplifying user API and allowing systems to enhance without worrying > about > > > application dependency. > > > Removing abstraction of MessageId and extracting implementation > information > > > in getter API will prevent us to make any implementation change (eg: > won't > > > be able to change chunk mesasgeId behavior which we might need in > future > > > for Shared-subscription), introduce hacks for any enhancement by making > > > sure existing internal implementation can't change (eg: if we can't > change > > > chunk/batch-message-Id then we might create a new wrapper and attach to > > > existing messageId which will make even things worse), and even > prevent us > > > to change backend implementation (eg: we will be stuck with BK-Ledger > > > implementation forever). > > > Pulsar is used for many large scale business usecaeses with a large > number > > > of users and it can create a nightmare for operators and users if we > remove > > > abstractions and start following the practice of exposing system > internals > > > to user applications. > > > > > > Thanks, > > > Rajan > > > > > > > > > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <y...@streamnative.io.invalid> > > > wrote: > > > > > > > Hi Joe, > > > > > > > > Then what would we expect users to do with the MessageId? It should > only > > > > be passed to Consumer#seek or ReaderBuilder#startMessageId? > > > > > > > > What about the partition index? We have a `TopicMetadata` interface > that > > > > returns > > > > the number of partitions. If the partition is also "implementation > > > > details", should we expose > > > > this interface? Or should we support customizing a MessageRouter > because it > > > > returns the partition index? > > > > > > > > What about the batch index and batch size? For example, we have an > > > > enableBatchIndexAcknowledgment method to enable batch index ACK. If > batch > > > > index is also "implementation details", how could users know what > does > > > > "batch > > > > index ack" mean? > > > > > > > > Even for ledger id and entry id, this pair represents a logic storage > > > > position like the offset > > > > concept in Kafka (though each offset represents a message while each > > > > entry represents > > > > a batch). If you see the Message API, it also exposes many > attributes. > > > > IMO, for the > > > > MessageIdData, only the ack_set (a long array serialized from the > > > > BitSet) is the implementation > > > > detail. > > > > > > > > The MessageId API should be flexible, not an abstract one. If not, > why > > > > do we still implement > > > > the toString() method? We should not encourage users to print the > > > > MessageId. It would > > > > be easy to know what "ledger is 0, entry id is 1" means, users only > > > > need to know the concepts > > > > of ledger id and entry id. But it would be harder to know a tuple > like > > > > "0:1:-1:-1" means. > > > > > > > > Thanks, > > > > Yunze > > > > > > > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <joefranc...@gmail.com> wrote: > > > > > > > > > > >Maybe this design is to hidden some details, but if > > > > > users don't know the details like ledger id and entry id, how could > > > > > you know what does "0:0:-1:0" mean? > > > > > > > > > > Abstractions exist for a reason. Ledgerid and entryid are > > > > implementation > > > > > details, and an application should not be interpreting that at all. > > > > > -j > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu > <y...@streamnative.io.invalid> > > > > > wrote: > > > > > > > > > > > I didn't look into these two methods at the moment. But I think > it's > > > > > > possible to > > > > > > retain only the `fromByteArray`. > > > > > > > > > > > > Thanks, > > > > > > Yunze > > > > > > > > > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli < > eolive...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu > > > > > > > <y...@streamnative.io.invalid> ha scritto: > > > > > > > > > > > > > > > > Hi Enrico, > > > > > > > > > > > > > > > > > We also need a way to represent this as a String or a > byte[] > > > > > > > > > > > > > > > > We already have the `toByteArray` method, right? > > > > > > > > > > > > > > Yes, correct. So we are fine. I forgot about it and I answered > too > > > > > > quickly. > > > > > > > > > > > > > > I am not sure if this can be in the scope of this initiative, > but we > > > > > > > should somehow get rid of > > > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray". > > > > > > > > > > > > > > Thanks > > > > > > > Enrico > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Yunze > > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli < > > > > eolive...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu > > > > > > > > > <y...@streamnative.io.invalid> ha scritto: > > > > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > Currently we have the following 5 implementations of > MessageId: > > > > > > > > > > > > > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index) > > > > > > > > > > - BatchMessageIdImpl: adds (batch index, batch size, > acker), > > > > > > where > > > > > > > > > > acker is a wrapper of a BitSet. > > > > > > > > > > - ChunkMessageIdImpl: adds another MessageIdImpl that > > > > represents > > > > > > > > > > the first MessageIdImpl of a BitSet. > > > > > > > > > > - MultiMessageIdImpl: adds a map that maps the topic > name to > > > > the > > > > > > > > > > MessageId. > > > > > > > > > > - TopicMessageIdImpl: adds the topic name and the > partition > > > > name > > > > > > > > > > > > > > > > > > > > These implementations are such a mess. For example, when > users > > > > get > > > > > > a > > > > > > > > > > MessageId from `Producer#send`: > > > > > > > > > > > > > > > > > > > > ```java > > > > > > > > > > var id = producer.send("msg"); > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > There is no getter to get some specific fields like > ledger id. > > > > You > > > > > > can > > > > > > > > > > only see a representation from `toString` method and got > some > > > > > > output > > > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some > details, > > > > but > > > > > > if > > > > > > > > > > users don't know the details like ledger id and entry > id, how > > > > could > > > > > > > > > > you know what does "0:0:-1:0" mean? What if > > > > `MessageId#toString`'s > > > > > > > > > > implementation changed? Should it be treated as a > breaking > > > > change? > > > > > > > > > > > > > > > > > > > > The original definition of the underlying MessageIdData > is much > > > > > > more > > > > > > > > > > clear: > > > > > > > > > > > > > > > > > > > > ```proto > > > > > > > > > > message MessageIdData { > > > > > > > > > > required uint64 ledgerId = 1; > > > > > > > > > > required uint64 entryId = 2; > > > > > > > > > > optional int32 partition = 3 [default = -1]; > > > > > > > > > > optional int32 batch_index = 4 [default = -1]; > > > > > > > > > > repeated int64 ack_set = 5; > > > > > > > > > > optional int32 batch_size = 6; > > > > > > > > > > > > > > > > > > > > // For the chunk message id, we need to specify the > first > > > > > > chunk message id. > > > > > > > > > > optional MessageIdData first_chunk_message_id = 7; > > > > > > > > > > } > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. > It's more > > > > > > natural > > > > > > > > > > to have an interface like: > > > > > > > > > > > > > > > > > > > > ```java > > > > > > > > > > interface MessageId { > > > > > > > > > > long ledgerId(); > > > > > > > > > > long entryId(); > > > > > > > > > > Optional<Integer> partition(); > > > > > > > > > > Optional<Integer> batchIndex(); > > > > > > > > > > // ... > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > This is very good for client applications. > > > > > > > > > We also need a way to represent this as a String or a > byte[], > > > > this > > > > > > way > > > > > > > > > client applications have a standard way to store > > > > > > > > > message offsets into an external system (for instance when > you > > > > want > > > > > > to > > > > > > > > > user the Reader API and keep track of the position by > yourself) > > > > > > > > > > > > > > > > > > Enrico > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Additionally, there are many places that use only the > triple of > > > > > > > > > > (ledger id, entry id, batch index) as the key to > represent the > > > > > > position. > > > > > > > > > > Currently, they are done by adding a conversion from > > > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more > > > > intuitive > > > > > > to > > > > > > > > > > write something like: > > > > > > > > > > > > > > > > > > > > ```java > > > > > > > > > > class MessageIdPosition implements > > > > Comparable<MessageIdPosition> { > > > > > > > > > > private final MessageId messageId; > > > > > > > > > > // TODO: compare only the triple (ledger, entry, > batch) > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > Therefore, I'm going to write a proposal to redesign the > > > > MessageId > > > > > > > > > > interface only by adding some getters. Regarding the 5 > existing > > > > > > > > > > implementations, I think we can drop them because they > are a > > > > part > > > > > > > > > > of `pulsar-client`, not `pulsar-client-api`. > > > > > > > > > > > > > > > > > > > > Please feel free to share your points. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Yunze > > > > > > > > > > >