Matteo,
Improving Protobuf serialization and deserialization can be a separate PIP
because it doesn't solve the problems that this PIP tries to solve.
This PIP probably doesn't state the problem clearly. "Raw Metadata" is
probably an inaccurate name. It should probably be called "Broker Metadata",
which distinguishes it from the existing metadata that is generated on the
client side.
The "Broker Metadata" is used for storing all the information generated by
brokers, such as the broker-side publish (append) timestamp, a monotonically
increasing log index, and so on.
The main reason we don't re-use the existing "metadata" for this purpose is
not a serialization & deserialization concern. It is more of a checksum
concern: the current metadata section is generated by the clients, and the
corresponding checksum is set by the clients as well. If we want to mutate
the metadata, we have to re-generate the checksum. This would hugely impact
performance.
Introducing a "Broker Metadata" section avoids re-generating the checksum
for a message batch. It also separates the client-generated metadata from
the broker-generated metadata, so the broker never risks touching
client-generated metadata by mistake.
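
To illustrate the checksum point, here is a minimal Python sketch. It assumes
a simplified, hypothetical frame layout (not Pulsar's actual wire format) and
uses zlib.crc32 as a stand-in checksum; all function names and field sizes
are made up for the example. The broker prepends its own section in front of
the client frame, so the client-computed checksum still verifies without
being recomputed.

```python
import struct
import zlib


def client_frame(metadata: bytes, payload: bytes) -> bytes:
    """Client side (hypothetical layout): checksum covers metadata + payload."""
    body = struct.pack(">I", len(metadata)) + metadata + payload
    checksum = zlib.crc32(body)  # stand-in checksum, assumed for illustration
    return struct.pack(">I", checksum) + body


def broker_wrap(frame: bytes, publish_time_ms: int, index: int) -> bytes:
    """Broker side: prepend broker metadata; client bytes are left untouched."""
    broker_md = struct.pack(">qq", publish_time_ms, index)
    return struct.pack(">H", len(broker_md)) + broker_md + frame


def broker_unwrap(entry: bytes):
    """Read the broker metadata without parsing the client metadata."""
    (md_len,) = struct.unpack_from(">H", entry, 0)
    publish_time_ms, index = struct.unpack_from(">qq", entry, 2)
    return (publish_time_ms, index), entry[2 + md_len:]


def checksum_ok(frame: bytes) -> bool:
    """Verify the client checksum over the original client frame."""
    (stored,) = struct.unpack_from(">I", frame, 0)
    return zlib.crc32(frame[4:]) == stored
```

In this sketch, the bytes returned by broker_unwrap are identical to what the
client sent, so checksum_ok still passes on them. If the broker had instead
edited the client metadata in place, the stored checksum would no longer
match and would have to be recomputed for every entry.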
Hope this clarifies the problems that this PIP tries to solve.
Aloys,
If you agree with my comments above, I think we should rename "Raw
Metadata" to "Broker Metadata" to avoid any confusion.
Thanks,
Sijie
On Wed, Nov 18, 2020 at 10:47 AM Matteo Merli wrote:
> The goal of this proposal can be achieved automatically by using a
> better ser/de generator that doesn't have overhead for ignored fields.
>
> I'm preparing a revamping of the current Protobuf serialization and
> I'll send a proposal soon.
>
> Matteo
>
>
> --
> Matteo Merli
>
>
> On Sun, Nov 8, 2020 at 10:25 PM Aloys Zhang wrote:
> >
> > Hi all,
> >
> > We have drafted a proposal for supporting lightweight raw Message
> > metadata, which can be found at
> > https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata
> > and
> > https://docs.google.com/document/d/1IgnF9AJzL6JG6G4EL_xcoQxvOpd7bUXcgxFApBiPOFY
> >
> > I have also copied it into this email thread for easier viewing.
> >
> > Any suggestions or ideas are welcome.
> >
> >
> >
> > ## PIP-70: Introduce lightweight raw Message metadata
> >
> > ### 1. Motivation
> >
> > For messages in Pulsar, if we want to add a new property, we always change
> > the `MessageMetadata` in the protocol (PulsarApi.proto). This kind of
> > property can be understood by both the broker side and the client side by
> > deserializing the `MessageMetadata`. But in some cases, the property
> > needs to be added on the broker side, or needs to be understood by the
> > broker side at low cost. When the broker receives a message produced by
> > a client, it could add the property in a new area that is not combined
> > with `MessageMetadata`, so there is no need to deserialize the original
> > `MessageMetadata` to read it. When the broker sends the message to a
> > client, it could filter out this part (or not, as the client needs). We
> > call this kind of property “raw Message metadata”. In this way,
> > consuming the “raw Message metadata” is independent of the original
> > `MessageMetadata`.
> >
> > The benefit of this kind of “raw Message metadata” is that the broker does
> > not need to serialize/deserialize the protobuf-encoded `MessageMetadata`,
> > which gives better performance. It could also enable features that are
> > not supported yet.
> >
> > Here are some of the use cases for raw Message metadata:
> > 1) Provide messages ordered by broker-side time, to make message seek by
> > time more accurate.
> > Currently, each message has a `publish_time`, which uses client-side time.
> > For different producers in different clients, the clocks may not be
> > aligned, so the message order and the `publish_time` order may differ.
> > But each topic-partition has only one owner broker. If we append the
> > broker-side time in the “raw Message metadata”, we can make sure the
> > message order is aligned with broker-side time. With this feature, we
> > could handle message seek by time more accurately.
> >
> > 2) Provide a continuous message sequence-Id for messages in one
> > topic-partition.
> > MessageId is a combination of ledgerId+entryId+batchIndex; for a partition
> > that contains more than one ledger, the Ids inside are not continuous. With
> > this solution, we could append a sequence-Id at the end of each Message.
> > This will make message sequence management easier.
> >
> > In this proposal, we will focus on the first feature mentioned above,
> > “provide ordered message by time(broker side) sequence”; this will make
> > the proposal easier to go through.
> >
> > ### 2. Message and “raw Message metadata” structure changes