Hi, Girish,

Thanks for your feedback!

In general it's a very good suggestion: in the simple case a single
`beginPublishTimestamp` would be enough to achieve our goal,
but the actual problem is a bit more complex.

Actually, the names `beginPublishTimestamp` and `endPublishTimestamp`
are slightly misleading;
they should be `minPublishTimestamp` and `maxPublishTimestamp`.

In some cases, a ledger's `minPublishTimestamp` may be less than its
previous ledger's `maxPublishTimestamp`,
so we have to maintain both `minPublishTimestamp` and `maxPublishTimestamp`.

Say there are 2 producers publishing to the topic. Producer1 sends
*message1* to the topic; the broker receives
*message1* immediately and persists it to the ledger. Producer2 sent
*message2* to the broker *before* *message1* was sent,
but for some reason the broker receives *message2* only after a while.
In the meantime a ledger switch happens, so the previous ledger's
`maxPublishTimestamp` is *message1*'s publishTimestamp,
and the current ledger's `minPublishTimestamp` is *message2*'s
publishTimestamp.
The current ledger's `minPublishTimestamp` is therefore less than the previous
ledger's `maxPublishTimestamp`, right?

If we keep only the single field `minPublishTimestamp`, it carries a
hidden meaning: the next ledger's `minPublishTimestamp`
is its previous ledger's `maxPublishTimestamp`, which is incorrect.
So we want to introduce both `minPublishTimestamp` and `maxPublishTimestamp` to
make it clear.
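
To make the overlap concrete, here is a minimal sketch in Java. It is not the
PIP's implementation: the class `LedgerTimestampIndex`, the type `LedgerRange`
and both lookup methods are hypothetical names used only for illustration, and
the timestamps are made-up values. It compares a lookup that uses both fields
with a naive one that keeps only `minPublishTimestamp` and assumes each ledger
ends where the next one begins.

```java
import java.util.ArrayList;
import java.util.List;

public class LedgerTimestampIndex {
    // Hypothetical per-ledger metadata; field names follow the naming proposed above.
    public static final class LedgerRange {
        final long ledgerId;
        final long minPublishTimestamp;
        final long maxPublishTimestamp;

        public LedgerRange(long ledgerId, long minPublishTimestamp, long maxPublishTimestamp) {
            this.ledgerId = ledgerId;
            this.minPublishTimestamp = minPublishTimestamp;
            this.maxPublishTimestamp = maxPublishTimestamp;
        }
    }

    // With both fields: a ledger is a candidate if its [min, max] range covers the timestamp.
    public static List<Long> candidatesMinMax(List<LedgerRange> ledgers, long timestamp) {
        List<Long> result = new ArrayList<>();
        for (LedgerRange l : ledgers) {
            if (l.minPublishTimestamp <= timestamp && timestamp <= l.maxPublishTimestamp) {
                result.add(l.ledgerId);
            }
        }
        return result;
    }

    // With only min (assumption): treat the next ledger's min as this ledger's upper bound.
    public static List<Long> candidatesMinOnly(List<LedgerRange> ledgers, long timestamp) {
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < ledgers.size(); i++) {
            long lower = ledgers.get(i).minPublishTimestamp;
            long upper = (i + 1 < ledgers.size())
                    ? ledgers.get(i + 1).minPublishTimestamp
                    : Long.MAX_VALUE;
            if (lower <= timestamp && timestamp < upper) {
                result.add(ledgers.get(i).ledgerId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<LedgerRange> ledgers = List.of(
                new LedgerRange(1L, 100L, 200L),   // max = message1's publishTimestamp
                new LedgerRange(2L, 190L, 300L));  // min = message2's publishTimestamp
        System.out.println(candidatesMinMax(ledgers, 195L));  // prints [1, 2]
        System.out.println(candidatesMinOnly(ledgers, 195L)); // prints [2]
    }
}
```

With only `minPublishTimestamp`, ledger 1 is skipped for timestamp 195 even
though it may hold messages published at that time.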

Thanks,
Tao Jiuming

On Fri, Mar 15, 2024 at 12:14, Girish Sharma <scrapmachi...@gmail.com> wrote:

> One suggestion, I think you can make do with storing just begin timestamp.
> Any search utilising these values will work the same way with just one of
> those timestamps compared to both begin and end.
>
> Any particular reason you need both the timestamps?
>
> Regards
>
> On Fri, Mar 15, 2024, 9:39 AM 太上玄元道君 <dao...@apache.org> wrote:
>
> > bump
> >
> > On Sun, Mar 10, 2024 at 06:41, 太上玄元道君 <dao...@apache.org> wrote:
> >
> > > Hi Pulsar community,
> > >
> > > > A new PIP is opened, this thread is to discuss PIP-345: Optimize
> > > > finding message by timestamp.
> > >
> > > Motivation:
> > > Finding message by timestamp is widely used in Pulsar:
> > > * It is used by the `pulsar-admin` tool to get the message id by
> > > timestamp, expire messages by timestamp, and reset cursor.
> > > * It is used by the `pulsar-client` to reset the subscription to a
> > > specific timestamp.
> > > * And also used by the `expiry-monitor` to find the messages that are
> > > expired.
> > > > Even though the current implementation is correct and uses binary
> > > > search to speed things up, it is still not efficient *enough*.
> > > > The current implementation scans all the ledgers to find the message
> > > > by timestamp.
> > > This is a performance bottleneck, especially for large topics with many
> > > messages.
> > > > Say there is a topic with 1m entries: the binary search will take
> > > > about 20 iterations to find the message.
> > > > In some extreme cases this may lead to a timeout, and the client will
> > > > not be able to seek by timestamp.
> > >
> > > PIP: https://github.com/apache/pulsar/pull/22234
> > >
> > > > Your feedback is very important to us, please take a moment to review
> > > > the proposal and provide your thoughts.
> > >
> > > Thanks,
> > > Tao Jiuming
> > >
> >
>
