Re: [DISCUSS] PIP-345: Optimize finding message by timestamp

太上玄元道君 Sat, 16 Mar 2024 23:08:17 -0700

Hi Yubiao,

Thanks for your review!


1. As I mentioned in the *Alternatives*  part of the PIP

`LedgerInfo#timestamp` is broker's timestamp, finding message by timestamp
is client's timestamp(publishTimestamp).
We have to consider about client's clock doesn't sync to broker's clock,
and we are finding a message by publishTimestamp, but pre-filter by
broker's timestamp, it doesn't make sense.

2. Why do we need both `minPublishTimestamp` and `maxPublishTimestamp`

As the email I replied to Girish, the next ledger's `minPublishTimestamp`
might be less than it's previous ledger's `maxPublishTimestamp`,
so we can't use next ledger's `minPublishTimestamp` as it's previous
ledger's `maxPublishTimestamp`, I want to make it clear.

Thanks,
Tao Jiuming

Yubiao Feng <yubiao.f...@streamnative.io.invalid> 于2024年3月16日周六 20:29写道：

> Hi Jiuming
>
> Firstly, I think the idea you provided is great.
>
> It seems the field `beginPublishTimestamp` is also not needed, and there is
> an existing field `ledgerInfo.timestamp` to use,
> - The current ledger's timestamp can be used as `beginPublishTimestamp`
> - The next ledger's timestamp can be used as `endPublishTimestamp`
>
> Thanks
> Yubiao Feng
>
>
> On Fri, Mar 15, 2024 at 1:57 PM 太上玄元道君 <dao...@apache.org> wrote:
>
> > Hi, Girish,
> >
> > Thanks for your feedback!
> >
> > In general, it's a very good suggestion, we can just use one single
> > `beginPublishTimestamp` to achieve our goal,
> > but the actual problem will be a bit more complex.
> >
> > Actually, the naming of `beginPublishTimestamp` and `endPublishTimestamp`
> > has a little problem,
> > it should be `minPublishTimestamp` and `maxPublishTimestamp`.
> >
> > In some cases, next ledger's `minPublishTimestamp` may less than it's
> > previous ledger's `maxPublishTimestamp`,
> > so we have to maintain both `minPublishTimestamp` and
> > `maxPublishTimestamp`.
> >
> > Say, there are 2 producers publishing to the topic, Producer1 send
> > *message1* to the topic, broker received
> > *message1* immediately and persist it to the ledger. Producer2 send
> > *message2* to the broker *before* *message1*,
> > but for some reason, broker received *message2* after a while.
> > At the same time, Ledger switching happens, the previous ledger's
> > `maxPublishTimestamp` is *message1*'s publishTimestamp,
> > and the current ledger's `minPublishTimestamp` is *message2*'s
> > publishTimestamp,
> > so the current ledger's `minPublishTimestamp` is less than the previous
> > ledger's `maxPublishTimestamp`, right?
> >
> > If we just have a single field  `minPublishTimestamp`, it will have a
> > hidden meaning: the next ledger's `minPublishTimestamp`
> > is it's previous ledger's `maxPublishTimestamp`, it's incorrect.
> > So we want to introduce `minPublishTimestamp` and `maxPublishTimestamp`
> to
> > make it clear.
> >
> > Thanks,
> > Tao Jiuming
> >
> > Girish Sharma <scrapmachi...@gmail.com> 于2024年3月15日周五 12:14写道：
> >
> > > One suggestion, I think you can make do with storing just begin
> > timestamp.
> > > Any search utilising these values will work the same way with just one
> of
> > > those timestamps compared to both begin and end.
> > >
> > > Any particular reason you need both the timestamps?
> > >
> > > Regards
> > >
> > > On Fri, Mar 15, 2024, 9:39 AM 太上玄元道君 <dao...@apache.org> wrote:
> > >
> > > > bump
> > > >
> > > > 太上玄元道君 <dao...@apache.org>于2024年3月10日 周日06:41写道：
> > > >
> > > > > Hi Pulsar community,
> > > > >
> > > > > A new PIP is opened, this thread is to discuss PIP-345: Optimize
> > > finding
> > > > > message by timestamp.
> > > > >
> > > > > Motivation:
> > > > > Finding message by timestamp is widely used in Pulsar:
> > > > > * It is used by the `pulsar-admin` tool to get the message id by
> > > > > timestamp, expire messages by timestamp, and reset cursor.
> > > > > * It is used by the `pulsar-client` to reset the subscription to a
> > > > > specific timestamp.
> > > > > * And also used by the `expiry-monitor` to find the messages that
> are
> > > > > expired.
> > > > > Even though the current implementation is correct, and using binary
> > > > search
> > > > > to speed-up, but it's still not efficient *enough*.
> > > > > The current implementation is to scan all the ledgers to find the
> > > message
> > > > > by timestamp.
> > > > > This is a performance bottleneck, especially for large topics with
> > many
> > > > > messages.
> > > > > Say, if there is a topic which has 1m entries, through the binary
> > > search,
> > > > > it will take 20 iterations to find the message.
> > > > > In some extreme cases, it may lead to a timeout, and the client
> will
> > > not
> > > > > be able to seeking by timestamp.
> > > > >
> > > > > PIP: https://github.com/apache/pulsar/pull/22234
> > > > >
> > > > > Your feedback is very important to us, please take a moment to
> review
> > > the
> > > > > proposal and provide your thoughts.
> > > > >
> > > > > Thanks,
> > > > > Tao Jiuming
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] PIP-345: Optimize finding message by timestamp

Reply via email to