Re: [VOTE] PIP-345: Optimize finding message by timestamp

太上玄元道君 Mon, 25 Mar 2024 02:47:16 -0700

Hi Penghui,

Thanks for your feedback!


I'm not sure about this either, since publishTimestamp is a messaging-layer
concept, and ML, as the persistence layer, should not be aware of it.

However, I noticed that some methods in ML already search for messages by
publish timestamp (for example,
ManagedLedgerImpl#getEarliestMessagePublishTimeInBacklog), which is why I
wanted to add publishTimestamp to ML.

Introducing a secondary index to ML is a good idea: RocketMQ has a `Hash
index`, and Kafka has a `Sparse index`.

For finding a message by timestamp, we could introduce a `sparse index` to
Pulsar: after an add-entry operation completes, add an index entry to the
`ManagedLedgerIndex` and store the index in ML. What do you think?
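
To make the idea concrete, here is a rough sketch only, not a proposed API.
`SparseTimestampIndex`, `EntryPosition`, `onEntryAdded` and `findScanStart`
are hypothetical names that do not exist in Pulsar today. The sketch also
assumes that publish timestamps are non-decreasing in entry order, which
real topics do not strictly guarantee. It samples one entry out of every
`interval` and uses the sampled timestamps to pick a starting point for the
scan.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical sparse index: publish timestamp -> position of a sampled entry. */
final class SparseTimestampIndex {

    /** Hypothetical stand-in for a managed-ledger position (ledgerId, entryId). */
    record EntryPosition(long ledgerId, long entryId) {}

    private final TreeMap<Long, EntryPosition> samples = new TreeMap<>();
    private final int interval;    // index one entry out of every `interval`
    private long entriesSeen = 0;

    SparseTimestampIndex(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    /** Intended to be called once an add-entry operation has completed. */
    void onEntryAdded(long publishTimestamp, long ledgerId, long entryId) {
        if (entriesSeen++ % interval == 0) {
            // A later sample with the same timestamp replaces the earlier one.
            // That is still a safe scan start because lookups only use samples
            // whose timestamp is strictly lower than the target.
            samples.put(publishTimestamp, new EntryPosition(ledgerId, entryId));
        }
    }

    /**
     * Returns the position to start scanning from: the sampled entry with the
     * greatest timestamp strictly below the target. Empty means the index
     * cannot help and the caller should fall back to a full scan.
     */
    Optional<EntryPosition> findScanStart(long targetTimestamp) {
        Map.Entry<Long, EntryPosition> e = samples.lowerEntry(targetTimestamp);
        return e == null ? Optional.empty() : Optional.of(e.getValue());
    }

    /** Number of sampled entries currently held in the index. */
    int size() {
        return samples.size();
    }
}
```

With this shape, the broker would call `onEntryAdded` from the add-entry
callback, and a find-by-timestamp request would start reading from the
position returned by `findScanStart` rather than from the beginning of the
backlog. An empty result corresponds to the full-scan fallback that Penghui
described.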

Thanks,
Tao Jiuming



On Mon, Mar 25, 2024 at 15:17, PengHui Li <[email protected]> wrote:

> Hi, Jiuming
>
> I'm sorry for not getting back to you sooner.
>
> First, I support the motivation to optimize this case because it could be a
> significant blocker for users who want infinite data retention, which is a
> BIG differentiator from Apache Kafka. I have seen real cases with high
> publish throughput where one ledger could hold 1M entries, with 100M new
> entries published to a topic.
>
> Then I checked the details of the existing implementation. I think the
> tricky part is that publish time is not a ManagedLedger concept. The changes
> you proposed would add publish time to the ManagedLedger module, which
> doesn't look good to me, because it would couple a Pulsar concept with the
> ManagedLedger.
>
> Essentially, the publish time could be a secondary index of the
> ManagedLedger. My suggestion is to have a general ManagedLedgerIndex
> abstraction, so that the Pulsar broker can create any index it wants. Since
> the broker creates the index, the broker controls the index's behavior. The
> ManagedLedger can then provide an API to search for an entry with a
> ManagedLedgerIndex. With this option, we don't need to add the publish time
> concept to ManagedLedger directly.
>
> In this case, if the broker searches for an entry with a predicate and an
> index, the managed ledger will search the index first. Of course, if the
> relevant entry cannot be found in the index, it just falls back to the
> "optimized full scan".
>
> Regards,
> Penghui
>
>
> On Mon, Mar 25, 2024 at 11:51 AM 太上玄元道君 <[email protected]> wrote:
>
> > bump
> >
> > On Wed, Mar 20, 2024 at 16:23, 太上玄元道君 <[email protected]> wrote:
> >
> > > bump
> > >
> > > On Tue, Mar 19, 2024 at 19:35, 太上玄元道君 <[email protected]> wrote:
> > >
> > >> Hi Pulsar community,
> > >>
> > >> This thread is to start a vote for PIP-345: Optimize finding message by
> > >> timestamp
> > >>
> > >> PIP: https://github.com/apache/pulsar/pull/22234
> > >> Discuss thread:
> > >> https://lists.apache.org/thread/5owc9os6wmy52zxbv07qo2jrfjm17hd2
> > >>
> > >> Thanks,
> > >> Tao Jiuming
> > >>
> > >
> >
>
