Hi, Jiuming

I'm sorry for not getting back to you sooner.

First, I support the motivation to optimize this case, because it could be a
significant blocker for users who want infinite data retention, which is a
BIG differentiator from Apache Kafka. And I have really seen cases with high
publish throughput, where a single ledger could even hold 1M entries and
100M new entries were published to a topic.

Then I tried to check the details of the existing implementation. I think
the tricky part is that publish time is not a concept of the ManagedLedger.
The changes you proposed would add publish time to the ManagedLedger module,
which doesn't look good to me, because it couples the Pulsar concept with
the ManagedLedger concept.

Essentially, the publish time could be a secondary index of the ManagedLedger.
My opinion is to have a general ManagedLedgerIndex abstraction, and the Pulsar
broker can create any index it wants. Since the broker creates the index, the
broker can control the index's behavior. Then, the ManagedLedger can provide
an API to search for an entry with a ManagedLedgerIndex. With this option, we
don't need to add the publish time concept to ManagedLedger directly (see the
rough sketch below).
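
To make the idea concrete, here is a minimal sketch of what such an
abstraction could look like. All of these names (ManagedLedgerIndex,
floorPosition, IndexedSearchSupport, asyncFindPosition) are hypothetical
placeholders, not existing APIs; only Position and Entry come from the
org.apache.bookkeeper.mledger package today:

import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

import org.apache.bookkeeper.mledger.Entry;
import org.apache.bookkeeper.mledger.Position;

/**
 * Hypothetical secondary-index abstraction. The broker owns the index and
 * decides what the key means (publish time, event time, ...); the managed
 * ledger only sees opaque keys and positions.
 */
interface ManagedLedgerIndex<K extends Comparable<K>> {

    /** Record a mapping from an index key to the position of an entry. */
    void append(K key, Position position);

    /**
     * Return the position of the latest indexed entry whose key is at or
     * before the given key, or empty if the index has no data for that key.
     */
    Optional<Position> floorPosition(K key);
}

/**
 * Hypothetical addition to the ManagedLedger API: search with a
 * broker-provided index, and fall back to scanning entries with the
 * predicate when the index misses.
 */
interface IndexedSearchSupport {

    <K extends Comparable<K>> CompletableFuture<Position> asyncFindPosition(
            ManagedLedgerIndex<K> index,
            K key,
            Predicate<Entry> fallbackPredicate);
}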

In this case, if the broker tries to search for an entry with a predicate and
an index, the managed ledger will search the index first. Of course, if the
relevant entry cannot be found in the index, it just falls back to the
"optimized full scan".
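
As an illustration only (again, the names are hypothetical and assume the
ManagedLedgerIndex interface sketched above), a broker-side publish-time
index could be as simple as a sorted map of sampled (publish time ->
position) pairs; the broker decides how sparse the samples are and how they
are persisted, and the managed ledger only uses it to narrow down where the
scan starts:

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.bookkeeper.mledger.Position;

/** Hypothetical broker-owned index: sampled publish time -> position. */
class PublishTimeIndex implements ManagedLedgerIndex<Long> {

    private final ConcurrentSkipListMap<Long, Position> samples =
            new ConcurrentSkipListMap<>();

    @Override
    public void append(Long publishTime, Position position) {
        samples.put(publishTime, position);
    }

    @Override
    public Optional<Position> floorPosition(Long publishTime) {
        // Latest sample at or before the requested publish time, if any.
        Map.Entry<Long, Position> floor = samples.floorEntry(publishTime);
        return Optional.ofNullable(floor).map(Map.Entry::getValue);
    }
}

The search flow would then be: ask the index for floorPosition(targetTime);
if it returns a position, start the predicate scan from there; if it returns
empty (or the index is missing data), fall back to the existing "optimized
full scan" behavior.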

Regards,
Penghui


On Mon, Mar 25, 2024 at 11:51 AM 太上玄元道君 <dao...@apache.org> wrote:

> bump
>
> On Wed, Mar 20, 2024 at 16:23, 太上玄元道君 <dao...@apache.org> wrote:
>
> > bump
> >
> > On Tue, Mar 19, 2024 at 19:35, 太上玄元道君 <dao...@apache.org> wrote:
> >
> >> Hi Pulsar community,
> >>
> >> This thread is to start a vote for PIP-345: Optimize finding message by
> >> timestamp
> >>
> >> PIP: https://github.com/apache/pulsar/pull/22234
> >> Discuss thread:
> >> https://lists.apache.org/thread/5owc9os6wmy52zxbv07qo2jrfjm17hd2
> >>
> >> Thanks,
> >> Tao Jiuming
> >>
> >
>
