Hi Penghui, Thanks for your feedback!
I'm not sure about this either: publishTimestamp is a messaging-layer concept, and the ML (ManagedLedger), as the persistence layer, should not be aware of it. However, I noticed that ML already has some methods that search for messages by publish timestamp (for example, ManagedLedgerImpl#getEarliestMessagePublishTimeInBacklog), which is why I wanted to add publishTimestamp to ML.

Introducing a secondary index to ML is a good idea; RocketMQ has a `hash index`, and Kafka has a `sparse index`. For finding messages by timestamp, we could introduce a `sparse index` to Pulsar: after adding entries completes, add an index entry to `ManagedLedgerIndex` and store the index in ML.

What do you think?

Thanks,
Tao Jiuming

PengHui Li <peng...@apache.org> wrote on Mon, Mar 25, 2024 at 15:17:

> Hi, Jiuming
>
> I'm sorry for not getting back to you sooner.
>
> First, I support the motivation to optimize this case because it could be a
> significant blocker for users who want infinite data retention, which is a
> BIG differentiator with Apache Kafka. And I have really seen cases with high
> publish throughput, where one ledger could even hold 1M entries, with 100M
> new entries published to a topic.
>
> Then, I tried to check the details of the existing implementation. I think
> the tricky part is that the publish time is not a concept of the
> ManagedLedger. I saw that the changes you proposed will add publish time to
> the ManagedLedger module, which doesn't look good to me, because it will
> couple the Pulsar concept with the ManagedLedger concept.
>
> Essentially, the publish time could be a secondary index of the
> ManagedLedger. My opinion is to have a general ManagedLedgerIndex
> abstraction, and the Pulsar broker can create any index it wants. Since the
> broker creates the index, the broker can control the index's behavior. Then,
> the ManagedLedger can provide an API to search for an entry with a
> ManagedLedgerIndex. With this option, we don't need to add the publish time
> concept to ManagedLedger directly.
>
> In this case, if the broker tries to search for an entry with a predicate
> and an index, the managed ledger will search the index first. Of course, if
> the relevant entry cannot be found in the index, it just falls back to the
> "optimized full scan".
>
> Regards,
> Penghui
>
>
> On Mon, Mar 25, 2024 at 11:51 AM 太上玄元道君 <dao...@apache.org> wrote:
>
> > bump
> >
> > 太上玄元道君 <dao...@apache.org> wrote on Wed, Mar 20, 2024 at 16:23:
> >
> > > bump
> > >
> > > 太上玄元道君 <dao...@apache.org> wrote on Tue, Mar 19, 2024 at 19:35:
> > >
> > > > Hi Pulsar community,
> > > >
> > > > This thread is to start a vote for PIP-345: Optimize finding message
> > > > by timestamp
> > > >
> > > > PIP: https://github.com/apache/pulsar/pull/22234
> > > > Discuss thread:
> > > > https://lists.apache.org/thread/5owc9os6wmy52zxbv07qo2jrfjm17hd2
> > > >
> > > > Thanks,
> > > > Tao Jiuming
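
To make the sparse-index idea from the reply at the top of this thread more concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: `SparseTimestampIndex`, `EntryPosition`, `onEntryAdded`, and `findScanStart` are hypothetical names, not existing Pulsar or ManagedLedger APIs, and the sketch assumes publish timestamps are non-decreasing in entry order, which real topics do not strictly guarantee. It records one index entry for every N-th added entry and, on lookup, returns a position from which a scan can start; an empty result means the caller should fall back to a full scan.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are existing Pulsar APIs.
final class SparseTimestampIndex {

    /** Simplified stand-in for a managed ledger position (ledgerId, entryId). */
    static final class EntryPosition {
        final long ledgerId;
        final long entryId;

        EntryPosition(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
    }

    private final int interval;
    // publish timestamp -> position of an indexed entry with that timestamp
    private final TreeMap<Long, EntryPosition> index = new TreeMap<>();
    private long entriesSeen = 0;

    SparseTimestampIndex(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    /** Call after an entry has been added; only every interval-th entry is indexed. */
    void onEntryAdded(long publishTimestamp, long ledgerId, long entryId) {
        if (entriesSeen % interval == 0) {
            index.put(publishTimestamp, new EntryPosition(ledgerId, entryId));
        }
        entriesSeen++;
    }

    /**
     * Returns a position from which to scan forward for the first entry
     * published at or after the given timestamp. The lookup uses the greatest
     * indexed timestamp strictly below the target, so unindexed entries that
     * share the target timestamp are not skipped. An empty result means the
     * index cannot help and the caller should fall back to a full scan.
     */
    Optional<EntryPosition> findScanStart(long timestamp) {
        Map.Entry<Long, EntryPosition> lower = index.lowerEntry(timestamp);
        return lower == null ? Optional.empty() : Optional.of(lower.getValue());
    }
}
```

With an interval of 2, for example, only every second entry is indexed, so the scan after `findScanStart` touches at most a bounded number of entries past the returned position instead of the whole backlog. How the index itself would be persisted in ML is left open here.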