Does FilterDirectoryReader do what you want? https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/index/FilterDirectoryReader.html
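
For illustration, here is a minimal, dependency-free sketch of the pattern in question: a composite that hands every leaf through one wrapper before exposing it. None of the types below are Lucene classes; Leaf, SegmentLeaf, InMemoryLeaf and FilteringComposite are hypothetical stand-ins. As I understand the 4.7 API, the corresponding hook in Lucene is FilterDirectoryReader.SubReaderWrapper and its wrap(AtomicReader) method, but treat that mapping as an assumption to check against the Javadoc linked above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

public class WrapLeavesSketch {
    /** Hypothetical stand-in for a per-segment (leaf) reader. */
    interface Leaf {
        String describe();
    }

    /** A plain leaf, as the composite would normally hand it out. */
    static final class SegmentLeaf implements Leaf {
        private final String name;
        SegmentLeaf(String name) { this.name = name; }
        @Override public String describe() { return name; }
    }

    /** A filter: delegates to the wrapped leaf and changes one behavior. */
    static final class InMemoryLeaf implements Leaf {
        private final Leaf in;
        InMemoryLeaf(Leaf in) { this.in = in; }
        @Override public String describe() { return "in-memory(" + in.describe() + ")"; }
    }

    /** Composite that applies one wrapper function to every leaf it is given. */
    static final class FilteringComposite {
        private final List<Leaf> leaves;
        FilteringComposite(List<Leaf> original, UnaryOperator<Leaf> wrapper) {
            List<Leaf> wrapped = new ArrayList<>();
            for (Leaf leaf : original) {
                wrapped.add(wrapper.apply(leaf));
            }
            this.leaves = Collections.unmodifiableList(wrapped);
        }
        List<Leaf> leaves() { return leaves; }
    }

    /** Builds two plain leaves, wraps them, and returns what each wrapped leaf reports. */
    public static List<String> demo() {
        List<Leaf> segments = new ArrayList<>();
        segments.add(new SegmentLeaf("_0"));
        segments.add(new SegmentLeaf("_1"));
        FilteringComposite filtered = new FilteringComposite(segments, InMemoryLeaf::new);
        List<String> out = new ArrayList<>();
        for (Leaf leaf : filtered.leaves()) {
            out.add(leaf.describe());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the shape is that the decision to wrap is made when the reader is opened, not when the index is written, which is the separation asked for further down the thread.
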

Alan Woodward
www.flax.co.uk

On 7 Apr 2014, at 22:19, Benson Margulies wrote:

> Typically, an app gets a directory reader, which is a composite
> reader. To get a filter down there into the leaves of the composite
> reader, does anyone have a suggestion about where to enter the
> modularity?
>
> I sort of want to insert myself at
> org.apache.lucene.index.StandardDirectoryReader#open(org.apache.lucene.store.Directory,
> org.apache.lucene.index.IndexCommit) wrapping the segment readers, or
> I could make a sort of filtering composite reader that wraps each of
> the segment readers in a filter.
>
>
> On Mon, Apr 7, 2014 at 1:02 PM, Shai Erera <[email protected]> wrote:
>> Given that DPF delegates indexing to another PF anyway (currently Lucene41),
>> I think this might be the case. We would need to test of course. The key
>> point is that this FilterAtomicReader will be able to serve anything as
>> direct, even DV, so it might eliminate DVF too. We need to experiment and
>> benchmark!
>>
>> Shai
>>
>> On Apr 7, 2014 7:32 PM, "[email protected]"
>> <[email protected]> wrote:
>>>
>>> Aaaah, nice idea to simply use FilterAtomicReader, of course! So this
>>> would ultimately be a new IndexReaderFactory that creates
>>> FilterAtomicReaders for a subset of the fields you want to do this on.
>>> Cool! With that, I don’t think there would be a need for
>>> DirectPostingsFormat as a postings format, would there be?
>>>
>>> ~ David
>>>
>>>
>>> On Mon, Apr 7, 2014 at 10:58 AM, Shai Erera <[email protected]> wrote:
>>>>
>>>> The only problem is how the Codec makes a dynamic decision on whether to
>>>> use the wrapped Codec for reading vs pre-load data into in-memory
>>>> structures, because Codecs are loaded through reflection by the SPI loading
>>>> mechanism.
>>>>
>>>> There is also a TODO in DirectPF to allow wrapping arbitrary PFs, just
>>>> mentioning in case you want to tackle DPF.
>>>>
>>>> I think that if we allowed passing something like a CodecLookupService,
>>>> with an SPILookupService default impl, you could easily pass that to
>>>> DirectoryReader which will use your runtime logic to load the right PF
>>>> (e.g. DPF) instead of the one the index was created with.
>>>>
>>>> But it sounds like the core problem is that when we load a Codec/PF/DVF
>>>> for reading, we cannot pass it any arguments, and so we must make an
>>>> index-time decision about how we're going to read the data later on. If we
>>>> could somehow support that, I think that will help you to achieve what you
>>>> want too.
>>>>
>>>> E.g. currently it's an all-or-nothing decision, but if we could pass a
>>>> parameter like "50% available heap", the Codec/PF/DVF could cache the
>>>> frequently accessed postings instead of loading all of them into memory.
>>>> But, that can also be achieved at the IndexReader level, through a custom
>>>> FilterAtomicReader. And if you could reuse DPF's structures (like
>>>> DirectTermsEnum, DirectFields...), it should be easier to do this. So
>>>> perhaps we can think about a DirectAtomicReader which does that? I believe
>>>> it can share some code w/ DPF, as long as we don't make these APIs public,
>>>> or make them @super.experimental and @super.expert.
>>>>
>>>> Just throwing some ideas...
>>>>
>>>> Shai
>>>>
>>>>
>>>> On Mon, Apr 7, 2014 at 5:35 PM, [email protected]
>>>> <[email protected]> wrote:
>>>>>
>>>>> Benson, I like your idea.
>>>>>
>>>>> I think your idea can be achieved as a codec, one that wraps another
>>>>> codec that establishes the on-disk format. By default the wrapped codec
>>>>> can be Lucene’s default codec. I think, if implemented, this would be a
>>>>> change to DPF instead of an additional DPF-variant codec.
>>>>>
>>>>> ~ David
>>>>>
>>>>>
>>>>> On Mon, Apr 7, 2014 at 9:22 AM, Benson Margulies <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> On Mon, Apr 7, 2014 at 9:14 AM, Robert Muir <[email protected]> wrote:
>>>>>>> On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> My takeaway from the prior conversation was that various people didn't
>>>>>>>> entirely believe that I'd seen a dramatic improvement in query perfo
>>>>>>>> using D-P-F, and so would not smile upon a patch intended to liberate
>>>>>>>> D-P-F from codecs. It could be that the effect I saw has to do with
>>>>>>>> the fact that our system depends on hitting and scoring 50% of the
>>>>>>>> documents in an index with a lot of documents.
>>>>>>>>
>>>>>>>
>>>>>>> I dont understand the word "liberate" here. why is it such a problem
>>>>>>> that this is a codec?
>>>>>>
>>>>>> I don't want to have to declare my intentions at the time I create
>>>>>> the index. I don't want to have to use D-P-F for all readers all the
>>>>>> time. Because I want to be able to decide to open up an index with an
>>>>>> arbitrary on-disk format and get the in-memory cache behavior of
>>>>>> D-P-F. Thus 'liberate' -- split the question of 'keep a copy in
>>>>>> memory' from the choice of the on-disk format.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> i do not think we should give it any more status than that, it wastes
>>>>>>> too much ram.
>>>>>>
>>>>>> It didn't seem like 'waste' when it solved a big practical for us. We
>>>>>> had an application that was too slow, and had plenty of RAM available,
>>>>>> and we were able to trade space for time by applying D-P-F.
>>>>>>
>>>>>> Maybe I'm going about this backwards; if I can come up with a small,
>>>>>> inconspicuous proposed change that does what I want, there won't be
>>>>>> any disagreement.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
