Hi Greg,

thanks for the reply. We used #1 before, but we want to get rid of positions in
our index; they had a very noticeable effect on performance.
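(For completeness, this is roughly what our #1 setup looked like; just a sketch
from memory, and the filter name is our own. It keeps each term once and stores
the frequency as a payload instead of repeating the term:)

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Turns a "dog:3" token into "dog", storing the 3 as a payload. */
    public final class FreqToPayloadFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        public FreqToPayloadFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String token = termAtt.toString();
            int colon = token.lastIndexOf(':');
            if (colon >= 0) {
                float freq = Float.parseFloat(token.substring(colon + 1));
                termAtt.setLength(colon);  // keep only the term part, e.g. "dog"
                payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
            }
            return true;
        }
    }

On the query side this was paired with PayloadTermQuery and a Similarity whose
scorePayload() decodes the float with PayloadHelper.decodeFloat(). Since
payloads need positions, though, that is exactly the cost we want to avoid now.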
As for #2: I was looking for something like this, thanks! Now the only question
is how to do it. :) Can I specify which TermsConsumer to use, the same way as
with e.g. the Similarities? And is it enough to modify only the TermsConsumer,
or do I have to look at other files, too? (I have put a rough sketch of what I
have in mind below the quoted mail.)

Thanks,
David

> On April 4, 2014 at 11:09 PM Gregory Dearing <gregdear...@gmail.com> wrote:
>
> Hi David,
>
> I'm not an expert, but I've climbed through the consumers myself in the
> past. The big limit is that the full postings for a document or document
> block must fit into memory. There may be other hidden processing limits
> (i.e. memory used per field).
>
> I think it would be possible to create a custom consumer chain that avoids
> these limits, but it would be a lot of work.
>
> My suggestions would be...
>
> 1.) If you're able to index your documents without expanding terms,
> consider whether expansion is really necessary.
>
> If you're expanding them for relevance purposes, then consider storing the
> frequency as a payload. You can use something like PayloadTermQuery and
> Similarity.scorePayload() to adjust scoring based on the value. I wouldn't
> expect this to noticeably affect query times but, of course, it will depend
> on your use case.
>
> 2.) I think you could override your TermsConsumer's implementation of
> finishTerm() to rewrite "dog:3" as "dog" and multiply the term frequency by
> 3, right before the term is written to the postings. This is not for the
> faint of heart, and I wouldn't recommend trying it unless #1 doesn't meet
> your needs.
>
> -Greg
>
> On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey <da...@cliqz.com> wrote:
>
> > Hi guys,
> >
> > I have just recently (re-)joined the list. I have an issue with indexing;
> > I hope someone can help me with it.
> >
> > The use case is that some of the fields in the document are made up of
> > term:frequency pairs. What I am doing right now is to expand these with a
> > TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat
> > cat", and index that. However, the problem is that when these fields
> > contain real data (anchor text, references, etc.), the resulting field
> > texts for some documents can be really huge; so much, in fact, that I get
> > OutOfMemory exceptions.
> >
> > I would be grateful if someone could tell me how this issue could be
> > solved. I thought of circumventing the problem by capping the frequency,
> > or using the logarithm thereof, but it would be nice to know if there is
> > a proper solution to the problem. I have had a look at the code, but got
> > lost in all the different Consumers. Here are a few questions I have come
> > up with, but the real solution might be something entirely different...
> >
> > 1. Is there information on how much using payloads (and hence positions)
> > slows down querying?
> > 2. Provided that I do not want payloads, can I extend something (perhaps
> > a Consumer) to achieve what I want?
> > 3. Is there documentation somewhere that describes how indexing works,
> > i.e. which Consumer, Writer, etc. is invoked when?
> > 4. Am I better off just post-processing indices, perhaps by writing the
> > frequency to a payload during indexing, and then running through the
> > index, removing the payloads and positions, and writing the posting lists
> > myself?
> >
> > Thank you very much.
> >
> > Best,
> > Dávid Nemeskey
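P.S. To make my question about #2 more concrete, below is a rough sketch of how
I imagine plugging a custom postings chain in (Lucene 4.7 API;
FreqExpandingPostingsFormat is a made-up name for a format that would wrap
Lucene41PostingsFormat and whose TermsConsumer strips the ":3" suffix and
scales the frequencies in finishTerm()/startDoc(), as you suggest). Please tell
me if this is not the intended entry point:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class FreqExpandingIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical format wrapping Lucene41PostingsFormat (see above).
            final PostingsFormat freqExpanding = new FreqExpandingPostingsFormat();

            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47,
                    new StandardAnalyzer(Version.LUCENE_47));
            // Unlike a Similarity, the consumer chain is not set directly on the
            // config; it is reached through Codec -> PostingsFormat -> FieldsConsumer.
            iwc.setCodec(new Lucene46Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // only the term:frequency fields get the custom format
                    if ("anchors".equals(field) || "references".equals(field)) {
                        return freqExpanding;
                    }
                    return super.getPostingsFormatForField(field);
                }
            });

            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")), iwc);
            // ... addDocument() calls as usual ...
            writer.close();
        }
    }

If I read the javadocs correctly, such a format would also have to be
registered via SPI (a META-INF/services/org.apache.lucene.codecs.PostingsFormat
entry) so that the segments can be read back later; is that right?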