Hi Greg,

thanks for the reply. We used #1 before, but we want to get rid of positions in
our index; they had a very noticeable effect on performance.
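(For completeness, this is roughly what our #1 setup looked like; just a sketch
from memory, and the filter name is our own. It keeps each term once and stores
the frequency as a payload instead of repeating the term:)

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Turns a "dog:3" token into "dog", storing the 3 as a payload. */
    public final class FreqToPayloadFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        public FreqToPayloadFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String token = termAtt.toString();
            int colon = token.lastIndexOf(':');
            if (colon >= 0) {
                float freq = Float.parseFloat(token.substring(colon + 1));
                termAtt.setLength(colon);  // keep only the term part, e.g. "dog"
                payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(freq)));
            }
            return true;
        }
    }

On the query side this was paired with PayloadTermQuery and a Similarity whose
scorePayload() decodes the float with PayloadHelper.decodeFloat(). Since
payloads need positions, though, that is exactly the cost we want to avoid now.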
As for #2: I was looking for something like this, thanks! Now the only question
is how to do it. :) Can I specify which TermsConsumer to use, the same way as
with e.g. the Similarities? And is it enough to modify only the TermsConsumer,
or do I have to look at other files, too? (I have put a rough sketch of what I
have in mind below the quoted mail.)

Thanks,
David

> On April 4, 2014 at 11:09 PM Gregory Dearing <gregdear...@gmail.com> wrote:
>
> Hi David,
>
> I'm not an expert, but I've climbed through the consumers myself in the
> past. The big limit is that the full postings for a document or document
> block must fit into memory. There may be other hidden processing limits
> (i.e. memory used per field).
>
> I think it would be possible to create a custom consumer chain that avoids
> these limits, but it would be a lot of work.
>
> My suggestions would be...
>
> 1.) If you're able to index your documents without expanding terms,
> consider whether expansion is really necessary.
>
> If you're expanding them for relevance purposes, then consider storing the
> frequency as a payload. You can use something like PayloadTermQuery and
> Similarity.scorePayload() to adjust scoring based on the value. I wouldn't
> expect this to noticeably affect query times but, of course, it will depend
> on your use case.
>
> 2.) I think you could override your TermsConsumer's implementation of
> finishTerm() to rewrite "dog:3" as "dog" and multiply the term frequency by
> 3, right before the term is written to the postings. This is not for the
> faint of heart, and I wouldn't recommend trying it unless #1 doesn't meet
> your needs.
>
> -Greg
>
> On Fri, Apr 4, 2014 at 6:16 AM, Dávid Nemeskey <da...@cliqz.com> wrote:
>
> > Hi guys,
> >
> > I have just recently (re-)joined the list. I have an issue with indexing;
> > I hope someone can help me with it.
> >
> > The use case is that some of the fields in the document are made up of
> > term:frequency pairs. What I am doing right now is to expand these with a
> > TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat
> > cat", and index that. However, the problem is that when these fields
> > contain real data (anchor text, references, etc.), the resulting field
> > texts for some documents can be really huge; so much, in fact, that I get
> > OutOfMemory exceptions.
> >
> > I would be grateful if someone could tell me how this issue could be
> > solved. I thought of circumventing the problem by capping the frequency,
> > or using the logarithm thereof, but it would be nice to know if there is
> > a proper solution to the problem. I have had a look at the code, but got
> > lost in all the different Consumers. Here are a few questions I have come
> > up with, but the real solution might be something entirely different...
> >
> > 1. Is there information on how much using payloads (and hence positions)
> > slows down querying?
> > 2. Provided that I do not want payloads, can I extend something (perhaps
> > a Consumer) to achieve what I want?
> > 3. Is there documentation somewhere that describes how indexing works,
> > i.e. which Consumer, Writer, etc. is invoked when?
> > 4. Am I better off just post-processing indices, perhaps by writing the
> > frequency to a payload during indexing, and then running through the
> > index, removing the payloads and positions, and writing the posting lists
> > myself?
> >
> > Thank you very much.
> >
> > Best,
> > Dávid Nemeskey
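P.S. To make my question about #2 more concrete, below is a rough sketch of how
I imagine plugging a custom postings chain in (Lucene 4.7 API;
FreqExpandingPostingsFormat is a made-up name for a format that would wrap
Lucene41PostingsFormat and whose TermsConsumer strips the ":3" suffix and
scales the frequencies in finishTerm()/startDoc(), as you suggest). Please tell
me if this is not the intended entry point:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class FreqExpandingIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical format wrapping Lucene41PostingsFormat (see above).
            final PostingsFormat freqExpanding = new FreqExpandingPostingsFormat();

            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47,
                    new StandardAnalyzer(Version.LUCENE_47));
            // Unlike a Similarity, the consumer chain is not set directly on the
            // config; it is reached through Codec -> PostingsFormat -> FieldsConsumer.
            iwc.setCodec(new Lucene46Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // only the term:frequency fields get the custom format
                    if ("anchors".equals(field) || "references".equals(field)) {
                        return freqExpanding;
                    }
                    return super.getPostingsFormatForField(field);
                }
            });

            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")), iwc);
            // ... addDocument() calls as usual ...
            writer.close();
        }
    }

If I read the javadocs correctly, such a format would also have to be
registered via SPI (a META-INF/services/org.apache.lucene.codecs.PostingsFormat
entry) so that the segments can be read back later; is that right?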