Hi Uwe, thanks for your reply, too. :)
I must admit that I was getting a bit ahead of myself in my mail: I am not using a TokenFilter yet, but expanding the tokens manually before sending them to Lucene. It is good to know that it makes a difference. I will definitely try the TokenStream-based solution to see if that solves the problem.

Thanks,
David

> On April 4, 2014 at 11:59 PM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> > The use-case is that some of the fields in the document are made up of
> > term:frequency pairs. What I am doing right now is to expand these with a
> > TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat cat",
> > and index that. However, the problem is that when these fields contain real
> > data (anchor text, references, etc.), the resulting field texts for some
> > documents can be really huge; so much, in fact, that I get OutOfMemory
> > exceptions.
>
> If the TokenStream just repeats the same token without cloning the bytes over
> and over, this should not be an issue. The TokenFilter should use
> captureState() and re-emit the same token multiple times. This should have no
> effect on memory usage. What does your TokenFilter look like? I can check it;
> maybe there is a problem with it.
>
> Uwe
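P.S. For reference, here is a rough, untested sketch of the kind of captureState()/restoreState()-based filter I understand you to mean. The class name is mine, and it assumes the upstream tokenizer already emits whole "term:freq" tokens (for example, a whitespace tokenizer over "dog:3 cat:2"):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * Sketch (untested): splits "term:freq" tokens and emits the term freq times,
 * reusing captureState()/restoreState() so the term bytes are never copied.
 */
public final class TermFrequencyExpandingFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  private AttributeSource.State pendingState; // captured state of the current term
  private int remaining = 0;                  // repetitions still owed for that term

  public TermFrequencyExpandingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (remaining > 0) {
      // Re-emit the captured token instead of allocating a new copy.
      restoreState(pendingState);
      posIncAtt.setPositionIncrement(1); // use 0 to stack repeats on one position
      remaining--;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // Parse "term:freq"; tokens without a frequency pass through unchanged.
    final String token = termAtt.toString();
    final int colon = token.lastIndexOf(':');
    int freq = 1;
    if (colon > 0) {
      try {
        freq = Integer.parseInt(token.substring(colon + 1));
        termAtt.setLength(colon); // keep only the term part, no new buffer
      } catch (NumberFormatException e) {
        // not a term:freq pair, emit as-is
      }
    }
    if (freq > 1) {
      pendingState = captureState();
      remaining = freq - 1;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingState = null;
    remaining = 0;
  }
}

Chained after the tokenizer in the analyzer, this should emit "dog" three times and "cat" twice for "dog:3 cat:2" without ever building the expanded text in memory; tokens without a ":freq" suffix just come through once.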