Hi Uwe, thanks for your reply, too. :)
I must admit that I was getting a bit ahead of myself in my mail: I am not using a TokenFilter yet, but expanding the tokens manually before sending them to Lucene. It is good to know that it makes a difference. I will definitely try the TokenStream-based solution to see if that solves the problem.

Thanks,
David

> On April 4, 2014 at 11:59 PM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> > The use-case is that some of the fields in the document are made up of
> > term:frequency pairs. What I am doing right now is to expand these with a
> > TokenFilter, so that for e.g. "dog:3 cat:2", I return "dog dog dog cat cat",
> > and index that. However, the problem is that when these fields contain real
> > data (anchor text, references, etc.), the resulting field texts for some
> > documents can be really huge; so much, in fact, that I get OutOfMemory
> > exceptions.
>
> If the TokenStream just repeats the same token without cloning the bytes over
> and over, this should not be an issue. The TokenFilter should use
> captureState() and re-emit the same token multiple times. This should have no
> effect on memory usage. What does your TokenFilter look like? I can check it;
> maybe there is a problem with it.
>
> Uwe
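P.S. For reference, here is a rough, untested sketch of the kind of captureState()/restoreState()-based filter I understand you to mean. The class name is mine, and it assumes the upstream tokenizer already emits whole "term:freq" tokens (for example, a whitespace tokenizer over "dog:3 cat:2"):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * Sketch (untested): splits "term:freq" tokens and emits the term freq times,
 * reusing captureState()/restoreState() so the term bytes are never copied.
 */
public final class TermFrequencyExpandingFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  private AttributeSource.State pendingState; // captured state of the current term
  private int remaining = 0;                  // repetitions still owed for that term

  public TermFrequencyExpandingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (remaining > 0) {
      // Re-emit the captured token instead of allocating a new copy.
      restoreState(pendingState);
      posIncAtt.setPositionIncrement(1); // use 0 to stack repeats on one position
      remaining--;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // Parse "term:freq"; tokens without a frequency pass through unchanged.
    final String token = termAtt.toString();
    final int colon = token.lastIndexOf(':');
    int freq = 1;
    if (colon > 0) {
      try {
        freq = Integer.parseInt(token.substring(colon + 1));
        termAtt.setLength(colon); // keep only the term part, no new buffer
      } catch (NumberFormatException e) {
        // not a term:freq pair, emit as-is
      }
    }
    if (freq > 1) {
      pendingState = captureState();
      remaining = freq - 1;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingState = null;
    remaining = 0;
  }
}

Chained after the tokenizer in the analyzer, this should emit "dog" three times and "cat" twice for "dog:3 cat:2" without ever building the expanded text in memory; tokens without a ":freq" suffix just come through once.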