Hi,

> The use-case is that some of the fields in the document are made up of
> term:frequency pairs. What I am doing right now is to expand these with
> a TokenFilter, so that e.g. for "dog:3 cat:2", I return "dog dog dog
> cat cat", and index that. However, the problem is that when these
> fields contain real data (anchor text, references, etc.), the resulting
> field texts for some documents can be really huge; so much, in fact,
> that I get OutOfMemory exceptions.

If the TokenFilter just repeats the same token without cloning the term
bytes over and over, this should not be an issue. The filter should call
captureState() once per input token and then restoreState() to emit the
same token multiple times; that has no effect on memory usage. What does
your TokenFilter look like? I can check it; maybe there is a problem
with it.

Uwe

