I believe the problem is that the text value is not the only data associated with a token; there is, for instance, the position offset. Depending on your JVM, each object reference consumes 64 bits or so, so even if the text value is flyweighted by String.intern() there is still a per-token cost. I doubt that a document is flushed to the segment before a field's token stream has been exhausted.
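
To make that concrete, here is a minimal sketch (Lucene 2.x-style
TokenStream API, written from memory) showing that two occurrences of
the same word still come back as two distinct Token objects, each with
its own offsets:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class TokenOverhead {
      public static void main(String[] args) throws Exception {
          // "foo" appears twice, but the analyzer emits two Token
          // instances, each carrying its own start/end offsets.
          TokenStream ts = new StandardAnalyzer()
                  .tokenStream("content", new StringReader("foo foo"));
          for (Token t = ts.next(); t != null; t = ts.next()) {
              System.out.println(t.termText() + " ["
                      + t.startOffset() + "," + t.endOffset() + ")");
          }
      }
  }

This prints the same term text twice with different offsets, so each
token costs memory regardless of whether the text itself is shared.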

--
karl


On 1 Sep 2007, at 21:50, Askar Zaidi wrote:

I have indexed around 100 MB of data with 512 MB allocated to the JVM
heap, so that gives you an idea. If every token is the same word in one
file, shouldn't the tokenizer recognize that?

Try using Luke. It helps solve lots of issues.
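
(Luke ships as a runnable jar; assuming the usual download, something
like "java -jar lukeall.jar" starts it, and you can then point it at
your index directory to inspect the terms.)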

-
AZ

On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

I can't answer the question of why the same token
takes up memory, but I've indexed far more than
20 MB of data in a single document field; on the
order of 150 MB, in fact. Of course, I allocated 1 GB
or so to the JVM, so you might try that....
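
(On Sun's JVM that's the -Xmx flag, for example "java -Xmx1g MyIndexer",
where MyIndexer is just a placeholder for your indexing class.)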

Best
Erick

On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:

I'm creating a tokenized "content" Field from a plain text file
using an InputStreamReader and new Field("content", in);

The text file is large, 20 MB, and contains zillions of lines,
each with the same 100-character token.

That causes an OutOfMemoryError.

Given that all tokens are the *same*,
why should this cause an OutOfMemoryError?
Shouldn't StandardAnalyzer just chug along
and note "ho hum, this token is the same"?
That shouldn't take too much memory.

Or have I missed something?
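
In case it matters, the code is essentially the following (a
trimmed-down sketch; the index path, file name, and analyzer choice
are just placeholders):

  import java.io.FileInputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class IndexBigFile {
      public static void main(String[] args) throws Exception {
          IndexWriter writer =
                  new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
          Reader in = new InputStreamReader(
                  new FileInputStream("big.txt"), "UTF-8");
          Document doc = new Document();
          // Field(String, Reader): tokenized and indexed, not stored.
          doc.add(new Field("content", in));
          writer.addDocument(doc);  // the reader is consumed here
          writer.close();
      }
  }

(The reader isn't actually consumed until addDocument(), which is
presumably where the error shows up.)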




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
