I believe the problem is that the text value is not the only data associated with a token; there is, for instance, the position offset. Depending on your JVM, each object reference consumes 64 bits or so, so even if the text value is flyweighted by String.intern() there is still a per-token cost. I doubt that a document is flushed to the segment before a field's token stream has been exhausted.
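
To make that concrete, here is a minimal sketch (Lucene 2.x-style
TokenStream API, written from memory) showing that two occurrences of
the same word still come back as two distinct Token objects, each with
its own offsets:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public class TokenOverhead {
      public static void main(String[] args) throws Exception {
          // "foo" appears twice, but the analyzer emits two Token
          // instances, each carrying its own start/end offsets.
          TokenStream ts = new StandardAnalyzer()
                  .tokenStream("content", new StringReader("foo foo"));
          for (Token t = ts.next(); t != null; t = ts.next()) {
              System.out.println(t.termText() + " ["
                      + t.startOffset() + "," + t.endOffset() + ")");
          }
      }
  }

This prints the same term text twice with different offsets, so each
token costs memory regardless of whether the text itself is shared.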

--
karl


On 1 Sep 2007, at 21:50, Askar Zaidi wrote:

I have indexed around 100 MB of data with 512 MB allocated to the JVM
heap, so that gives you an idea. If every token is the same word in one
file, shouldn't the tokenizer recognize that?

Try using Luke. It helps solve lots of issues.
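
(Luke ships as a runnable jar; assuming the usual download, something
like "java -jar lukeall.jar" starts it, and you can then point it at
your index directory to inspect the terms.)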

-
AZ

On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

I can't answer the question of why the same token
takes up memory, but I've indexed far more than
20 MB of data in a single document field; on the
order of 150 MB, in fact. Of course, I allocated 1 GB
or so to the JVM, so you might try that....
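
(On Sun's JVM that's the -Xmx flag, for example "java -Xmx1g MyIndexer",
where MyIndexer is just a placeholder for your indexing class.)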

Best
Erick

On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:

I'm creating a tokenized "content" Field from a plain text file
using an InputStreamReader and new Field("content", in);

The text file is large, 20 MB, and contains zillions of lines,
each with the same 100-character token.

That causes an OutOfMemoryError.

Given that all tokens are the *same*,
why should this cause an OutOfMemoryError?
Shouldn't StandardAnalyzer just chug along
and note "ho hum, this token is the same"?
That shouldn't take too much memory.

Or have I missed something?
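
In case it matters, the code is essentially the following (a
trimmed-down sketch; the index path, file name, and analyzer choice
are just placeholders):

  import java.io.FileInputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class IndexBigFile {
      public static void main(String[] args) throws Exception {
          IndexWriter writer =
                  new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
          Reader in = new InputStreamReader(
                  new FileInputStream("big.txt"), "UTF-8");
          Document doc = new Document();
          // Field(String, Reader): tokenized and indexed, not stored.
          doc.add(new Field("content", in));
          writer.addDocument(doc);  // the reader is consumed here
          writer.close();
      }
  }

(The reader isn't actually consumed until addDocument(), which is
presumably where the error shows up.)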




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
