Re: Re: OutOfMemoryError tokenizing a boring text file

2007-09-11 Thread Per Lindberg
> From: Chris Hostetter [mailto:[EMAIL PROTECTED] > : Setting writer.setMaxFieldLength(5000) (default is 10,000) > : seems to eliminate the risk for an OutOfMemoryError, > > that's because it now gives up after parsing 5000 tokens. > > : To me, it appears that simply calling > : new Field("c...

Re: Re: OutOfMemoryError tokenizing a boring text file

2007-09-03 Thread Chris Hostetter
: Setting writer.setMaxFieldLength(5000) (default is 10,000) : seems to eliminate the risk for an OutOfMemoryError, that's because it now gives up after parsing 5000 tokens. : To me, it appears that simply calling : new Field("content", new InputStreamReader(in, "ISO-8859-1")) : on a plain text...
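For context, a minimal sketch of the cap being discussed, written against the Lucene 2.x API that was current at the time of this thread; the index path, analyzer choice, file name, and class name are illustrative assumptions, not details taken from the messages.

import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CapFieldLength {
    public static void main(String[] args) throws Exception {
        // Lucene 2.x-era constructor: index path, analyzer, create-new-index.
        IndexWriter writer = new IndexWriter("/tmp/index",
                new StandardAnalyzer(), true);

        // Stop tokenizing each field after 5000 tokens instead of the
        // default 10,000. This bounds memory use, but note the trade-off
        // pointed out above: everything past the cap is silently dropped.
        writer.setMaxFieldLength(5000);

        Document doc = new Document();
        // Field(String, Reader): tokenized and indexed, but not stored.
        doc.add(new Field("content",
                new InputStreamReader(
                        new FileInputStream("big.txt"), "ISO-8859-1")));
        writer.addDocument(doc);
        writer.close();
    }
}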

Re: OutOfMemoryError tokenizing a boring text file

2007-09-03 Thread Per Lindberg
To: java-user@lucene.apache.org > Subject: Re: OutOfMemoryError tokenizing a boring text file > > I believe the problem is that the text value is not the only data > associated with a token; there is, for instance, the position offset. > Depending on your JVM, each instance reference consumes 64 bits or so...

Re: OutOfMemoryError tokenizing a boring text file

2007-09-01 Thread Karl Wettin
I believe the problem is that the text value is not the only data associated with a token; there is, for instance, the position offset. Depending on your JVM, each instance reference consumes 64 bits or so, so even if the text value is flyweighted by String.intern() there is a cost. I doubt that...
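A back-of-envelope version of that argument, applied to the 20 MB file described later in this thread; the 64-byte figure is a rough assumption for illustration, not a measurement.

public class TokenCostEstimate {
    public static void main(String[] args) {
        long fileBytes = 20L * 1024 * 1024;         // ~20 MB input file
        int tokenChars = 100;                       // one 100-char token per line
        long tokens = fileBytes / (tokenChars + 1); // +1 byte for the newline

        // Even with the token text flyweighted via String.intern(),
        // each Token still carries its own object header, reference,
        // and start/end offsets; assume roughly 64 bytes apiece.
        long perTokenBytes = 64;
        System.out.printf("~%d tokens, ~%d MB of per-token bookkeeping%n",
                tokens, tokens * perTokenBytes / (1024 * 1024));
        // Add the postings buffered during indexing on top of this,
        // and heap pressure grows well past the raw text size.
    }
}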

Re: OutOfMemoryError tokenizing a boring text file

2007-09-01 Thread Askar Zaidi
I have indexed around 100 MB of data with 512 MB for the JVM heap, so that gives you an idea. If every token is the same word in one file, shouldn't the tokenizer recognize that? Try using Luke; that helps solve lots of issues. - AZ On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > I can't...

Re: OutOfMemoryError tokenizing a boring text file

2007-09-01 Thread Erick Erickson
I can't answer the question of why the same token takes up memory, but I've indexed far more than 20 MB of data in a single document field, on the order of 150 MB. Of course, I allocated 1 GB or so to the JVM, so you might try that. Best, Erick On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote: ...
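Raising the heap as suggested here is a JVM launch flag; the class name Indexer below is hypothetical:

java -Xmx1024m Indexer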

OutOfMemoryError tokenizing a boring text file

2007-08-31 Thread Per Lindberg
I'm creating a tokenized "content" Field from a plain text file using an InputStreamReader and new Field("content", in); The text file is large, 20 MB, and contains zillions of lines, each with the same 100-character token. That causes an OutOfMemoryError. Given that all tokens are the *same*, ...
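A self-contained sketch of the setup this original message describes, again against the Lucene 2.x API of the period; the file-generation loop, paths, and analyzer are assumed scaffolding around the new Field("content", ...) call from the message.

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BoringFileRepro {
    public static void main(String[] args) throws Exception {
        // Write ~20 MB consisting of the same 100-character token
        // on every line, as the message describes.
        String token = repeat('x', 100);
        FileWriter out = new FileWriter("boring.txt");
        for (long written = 0; written < 20L * 1024 * 1024; written += 101) {
            out.write(token);
            out.write('\n');
        }
        out.close();

        // Index it the way the message describes: a single tokenized
        // "content" field fed from a Reader.
        IndexWriter writer = new IndexWriter("/tmp/index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("content",
                new InputStreamReader(
                        new FileInputStream("boring.txt"), "ISO-8859-1")));
        writer.addDocument(doc); // the OutOfMemoryError is reported here
        writer.close();
    }

    private static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) sb.append(c);
        return sb.toString();
    }
}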