I can't answer why identical tokens should keep eating memory, but I've indexed far more than 20 MB of data in a single document field, on the order of 150 MB. Of course, I allocated about 1 GB to the JVM, so you might try that....
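For example, something like this (a minimal sketch, assuming a Sun/HotSpot JVM; MyIndexer and the classpath are placeholders, and the flag syntax can vary between JVMs):

    java -Xmx1024m -cp lucene-core.jar:. MyIndexer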
Best
Erick

On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:
>
> I'm creating a tokenized "content" Field from a plain text file
> using an InputStreamReader and new Field("content", in);
>
> The text file is large, 20 MB, and contains zillions of lines,
> each with the same 100-character token.
>
> That causes an OutOfMemoryError.
>
> Given that all tokens are the *same*,
> why should this cause an OutOfMemoryError?
> Shouldn't StandardAnalyzer just chug along
> and just note "ho hum, this token is the same"?
> That shouldn't take too much memory.
>
> Or have I missed something?
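For reference, a minimal sketch of the setup described in the question, assuming Lucene 2.x; the file name "big.txt" and the index path "index" are placeholders. The Field(String, Reader) constructor creates a tokenized, unstored field whose contents are streamed from the Reader at indexing time:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexBigFile {
        public static void main(String[] args) throws Exception {
            // "big.txt" stands in for the 20 MB text file.
            Reader in = new InputStreamReader(
                    new FileInputStream("big.txt"), "UTF-8");

            // Create a fresh index, analyzing text with StandardAnalyzer.
            IndexWriter writer =
                    new IndexWriter("index", new StandardAnalyzer(), true);

            Document doc = new Document();
            // Tokenized, not stored; the Reader is consumed during addDocument.
            doc.add(new Field("content", in));
            writer.addDocument(doc);
            writer.close();
        }
    }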