I believe the problem is that the text value is not the only data
associated with a token; there is, for instance, the position offset.
Depending on your JVM, each object reference consumes 64 bits or so,
so even if the text value is flyweighted via String.intern(), there is
still a per-token cost.
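For illustration, here is a rough sketch of the per-token state in the
2007-era Token API (this is from memory, so treat the exact signatures
as assumptions):

    Token t = new Token("lucene", 0, 6); // termText, startOffset, endOffset
    t.setPositionIncrement(1);           // position data, stored per instance
    String text = t.termText();          // only the text can be shared via intern()

Even if every termText() string is the same interned instance, each
Token still carries its own offsets and position increment.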
I have indexed around 100M of data with 512M of JVM heap, so that gives
you an idea. If every token is the same word in one file, shouldn't the
tokenizer recognize that?
Try using Luke. It helps solve a lot of issues.
-
AZ
On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
I can't answer the question of why the same token
takes up memory, but I've indexed far more than
20M of data in a single document field, on the
order of 150M. Of course, I allocated 1G or so to the
JVM, so you might try that.
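Raising the heap is just the standard JVM flag, for example (MyIndexer
here is a placeholder for your own indexing class, not anything in
Lucene):

    java -Xmx1024m MyIndexer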
Best
Erick
Hi,
You should extend the SegmentReader (or IndexReader) class to implement the
following method:
public long termID(Term t) throws IOException {
    return tis.getPosition(t);
}
which will give you a means of getting the ID of a given term. This ID is
simply the position of that term within the term dictionary (the ".tis"
file).
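A sketch of how this could look (hedged: I believe the tis field is
package-private, so the subclass has to live in the
org.apache.lucene.index package, and TermIdSegmentReader is just a name
I made up):

    package org.apache.lucene.index;

    import java.io.IOException;

    public class TermIdSegmentReader extends SegmentReader {
        public long termID(Term t) throws IOException {
            return tis.getPosition(t); // position of t in the term dictionary
        }
    }

Then at the call site, something like:

    long id = ((TermIdSegmentReader) reader).termID(new Term("contents", "lucene"));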