On Tue, Oct 14, 2014 at 1:29 AM, Trejkaz <trej...@trypticon.org> wrote:

> Bit of thread necromancy here, but I figured it was relevant because
> we get exactly the same error.

Wow, blast from the past ...

>> Is it possible you are indexing an absurdly enormous document...?
>
> We're seeing a case here where the document certainly could qualify as
> "absurdly enormous". The doc itself is 2GB in size and the
> tokenisation is per-character, not per-word, so the number of
> generated terms must be enormous. Probably enough to fill 2GB...
>
> So I'm wondering if there is more info somewhere on why this is (or
> was? We're still using 3.6.x) a limit and whether it can be detected
> up-front. A large amount of indexing time (~30 minutes) could be
> avoided if we could detect ahead of time that it was going to fail.

The limit is still there; it's because Lucene uses an int internally
to address its memory buffer, so that buffer tops out at
Integer.MAX_VALUE (2^31 - 1) bytes, i.e. just under 2 GB.

It's probably easiest to set a cap on the maximum document size you
will index?  Or use LimitTokenCountFilter (available in newer
releases) to index only the first N tokens...
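
Something along these lines, roughly (untested sketch, so double-check
the names against your version: LimitTokenCountAnalyzer is the
convenience wrapper around LimitTokenCountFilter, the byte and token
caps below are arbitrary, and "yourPerCharAnalyzer" stands in for
whatever analyzer you are actually using):

    // In 3.6.x LimitTokenCountAnalyzer is in org.apache.lucene.analysis;
    // in 4.x+ it (and LimitTokenCountFilter, which it wraps) live in
    // org.apache.lucene.analysis.miscellaneous in the analyzers-common jar.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LimitTokenCountAnalyzer;

    public class IndexBigDocs {
      // Arbitrary caps, for illustration only.
      private static final long MAX_DOC_BYTES = 512L * 1024 * 1024;  // skip anything over ~512 MB
      private static final int MAX_TOKENS_PER_FIELD = 10000000;      // or truncate at 10M tokens

      static Analyzer limit(Analyzer yourPerCharAnalyzer) {
        // Drops every token after the first MAX_TOKENS_PER_FIELD, so one
        // huge doc can no longer blow past the 2 GB buffer limit.
        return new LimitTokenCountAnalyzer(yourPerCharAnalyzer, MAX_TOKENS_PER_FIELD);
      }

      static boolean smallEnough(java.io.File f) {
        // Up-front check: reject the document before spending ~30 minutes
        // tokenizing it.
        return f.length() <= MAX_DOC_BYTES;
      }
    }

The filter simply stops the token stream after N tokens, so anything
past the cap silently doesn't get indexed; whether that's acceptable
depends on what you need from those huge docs.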

Mike McCandless

http://blog.mikemccandless.com
