On Thu, Aug 1, 2013 at 6:40 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Hi all,
>
> OK, I really should have titled the post, "CheckIndex limit with large tvd
> files?"
>
> I started a new CheckIndex run about 1:00 pm on Tuesday and it seems to be
> stuck again looking at term vectors.
> I gave CheckIndex 32GB of memory, turned on GC logging, and echoed STDERR
> and STDOUT to a file.
>
> It seems stuck while testing term vectors, but maybe it just takes
> several days to test a term vector file that is 343GB.

Hi Tom: with a term vector file this large, it's not really 343GB: as far as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe more, since they are compressed too), because each document's term vector is like a little inverted index for that document. Each one is on your large full-text field, so it has its own term dictionary and "postings" (all those positions/offsets from your doc) to verify.

It's probably also the case that term vectors with huge numbers of unique terms aren't particularly optimized for your use case: for example, the seekCeil() operation looks like a linear scan to me, and CheckIndex tests term seeking if the TermsEnum supports ord (which it does). You could probably use jstack to confirm some of this.

Was highlighting with vectors horribly slow? :) It's off-topic, but maybe something like PostingsHighlighter would be a better fit for you, as it wouldn't duplicate the terms or positions, just encode some offsets into the .pay file.

Anyway, in my opinion we should think about a JIRA issue such that if you pass the -verbose flag to CheckIndex, it prints some status information about its progress. We could also think about trying to improve seekCeil for term vector term dictionaries...
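To make the seekCeil cost concrete, here's a toy model (plain Python, not Lucene code; the function names, term strings, and sizes are all made up for illustration): seeking to every term in a dictionary, the way a CheckIndex-style seek test does, costs O(n) per seek with a linear scan, so O(n^2) total over the term dictionary, versus O(log n) per seek with a binary search.

```python
import bisect

def seek_ceil_linear(terms, target):
    """Linear scan: roughly how a naive seekCeil() behaves.
    Returns the index of the first term >= target, or None if past the end."""
    for i, t in enumerate(terms):
        if t >= target:
            return i
    return None

def seek_ceil_binary(terms, target):
    """Binary search over the sorted term dictionary."""
    i = bisect.bisect_left(terms, target)
    return i if i < len(terms) else None

# A sorted "term dictionary" standing in for one document's term vector.
terms = sorted("term%06d" % n for n in range(100_000))

# Both strategies agree on the answer; only the cost per seek differs.
assert seek_ceil_linear(terms, "term050000") == seek_ceil_binary(terms, "term050000")
```

With ~100k unique terms per document, the difference between ~100k comparisons and ~17 per seek is exactly why a multi-day CheckIndex run on a huge tvd file is plausible.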