On Thu, Aug 1, 2013 at 6:40 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Hi all,
>
> OK, I really should have titled the post, "CheckIndex limit with large tvd
> files?"
>
> I started a new CheckIndex run about 1:00 pm on Tuesday and it seems to be
> stuck again looking at term vectors.
> I gave CheckIndex 32GB of memory, turned on GC logging, and echoed STDERR
> and STDOUT to a file.
>
> It seems stuck while testing term vectors, but maybe it just takes
> several days to test a term vector file that is 343GB.

Hi Tom: with a term vector file this large, it's not really 343GB: as far as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe more, since they are compressed too), because each document's term vector is like a little inverted index for that document. Each one is on your large full-text field, so it has its own term dictionary and "postings" (all those positions/offsets from your doc) to verify.

It's probably also the case that term vectors with huge numbers of unique terms aren't particularly optimized for your use case: for example, the seekCeil() operation looks like a linear scan to me, and CheckIndex tests term seeking if the TermsEnum supports ord (which it does). You could probably use jstack to confirm some of this.

Was highlighting with vectors horribly slow? :) It's off-topic, but maybe something like PostingsHighlighter would be a better fit for you, as it wouldn't duplicate the terms or positions, just encode some offsets into the .pay file.

Anyway, in my opinion we should think about a JIRA issue such that if you pass the -verbose flag to CheckIndex, it prints some status information about its progress. We could also think about trying to improve seekCeil for term vector term dictionaries...
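To make the seekCeil cost concrete, here's a toy model (plain Python, not Lucene code; the function names, term strings, and sizes are all made up for illustration): seeking to every term in a dictionary, the way a CheckIndex-style seek test does, costs O(n) per seek with a linear scan, so O(n^2) total over the term dictionary, versus O(log n) per seek with a binary search.

```python
import bisect

def seek_ceil_linear(terms, target):
    """Linear scan: roughly how a naive seekCeil() behaves.
    Returns the index of the first term >= target, or None if past the end."""
    for i, t in enumerate(terms):
        if t >= target:
            return i
    return None

def seek_ceil_binary(terms, target):
    """Binary search over the sorted term dictionary."""
    i = bisect.bisect_left(terms, target)
    return i if i < len(terms) else None

# A sorted "term dictionary" standing in for one document's term vector.
terms = sorted("term%06d" % n for n in range(100_000))

# Both strategies agree on the answer; only the cost per seek differs.
assert seek_ceil_linear(terms, "term050000") == seek_ceil_binary(terms, "term050000")
```

With ~100k unique terms per document, the difference between ~100k comparisons and ~17 per seek is exactly why a multi-day CheckIndex run on a huge tvd file is plausible.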