Thanks Robert,

Looks like it switches between seekCeil and seekExact:

"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
        at org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)

"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)

I don't think highlighting is too slow (at least for our small indexes), but
I will take a look at the postingshighlighter.

Tom

> Hi Tom: with this large term vector file it's not really 343GB but, as far
> as checkindex is concerned, it's treated as 1000 343MB indexes (maybe more,
> they are compressed also): because each document's term vector is like a
> little inverted index for the document. Each one is on your large full-text
> field so it has its own term dictionary and "postings" (all those
> positions/offsets from your doc) to verify.
> It's probably the case that term vectors with huge numbers of unique terms
> aren't particularly optimized for your use-case either: for example the
> seekCeil() operation looks like a linear scan to me, and checkindex tests
> term seeking if the termsenum supports ord (which it does). You could
> probably use jstack to confirm some of this. Was highlighting with vectors
> horribly slow? :)
>
> It's off-topic but maybe something like postingshighlighter would be a
> better fit for you, as it wouldn't duplicate the terms or positions, just
> encode some offsets into the .pay file.
>
> Anyway, in my opinion, we should think about a JIRA issue such that if you
> pass the -verbose flag to checkindex it prints some status information
> about its progress. We could also think about trying to improve seekCeil
> for term vector term dictionaries...
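
To make the "linear scan" point concrete, here is a minimal sketch of the two ways a seekCeil over a sorted term list could work. This is not the Lucene code: the class SeekCeilSketch, both method names, and the use of a plain String[] are assumptions for illustration only, and String.compareTo ordering stands in for Lucene's byte-wise term ordering. The linear version touches up to every term per seek, so testing a seek for each of n unique terms costs on the order of n*n comparisons per document, which would be consistent with the slow checkindex run above.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not the CompressingTermVectorsReader code.
class SeekCeilSketch {

    // Linear scan: index of the first term >= target, or -1 if no such term exists.
    // Cost grows with the number of unique terms in the document.
    public static int linearSeekCeil(String[] sortedTerms, String target) {
        for (int i = 0; i < sortedTerms.length; i++) {
            if (sortedTerms[i].compareTo(target) >= 0) {
                return i;
            }
        }
        return -1;
    }

    // Binary search over the same sorted array: same result, logarithmic cost.
    // Only possible if terms are randomly addressable, which is an assumption here.
    public static int binarySeekCeil(String[] sortedTerms, String target) {
        int idx = Arrays.binarySearch(sortedTerms, target);
        if (idx < 0) {
            idx = -idx - 1; // convert "not found" result into the insertion point
        }
        return idx < sortedTerms.length ? idx : -1;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "grape"};
        System.out.println(linearSeekCeil(terms, "blueberry")); // prints 2 (cherry)
        System.out.println(binarySeekCeil(terms, "blueberry")); // prints 2 (cherry)
        System.out.println(binarySeekCeil(terms, "zebra"));     // prints -1 (past the end)
    }
}
```

Whether a binary search is actually feasible depends on how the compressed term vector format lays out its terms, which this sketch does not model.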