Thanks, this is what I expected. I opened an issue to remove seek by Ord from this vectors format.

On Aug 2, 2013 2:13 PM, "Tom Burton-West" <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> Looks like it switches between seekCeil and seekExact:
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
> jstack.out3-   java.lang.Thread.State: RUNNABLE
> jstack.out3-     at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
> jstack.out3-     at org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
> jstack.out3-     at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
> jstack.out3-     at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
> jstack.out3-     at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
> jstack.out3:     at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
> jstack.out3-
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
>    java.lang.Thread.State: RUNNABLE
>      at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
>      at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
>      at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
>      at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
>      at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
>
> I don't think highlighting is too slow (at least for our small indexes),
> but will take a look at the postingshighlighter
>
> Tom
>
> > Hi Tom: with this large term vector file it's not really 343GB but, as far
> > as checkindex is concerned, it's treated as 1000 343MB indexes (maybe more,
> > they are compressed also): because each document's term vector is like a
> > little inverted index for the document. Each one is on your large full-text
> > field so it has its own term dictionary and "postings" (all those
> > positions/offsets from your doc) to verify.
> > It's probably the case that term vectors with huge numbers of unique terms
> > aren't particularly optimized for your use-case either: for example, the
> > seekCeil() operation looks like a linear scan to me, and checkindex tests
> > term seeking if the termsenum supports ord (which it does). You could
> > probably use jstack to confirm some of this. Was highlighting with vectors
> > horribly slow? :)
> >
> > It's off-topic, but maybe something like postingshighlighter would be a
> > better fit for you, as it wouldn't duplicate the terms or positions, just
> > encode some offsets into the .pay file.
> >
> > Anyway, in my opinion, we should think about a JIRA issue such that if you
> > pass the -verbose flag to checkindex it prints some status information
> > about its progress. We could also think about trying to improve seekCeil
> > for term vector term dictionaries...
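
For anyone following the linear-scan point in the quoted message: below is a rough, self-contained sketch of why seeking to every term of a large per-document term dictionary gets expensive when each seek scans from the start. This is not the Lucene code. The class SeekCostSketch, its methods, and the sorted-array "dictionary" are all hypothetical stand-ins, and it only counts string comparisons. The binary search is included purely as a point of comparison, not as a claim about how the vectors format could or should be changed.

```java
public class SeekCostSketch {

    // Hypothetical linear seekCeil: index of the first term >= target,
    // or -1 if there is none. Every call scans from the start.
    public static int linearSeekCeil(String[] sortedTerms, String target, long[] comparisons) {
        for (int i = 0; i < sortedTerms.length; i++) {
            comparisons[0]++;
            if (sortedTerms[i].compareTo(target) >= 0) {
                return i;
            }
        }
        return -1;
    }

    // Same contract as above, implemented with a binary search for comparison.
    public static int binarySeekCeil(String[] sortedTerms, String target, long[] comparisons) {
        int lo = 0;
        int hi = sortedTerms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            comparisons[0]++;
            if (sortedTerms[mid].compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < sortedTerms.length ? lo : -1;
    }

    // Builds n distinct terms; zero padding keeps them in sorted order.
    public static String[] makeTerms(int n) {
        String[] terms = new String[n];
        for (int i = 0; i < n; i++) {
            terms[i] = String.format("term%08d", i);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Stand-in for the unique terms of one large document.
        String[] terms = makeTerms(10_000);
        long[] linear = {0};
        long[] binary = {0};
        // Seek to every term once, loosely mimicking a checker that
        // verifies seeking for each term in the dictionary.
        for (String t : terms) {
            linearSeekCeil(terms, t, linear);
            binarySeekCeil(terms, t, binary);
        }
        System.out.println("linear comparisons: " + linear[0]);
        System.out.println("binary comparisons: " + binary[0]);
    }
}
```

With n unique terms, the linear version does n(n+1)/2 comparisons to seek to every term once, so the cost grows quadratically with the number of unique terms in a document, while the binary version stays around n log n.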