Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Robert Muir Fri, 02 Aug 2013 12:04:35 -0700

Thanks, this is what I expected. I opened an issue to remove seek-by-ord
support from this term vectors format.
On Aug 2, 2013 2:13 PM, "Tom Burton-West" <tburt...@umich.edu> wrote:


> Thanks Robert,
>
> Looks like it switches between seekCeil and seekExact:
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
>         at org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
>         at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
>         at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
>
>
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
>         at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
>         at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
>
> I don't think highlighting is too slow (at least for our small indexes),
> but I will take a look at the PostingsHighlighter.
>
>
> Tom
>
> >
> >
> > Hi Tom: with this large term vector file, it's not really 343GB but, as far
> > as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe more,
> > since they are compressed as well), because each document's term vector is
> > like a little inverted index for that document. Each one is on your large
> > full-text field, so it has its own term dictionary and "postings" (all those
> > positions/offsets from your doc) to verify. It's probably the case that term
> > vectors with huge numbers of unique terms aren't particularly optimized for
> > your use case either: for example, the seekCeil() operation looks like a
> > linear scan to me, and CheckIndex tests term seeking if the TermsEnum
> > supports ord (which it does). You could probably use jstack to confirm some
> > of this. Was highlighting with vectors horribly slow? :)
> >
> > It's off-topic, but maybe something like PostingsHighlighter would be a
> > better fit for you, as it wouldn't duplicate the terms or positions, just
> > encode some offsets into the .pay file.
> >
> > Anyway, in my opinion, we should think about a JIRA issue such that if you
> > pass the -verbose flag to CheckIndex, it prints some status information
> > about its progress. We could also think about trying to improve seekCeil
> > for term vector term dictionaries...
> >
>
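
To make the cost argument in the quoted message concrete, here is a minimal sketch of why seeking many terms through a linear-scan seekCeil gets expensive on a document with many unique terms. This is not Lucene code: the class SeekCeilSketch, both methods, and the term count are hypothetical, and a sorted in-memory list stands in for one document's term dictionary. It only assumes, as suggested above, that the seek is a linear scan, and compares that with a binary search over the same sorted terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a stand-in for one document's sorted term dictionary, not Lucene code. */
public class SeekCeilSketch {

    /** Linear scan: index of the first term >= target, or -1 if there is none. */
    public static int seekCeilLinear(List<String> sortedTerms, String target, long[] comparisons) {
        for (int i = 0; i < sortedTerms.size(); i++) {
            comparisons[0]++;
            if (sortedTerms.get(i).compareTo(target) >= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Binary search with the same contract as seekCeilLinear. */
    public static int seekCeilBinary(List<String> sortedTerms, String target, long[] comparisons) {
        int lo = 0;
        int hi = sortedTerms.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            comparisons[0]++;
            if (sortedTerms.get(mid).compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < sortedTerms.size() ? lo : -1;
    }

    public static void main(String[] args) {
        int n = 10_000; // hypothetical number of unique terms in one large document
        List<String> terms = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            terms.add("term" + (100_000 + i)); // fixed-width suffix, so the list is already sorted
        }
        long[] linear = {0};
        long[] binary = {0};
        // Seek every term once, as a checker that verifies term seeking might do.
        for (String t : terms) {
            int a = seekCeilLinear(terms, t, linear);
            int b = seekCeilBinary(terms, t, binary);
            if (a != b) {
                throw new AssertionError("mismatch for " + t);
            }
        }
        System.out.println("linear comparisons: " + linear[0]);
        System.out.println("binary comparisons: " + binary[0]);
    }
}
```

With a linear scan, seeking each of n terms once costs on the order of n^2 comparisons per document, while the binary search costs on the order of n log n. Whether the real term vectors reader can support anything like a binary search depends on its on-disk layout, which this sketch does not model.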
