Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Tom Burton-West Tue, 30 Jul 2013 08:10:07 -0700

Thanks Mike,

Billion, not trillion. Doh!


Wasn't thinking it through when I titled the e-mail. The total number of
tokens shouldn't be unusual compared to our other indexes: whether we
index pages or whole docs, the number of tokens shouldn't change
significantly. The main difference between this and our other indexes is
the number of documents. Our regular indexes have maybe 800,000 docs,
whereas these have about 82 million.
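
A quick sanity check on these magnitudes, as a sketch only: the counts are
copied from the CheckIndex output quoted below, and the class and method
names are made up for illustration. It shows the token count in billions and
the average tokens per document (roughly 1,300). The counts are held as long
because the unique-term count alone is larger than Integer.MAX_VALUE.

```java
public class IndexCounts {
    // Counts copied from the quoted CheckIndex output for segment _bch.
    static final long DOCS = 82946896L;
    static final long UNIQUE_TERMS = 2442919802L;
    static final long TOKENS = 109976572432L;

    // Average number of tokens per document in the segment.
    static double tokensPerDoc() {
        return (double) TOKENS / DOCS;
    }

    // True if the value can be stored in a Java int without overflow.
    static boolean fitsInInt(long n) {
        return n >= Integer.MIN_VALUE && n <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.printf("tokens, in billions: %.1f%n", TOKENS / 1e9);
        System.out.printf("avg tokens per doc:  %.0f%n", tokensPerDoc());
        System.out.println("unique terms fit in int: " + fitsInInt(UNIQUE_TERMS));
    }
}
```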

I'm not sure what is going on, but I'm guessing that the CheckIndex program
has been caught in some GC loop for the last few days. I didn't start it
up with any GC logging or hooks to attach jconsole. I'm going to kill it
and maybe try again with more memory and GC logging turned on.
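
For a stock command-line run, the simplest route is probably the HotSpot
flags (-Xmx for a larger heap, -verbose:gc and -Xloggc:<file> for GC
logging). As a rough in-process alternative, the sketch below reads GC
counters through the standard java.lang.management API. It assumes the check
is launched from a small wrapper of your own in the same JVM; GcSnapshot and
its methods are hypothetical names, not part of Lucene.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    // Total number of collections across all collectors in this JVM.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // One-line summary of GC activity and heap usage.
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        return "gc count=" + totalGcCount()
                + ", gc time ms=" + totalGcTimeMillis()
                + ", heap used MB=" + usedMb + " of max " + maxMb;
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Printing report() from a daemon thread every minute or so would show whether
GC time keeps climbing while the heap stays near its maximum, which is the
pattern a GC loop would produce.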

Tom


On Tue, Jul 30, 2013 at 8:41 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> I think that's ~ 110 billion, not trillion, tokens :)
>
> Are you certain you don't have any term vectors?
>
> Even if your index has no term vectors, CheckIndex goes through all
> docIDs trying to load them, but that ought to be very fast, and then
> you should see "test: doc values..." after that.
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Jul 29, 2013 at 4:30 PM, Tom Burton-West <tburt...@umich.edu>
> wrote:
> > We have very large indexes, almost a terabyte for a single index, and
> > normally it takes overnight to run a CheckIndex. I started a CheckIndex
> > on Friday and today (Monday) it seems to be stuck testing vectors, although
> > we haven't got vectors turned on. (See below)
> > The output file was last written Jul 27 02:28.
> > Note that in this 750 GB segment we have about 83 million docs with about
> > 2.4 billion unique terms and about 110 trillion tokens.
> >
> > Have we hit a new CheckIndex limit?
> >
> >
> > Tom
> >
> > -----------------------
> >
> >
> > Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
> >
> > Segments file=segments_e numSegments=2 version=4.2.1 format= userData={commitTimeMSec=1374712392103}
> >   1 of 2: name=_bch docCount=82946896
> >     codec=Lucene42
> >     compound=false
> >     numFiles=12
> >     size (MB)=752,005.689
> >     diagnostics = {timestamp=1374657630506, os=Linux,
> > os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
> > lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> > mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >     no deletions
> >     test: open reader.........OK
> >     test: fields..............OK [12 fields]
> >     test: field norms.........OK [3 fields]
> >     test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs pairs; 109976572432 tokens]
> >     test: stored fields.......OK [960417844 total field count; avg 11.579 fields per doc]
> >     test: term vectors........
> > ~
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
