Re: Omitting TermVector info and index size

Erick Erickson Wed, 14 Feb 2007 10:29:50 -0800

OK, final note. I wish I knew what kind of drugs I was on when I first
thought that the sizes were so much smaller. Because they weren't. I got to
thinking that "gee, it's kind of weird that if you don't specify anything
for TermVector when creating a field, you get all this advanced stuff. If it
was me, I'd default it to the simple case and make the user explicitly
invoke the more complex one".


So I re-ran my tests and Lo! the default when creating a field is
TermVector.NO! Which means that my original surprise is totally off-base,
and I WAS indexing with TermVector.NO since I didn't specify anything. So
why I thought the index was so much smaller, I have no clue. I'll have to
plead temporary insanity.

That said, I'm glad I have a better understanding of what this is all about
anyway. But I also have to apologize for wasting everyone's time.

[EMAIL PROTECTED]

On 2/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


It's always embarrassing when the correct unit test takes, say, 3 minutes
to write and I've engaged in all this angst that I could have dispelled all
by myself (although it is nice to have confirmation from folks in the know).


The answer is that omitting term vectors has no influence on the behavior
of span queries with custom positionincrementgap (s). My example performs
exactly the same regardless.

Thanks again all.

Erick

On 2/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Thanks for that addition, it may well be important to me (as well as
> pointing up a weakness in my unit tests. Honest, I've been thinking about
> explicitly testing this. Really. I'll get around to it real soon now.
> Truly....). We store multiple entries in the same field, think of it as
> storing a list of content. For instance....
>
> this is subject entry one
> this is subject entry two
> this is subject entry three
>
> all stored by separate calls to doc.add("subject", .....);
>
> One consequence of this is that doing a span query, say "one this"~2
> should NOT hit. So I've been setting the positionincrementgap to 1,000
> between successive calls to doc.add.
>
> If 'm reading your e-mail correctly, this implies that I *do* need to
> use WITH_POSITIONS. Correct? I know, I know. Just write the cursed unit test
> and answer your own question alright already......
>
> And then Mark's e-mail leads me to think I *don't* need to use
> WITH_POSITIONS. So I *will* write the unit test today. Honest. Oh, and Mark,
> Erik said that, not Erick <G>....... He's undoubtedly much more handsome
> than me, so wouldn't appreciate being confused....
>
> Thanks again all
> Erick
>
> On 2/14/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:
> >
> > As Erik stated, you don't need term vectors to do spans, but I
> > thought I would add a bit on the difference between positions and
> > offsets.
> >
> > Positions are what is stored in Lucene internally (see
> > Token.getPositionIncrement () and TermPositions) and are usually just
> > consecutive integers (although they can be manipulated, as can the
> > offsets), whereas offsets are the character offsets from the indexed
> > text (see Token.startOffset() and Token.endOffset()).
> >
> > I haven't used the highlighter, but I think it does have options for
> > working with term vectors so that you don't have to re-analyze
> > everything, so there may be some performance benefit to storing them,
> > at the cost of disk space, like you said.
> >
> > On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> >
> > > I'm indexing books, with a significant amount of overhead in each
> > > document
> > > and a LOT of OCR data. I'm indexing over 20,000 books and the index
> > > size is
> > > 8G. So I decided to play around with not storing some of the
> > > termvector
> > > information and I'm shocked at how much smaller the index is. By
> > > storing all
> > > my fields with Field.TermVector.WITH_POSITIONS, my index is reduced
> > > by OVER
> > > 75%. It went from 485M to 100M for my sample of 1,000 documents.
> > Which
> > > implies my full index will be somewhere around 2G (I'll build the
> > > full index
> > > tonight and see).
> > >
> > > My reasoning was that I do need position information since I need
> > > to do Span
> > > queries,  but character information (WITH_OFFSETS) isn't necessary
> > > here/now.
> > > So I thought I'd make a small test to see if this was worth
> > > pursuing. If
> > > omitting offsets had only saved me 10%, for instance, I wouldn't
> > > pursue it
> > > very much. But 75+% is a savings well worth pursuing.
> > >
> > > All of my unit tests run, some of which include spans and
> > > highlighting.
> > > Whether they're sophisticated enough to catch some subtle issue I
> > > won't
> > > guarantee.
> > >
> > > I do NOT need to reconstruct the text, nor do I need to highlight
> > with
> > > what's in the index, I handle highlighting by putting my display
> > > data in a
> > > MemoryIndex and running a query against that. I play some fun games
> > to
> > > correlate my display and MemoryIndex info, but that's another
> > > story. Many
> > > thanks for the MemoryIndex contribution!!!
> > >
> > > With that as a background, I have two questions....
> > > 1> Am I going off a cliff here? I suppose this is really answered by
> >
> > > 2> what is the difference between WITH_POSITIONS and WITH_OFFSETS
> > > and YES
> > > and NO? I assume that WITH_POSITIONS is necessary for Span queries,
> > > for
> > > instance, which is all I really care about. While this has been
> > > discussed, I
> > > searched and didn't find a satisfactory answer (or at least an
> > > answer that I
> > > understood<G>).
> > >
> > > I looked at Grants PowerPoint presentation and I guess I'm really
> > > looking
> > > for confirmation of my interpretation that WITH_POSITIONS lets me
> > > do span
> > > queries and WITH_OFFSETS is irrelevant in my situation, one where I
> > > don't
> > > highlight and don't need to reconstruct the document......
> > >
> > > Many thanks
> > > Erick
> >
> > --------------------------
> > Grant Ingersoll
> > Center for Natural Language Processing
> > http://www.cnlp.org
> >
> > Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
> > LuceneFAQ
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>

Re: Omitting TermVector info and index size

Reply via email to