Re: Omitting TermVector info and index size

Grant Ingersoll Wed, 14 Feb 2007 06:36:55 -0800

As Erik stated, you don't need term vectors to do spans, but I thought I would add a bit on the difference between positions and offsets.

Positions are what is stored in Lucene internally (see Token.getPositionIncrement() and TermPositions) and are usually just consecutive integers (although they can be manipulated, as can the offsets), whereas offsets are the character offsets from the indexed text (see Token.startOffset() and Token.endOffset()).
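
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the Tok class and the whitespace tokenizer are hypothetical stand-ins, and it assumes every token advances the position by exactly one, which a real analyzer need not do. It only shows that a position counts tokens while offsets count characters.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionsVsOffsets {

    /** Hypothetical token holder for illustration; this is not a Lucene class. */
    static final class Tok {
        final String term;
        final int position;     // ordinal token position: 0, 1, 2, ...
        final int startOffset;  // index of the token's first character in the text
        final int endOffset;    // index one past the token's last character

        Tok(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace, recording a position and character offsets per token. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace: it moves the offsets forward but not the position.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            out.add(new Tok(text.substring(start, i), position++, start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Note the double space before "brown".
        for (Tok t : tokenize("the quick  brown fox")) {
            System.out.println(t.position + " " + t.term
                    + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

A span or phrase query only needs to know that "brown" is the token right after "quick", which is what the position gives. The offsets are what you would use to mark up the original text, as a highlighter does.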

I haven't used the highlighter, but I think it does have options for working with term vectors so that you don't have to re-analyze everything, so there may be some performance benefit to storing them, at the cost of disk space, like you said.


On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:

I'm indexing books, with a significant amount of overhead in each document and a LOT of OCR data. I'm indexing over 20,000 books and the index size is 8G. So I decided to play around with not storing some of the term vector information and I'm shocked at how much smaller the index is. By storing all my fields with Field.TermVector.WITH_POSITIONS, my index is reduced by OVER 75%. It went from 485M to 100M for my sample of 1,000 documents. Which implies my full index will be somewhere around 2G (I'll build the full index tonight and see).

My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought I'd make a small test to see if this was worth pursuing. If omitting offsets had only saved me 10%, for instance, I wouldn't pursue it very much. But 75+% is a savings well worth pursuing.

All of my unit tests run, some of which include spans and highlighting. Whether they're sophisticated enough to catch some subtle issue I won't guarantee.

I do NOT need to reconstruct the text, nor do I need to highlight with what's in the index; I handle highlighting by putting my display data in a MemoryIndex and running a query against that. I play some fun games to correlate my display and MemoryIndex info, but that's another story. Many thanks for the MemoryIndex contribution!!!

With that as a background, I have two questions....

1> Am I going off a cliff here? I suppose this is really answered by

2> what is the difference between WITH_POSITIONS and WITH_OFFSETS and YES and NO? I assume that WITH_POSITIONS is necessary for Span queries, for instance, which is all I really care about. While this has been discussed, I searched and didn't find a satisfactory answer (or at least an answer that I understood<G>).

I looked at Grant's PowerPoint presentation and I guess I'm really looking for confirmation of my interpretation that WITH_POSITIONS lets me do span queries and WITH_OFFSETS is irrelevant in my situation, one where I don't highlight and don't need to reconstruct the document......

Many thanks
Erick


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



