Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
OK, final note. I wish I knew what kind of drugs I was on when I first thought that the sizes were so much smaller. Because they weren't. I got to thinking that "gee, it's kind of weird that if you don't specify anything for TermVector when creating a field, you get all this advanced stuff. If it

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
It's always embarrassing when the correct unit test takes, say, 3 minutes to write and I've engaged in all this angst that I could have dispelled all by myself (although it is nice to have confirmation from folks in the know). The answer is that omitting term vectors has no influence on the behav

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
My apologies to Erik...and Erick...I am horrible with names. If I am reading Grant's email correctly, he also said you don't need to store the Term Vectors...just that if you did store them, you can use them with the highlighter so that you do not need to reanalyze the text...why exactly this

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
Thanks for that addition, it may well be important to me (as well as pointing up a weakness in my unit tests. Honest, I've been thinking about explicitly testing this. Really. I'll get around to it real soon now. Truly). We store multiple entries in the same field, think of it as storing a lis

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
As Erick said, Term positions are kept regardless of whether you store term vectors. The positional information is needed for phrase queries, span queries, etc. You certainly don't lose the ability to use phrase queries if you do not store term vectors. If you check out the Posting class in Doc
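
To make the point above concrete, here is a minimal sketch of phrase matching that uses nothing but per-term position lists. It is a toy in-memory index, not Lucene's implementation; the class name PositionalPhraseSketch and its add/phrase methods are hypothetical names chosen for illustration. No term vectors are stored anywhere in it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index: term -> (docId -> positions of that term in the doc).
// It only illustrates that positional postings are enough for phrase matching.
final class PositionalPhraseSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Split on whitespace and record each term's position (0, 1, 2, ...).
    public void add(int docId, String text) {
        String[] terms = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Return the ids of documents in which the terms occur at consecutive positions.
    public Set<Integer> phrase(String... terms) {
        Set<Integer> hits = new TreeSet<>();
        if (terms.length == 0) {
            return hits;
        }
        Map<Integer, List<Integer>> first =
                postings.getOrDefault(terms[0], Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int doc = entry.getKey();
            for (int start : entry.getValue()) {
                boolean matches = true;
                // Term i of the phrase must sit at position start + i in the same doc.
                for (int i = 1; i < terms.length && matches; i++) {
                    List<Integer> positions = postings
                            .getOrDefault(terms[i], Collections.emptyMap())
                            .get(doc);
                    matches = positions != null && positions.contains(start + i);
                }
                if (matches) {
                    hits.add(doc);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PositionalPhraseSketch index = new PositionalPhraseSketch();
        index.add(1, "the quick brown fox");
        index.add(2, "brown the quick fox");
        System.out.println(index.phrase("quick", "brown"));
    }
}
```

A span-style query would relax the "start + i" check to a distance window, but it would still read only the position lists.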

Re: Omitting TermVector info and index size

2007-02-14 Thread Grant Ingersoll
As Erik stated, you don't need term vectors to do spans, but I thought I would add a bit on the difference between positions and offsets. Positions are what is stored in Lucene internally (see Token.getPositionIncrement() and TermPositions) and are usually just consecutive integers (altho
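
A small sketch of the difference, assuming a plain whitespace tokenizer rather than any real Lucene Analyzer: the position is the token's ordinal in the stream, while the offsets are character indexes into the original text. The names PositionsVsOffsets, Tok and tokenize are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: each token carries a term position (ordinal)
// and a pair of character offsets into the source string.
final class PositionsVsOffsets {
    static final class Tok {
        public final String text;
        public final int position; // consecutive integer: 0, 1, 2, ...
        public final int start;    // index of the token's first character
        public final int end;      // index one past the token's last character

        Tok(String text, int position, int start, int end) {
            this.text = text;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    // Whitespace tokenizer that records both kinds of information.
    public static List<Tok> tokenize(String s) {
        List<Tok> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        int n = s.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            tokens.add(new Tok(s.substring(start, i), position++, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two spaces after "the": positions stay consecutive, offsets do not.
        for (Tok t : tokenize("the  quick brown fox")) {
            System.out.println(t.text + " position=" + t.position
                    + " offsets=" + t.start + "-" + t.end);
        }
    }
}
```

Phrase and span matching only need the position column; the offsets matter when you want to map a hit back onto the original characters, as a highlighter does.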

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
Erik Hatcher says no.

Erick

On 2/14/07, karl wettin <[EMAIL PROTECTED]> wrote:
On 14 Feb 2007, at 15:03, Erick Erickson wrote:
> My reasoning was that I do need position information since I need to do Span
> queries, but character information (WITH_OFFSETS) isn't necessary here/now.
> So I t

Re: Omitting TermVector info and index size

2007-02-14 Thread karl wettin
On 14 Feb 2007, at 15:03, Erick Erickson wrote:
> My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought I'd make a small test to see if this was worth pursuing.

If omitting offsets ha

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
You've made me a happy man. Thanks again. [EMAIL PROTECTED]

On 2/14/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> My reasoning was that I do need position information since I need to do Span
> queries, but character information (WITH_OF

Re: Omitting TermVector info and index size

2007-02-14 Thread Erik Hatcher
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now.
> 1> Am I going off a cliff here? I suppose this is really answered by
> 2> what is the d

Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
I'm indexing books, with a significant amount of overhead in each document and a LOT of OCR data. I'm indexing over 20,000 books and the index size is 8 GB. So I decided to play around with not storing some of the term vector information, and I'm shocked at how much smaller the index is. By storing al