Your math is right -- looks like it really is ~9 bytes per term
(assuming no bugs in CheckIndex!).
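
For reference, the arithmetic can be rechecked with a few lines of Python. This is only a sketch: the two constants are copied from the ls and CheckIndex output quoted below, the function names are made up for illustration, and the "bytes between index entries" figure is a rough estimate that simply multiplies the average entry size by the termIndexInterval.

```python
# Figures copied from the ls and CheckIndex output quoted in this thread.
TIS_BYTES = 24_775_378_328  # size of _2cj.tis in bytes
NUM_TERMS = 2_723_440_775   # terms reported by CheckIndex


def avg_bytes_per_term(tis_bytes: int, num_terms: int) -> float:
    """Average size of one term entry in the tis file."""
    return tis_bytes / num_terms


def bytes_between_index_entries(interval: int, avg_entry_bytes: float) -> float:
    """Rough estimate (an assumption, not a measurement) of how many tis
    bytes sit between two indexed terms for a given termIndexInterval."""
    return interval * avg_entry_bytes


if __name__ == "__main__":
    avg = avg_bytes_per_term(TIS_BYTES, NUM_TERMS)
    print(f"average entry size: {avg:.2f} bytes/term")
    for interval in (128, 1024):
        scan = bytes_between_index_entries(interval, avg)
        print(f"termIndexInterval={interval}: ~{scan / 1024:.1f} KB per scan")
```

The average comes out at a little over 9 bytes per term, so the worst-case scan at an interval of 1024 covers roughly 9 KB of the tis file.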

How long did this CheckIndex take to run...?

On the file format, one correction: if the docFreq is < skipInterval
(default 16) then there is no skip data and we don't write the
SkipDelta.

The vast majority of your terms will have docFreq < 16, so for these
terms it's 6 bytes minimum (6 vInt/vLongs), then the character data
(in UTF8 bytes) for the suffix.  Terms w/ skip data would be 7 bytes
minimum, for the vInt/vLongs.
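
To make the 6 vs 7 byte minimum concrete, here is a small Python sketch of the size calculation. It assumes the usual 7-bits-per-byte variable-length encoding for vInt/vLong, and it assumes the six values are PrefixLength, the suffix length, FieldNum, DocFreq, FreqDelta and ProxDelta (that list is my reading of the format doc, not something checked against the writer code). The function names are hypothetical.

```python
SKIP_INTERVAL = 16  # default skipInterval mentioned above


def vint_len(value: int) -> int:
    """Bytes needed to store a non-negative integer with 7 payload bits per byte."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def entry_bytes(prefix_len: int, suffix: str, field_num: int, doc_freq: int,
                freq_delta: int, prox_delta: int, skip_delta: int = 0) -> int:
    """Estimated size of one tis entry under the assumptions stated above."""
    suffix_utf8 = suffix.encode("utf-8")
    size = (
        vint_len(prefix_len)
        + vint_len(len(suffix_utf8))  # length prefix of the suffix String
        + len(suffix_utf8)            # the suffix characters as UTF-8 bytes
        + vint_len(field_num)
        + vint_len(doc_freq)
        + vint_len(freq_delta)
        + vint_len(prox_delta)
    )
    # SkipDelta is only written when the term has skip data.
    if doc_freq >= SKIP_INTERVAL:
        size += vint_len(skip_delta)
    return size


if __name__ == "__main__":
    # A rare term sharing 5 chars with its predecessor and adding "abc".
    print(entry_bytes(5, "abc", 3, 1, 10, 20))
    # The same term with docFreq at the skip interval picks up a SkipDelta.
    print(entry_bytes(5, "abc", 3, 16, 10, 20, skip_delta=4))
```

With every integer small enough to fit in one byte, a 3 byte suffix gives a 9 byte entry, which matches the observed average.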

So I think it really does mean that "on average" your adjacent terms
differ by only a 3 byte suffix, which is interesting.  You could make a
small test, which enumerates all terms and prints the ones whose new
suffix (vs the prior term) is <= 3 bytes, to gain some insight.
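
The shape of such a test, sketched in Python over a plain sorted list of strings rather than a real index (against an index you would walk the terms enum instead). The function names and the sample terms are made up, field boundaries are ignored, and Python's sort order is only an approximation of the order Lucene writes terms in.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def short_suffix_terms(sorted_terms, max_suffix_bytes=3):
    """Return (term, suffix, suffix_bytes) for each term whose new suffix,
    relative to the previous term, takes at most max_suffix_bytes in UTF-8."""
    results = []
    prev = ""
    for term in sorted_terms:
        suffix = term[shared_prefix_len(prev, term):]
        n = len(suffix.encode("utf-8"))
        if n <= max_suffix_bytes:
            results.append((term, suffix, n))
        prev = term
    return results


if __name__ == "__main__":
    # Made-up sample resembling OCR variants of a few words.
    sample = sorted(["tbe", "thc", "the", "them", "theory", "tlie", "zeitgeist"])
    for term, suffix, n in short_suffix_terms(sample):
        print(f"{term!r}: suffix {suffix!r} ({n} bytes)")
```

Dirty OCR tends to produce long runs of near-duplicate terms, so a large share of short suffixes in the output would be consistent with the ~9 byte average.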

I'd really love to see your index, indexed on trunk ;)  The terms
index is much smaller than in 3.x!

Mike

http://blog.mikemccandless.com

On Mon, Mar 21, 2011 at 1:15 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> I'm trying to get a feel for the impact of changing the termIndexInterval 
> from the default of 128 to 1024 (8 * 128).  This reduces the size of the tii 
> file by 1/8th but in the worst case requires doing a linear scan of 1024 
> terms instead of 128 in memory.   I'm not so concerned about the performance 
> impact of the in-memory scan, but I was trying to get an idea about how this 
> affects disk I/O. i.e. assuming a term is not in the tii file, we need to  
> load 1024 terms from the tis file instead of 128.
>
> I looked at the output of a CheckIndex on one of our very large segments to 
> get the number of terms in the segment (see below) and got about 2.7 billion 
> terms. (We have lots of dirty OCR from 400 languages) .  The tis file is 
> about  24.7 GB. I divided the size of the tis file for that segment in bytes 
> by the number of terms to get the average number of bytes/term:
>
> (24.7 * (10^9) bytes ) / (2.7 * (10^9) terms) = 9 bytes/term.
>
> This is the average size of a term entry in the tis file (assuming CheckIndex 
> and ls outputs are correct).
> This seems too small.   Looking at the Lucene File formats doc (excerpt 
> below), if we assume that everything other than the Suffix of the term takes 
> a VInt that only occupies 1 byte, we have 6 bytes for that data, which leaves 
> only 3 bytes for the String that holds the Suffix.
>
> What am I missing here?
>
> Tom Burton-West
>
>
> -------------------------------------------------------------------------------------------------------
>
> From the Lucene File formats doc:
>
> TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
> Term --> <PrefixLength, Suffix, FieldNum>
> Suffix --> String
> PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta
> --> VInt
>
> 1 of 2: name=_2cj docCount=708,639
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=393,395.313
>    diagnostics = {optimize=true, mergeFactor=9,
> os.version=2.6.18-238.1.1.el5, os=Linux, mergeDocStores=true,
> lucene.version=3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10, source=merge,
> os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_2cj_2.del]
>    test: open reader.........OK [24 deleted docs]
>    test: fields..............OK [55 fields]
>    test: field norms.........OK [17 fields]
>    test: terms, freq, prox...OK [2,723,440,775 terms; 35740903735 terms/docs 
> pairs; 154861967859 tokens]
>    test: stored fields.......OK [11040443 total field count; avg 15.58 fields 
> per doc]
>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector 
> fields per doc]
>
> [xxx@shotz-1 index]$ ls -l _2cj.tis
> -rw-rw-r-- 1 tomcat dlps 24,775,378,328 Mar 12 17:16 _2cj.tis
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
