I'm trying to get a feel for the impact of changing the termIndexInterval from the default of 128 to 1024 (8 * 128). This reduces the size of the tii file by 1/8th but in the worst case requires doing a linear scan of 1024 terms instead of 128 in memory. I'm not so concerned about the performance impact of the in-memory scan, but I was trying to get an idea about how this affects disk I/O. i.e. assuming a term is not in the tii file, we need to load 1024 terms from the tis file instead of 128.
I looked at the output of a CheckIndex on one of our very large segments to get the number of terms in the segment (see below) and got about 2.7 billion terms. (We have lots of dirty OCR from 400 languages) . The tis file is about 24.7 GB. I divided the size of the tis file for that segment in bytes by the number of terms to get the average number of bytes/term: (24.7 * (10^9) bytes ) / (2.7 * (10^9) terms) = 9 bytes/term. This is the average size of a term entry in the tis file (assuming CheckIndex and ls outputs are correct). This seems too small. Looking at the Lucene File formats doc (excerpt below), if we assume that everything other than the Suffix of the term takes a VInt that only occupies 1 byte, we have 6 bytes for that data, which leaves only 3 bytes for the String that holds the Suffix. What am I missing here? Tom Burton-West ------------------------------------------------------------------------------------------------------- >From the Lucene File formats doc: TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> Term --> <PrefixLength, Suffix, FieldNum> Suffix --> String PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt 1 of 2: name=_2cj docCount=708,639 compound=false hasProx=true numFiles=9 size (MB)=393,395.313 diagnostics = {optimize=true, mergeFactor=9, os.version=2.6.18-238.1.1.el5, os=Linux, mergeDocStores=true, lu cene.version=3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10, source=merge, os.arch=amd64, java.version=1.6.0_20, java .vendor=Sun Microsystems Inc.} has deletions [delFileName=_2cj_2.del] test: open reader.........OK [24 deleted docs] test: fields..............OK [55 fields] test: field norms.........OK [17 fields] test: terms, freq, prox...OK [2,723,440,775 terms; 35740903735 terms/docs pairs; 154861967859 tokens] test: stored fields.......OK [11040443 total field count; avg 15.58 fields per doc] test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc] [xxx@shotz-1 index]$ ls -l _2cj.tis -rw-rw-r-- 1 tomcat dlps 24,775,378,328 Mar 12 17:16 _2cj.tis