Understanding lucene indexes and disk I/O

Burton-West, Tom Mon, 12 Apr 2010 14:57:59 -0700

Hi all,

Please let me know if this should be posted instead to the Lucene java-dev list.


We have very large tis files (about 36 GB).  I have not been too concerned as I 
assumed that due to the indexing of the tis file by the tii file, only a small 
portion of the file needed to be read.  However, upon re-reading the Lucene 
Index File Formats document: 
http://lucene.apache.org/java/3_0_1/fileformats.html#Term%20Dictionary
I am now wondering whether most of the tis file may need to be read if a term 
is near the end of the file.

I'm trying to understand whether lucene has enough information stored in the 
*tii and *tis files to determine what byte offset in the prx file to seek to in 
order to get the freq or positions list for a particular term and whether a 
sequential read of the entire *tis file up to the term being sought is needed 
in order to decode ProxDeltas and determint the byte offset.

 What has me confused is that the Lucene Index File Formats document says:

"ProxDelta determines the position of this term's TermPositions within the .prx 
file. In particular, it is the difference between the position of this term's 
data in that file and the position of the previous term's data (or zero, for 
the first term in the file. For fields with omitTf true, this will be 0 since 
prox information is not stored."

I assumed that Lucene implements a binary search of the tii file, then reads 
the appropriate 128 (IndexDivisor) entries from the tis file and does a binary 
search on that.   (So that should be one disk seek when the searcher starts up 
to read in the entire tii file and then a second disk seek to load in the 
appropriate data for 128 terms from the tis file.  However, once a term's entry 
in the tis file is found, if only the difference between this term and the 
previous term's position in the prx file is stored, it seems that in order to 
get the actual byte offset of a term in the prx file, all of the previous 
term's ProxDelta's  starting at the first term in the file, would have to be 
read and added together.  If that is true then we are talking a sequential read 
of the entire tis file up to the current term.

Is this correct?   Can someone point me to the area of the code base where this 
is implemented ?   Am I missing something here?


Tom Burton-West

Understanding lucene indexes and disk I/O

Reply via email to