On Tue, Apr 13, 2010 at 11:55 AM, Burton-West, Tom <tburt...@umich.edu> wrote:

> At some point maybe the File Formats Document could be updated to make it 
> clear that the tii has an entry similar to the IntexInterval'th tis entry but 
> instead of holding frq/prx deltas it holds absolute pointers.  Is it worth 
> entering a JIRA issue?  I would be happy to update the doc myself, but I'm 
> don't think  I have enough of an in depth understanding.

Well I need to redo this part of file format docs, due to the flex
changes (LUCENE-2371)... I'll add a comment to remember to
state that this means we scan at most 128 terms.

> As you probably have guessed, I'm trying to understand the impact of the over 
> 2.4 billion unique terms in our indexes on performance 
> (https://issues.apache.org/jira/browse/LUCENE-2257).  We suspect that a very 
> large percentage of these terms are due to dirty OCR, but have not yet found 
> a good way to eliminate a significant amount of dirty OCR.

Fun problem ;)

> I assume that these cause a few extra steps in the binary search of the tii 
> file in memory but we probably won't notice that performance impact since our 
> bottleneck is disk I/O for reading long postings lists for frequently 
> occurring terms.

Yeah, and added RAM usage holding the tii in memory.

In theory, one could make a flex codec that instead of indexing every
Nth (128 by default) term, indexes in a non-regular manner, eg at any
term > X in frequency (though nobody has built such a codec!).  In
theory this'd be good for cases like yours... because you don't need
every 128 when many of them are very low count.  Ie the added time to
scan to a rare term is usually a non-issue since iterating that rare
term's docs will be so fast.  But added cost when scanning to a common
term "hurts" because it'll take so long to walk that common term's
docs.

> Am I correct in assuming that even if we have a very large number of garbage 
> terms in our prx file, the overall size of the file does not significantly 
> affect the number of disk seeks or amount of data to be read since Lucene can 
> seek to the beginning of the postings for any particular term?

Right, frq and prx.

>>> I would love to get ahold of your terms dict :)  I'd have a field day
>>>testing Lucene against it... I'm very curious how the flex improvements 
>>>affect your usage.
>
> Sometime in the next month or so we will get our new test server and after I 
> get the backup of testing jobs under control, I'd love to do some testing 
> with flex and our data.

I'd love to see the results of this :) EG basic stats like
before/after size of the tis/tii files, RAM used on opening the
reader, startup up time, search performance...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to