On Tue, Apr 13, 2010 at 11:55 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
> At some point maybe the File Formats Document could be updated to make it > clear that the tii has an entry similar to the IntexInterval'th tis entry but > instead of holding frq/prx deltas it holds absolute pointers. Is it worth > entering a JIRA issue? I would be happy to update the doc myself, but I'm > don't think I have enough of an in depth understanding. Well I need to redo this part of file format docs, due to the flex changes (LUCENE-2371)... I'll add a comment to remember to state that this means we scan at most 128 terms. > As you probably have guessed, I'm trying to understand the impact of the over > 2.4 billion unique terms in our indexes on performance > (https://issues.apache.org/jira/browse/LUCENE-2257). We suspect that a very > large percentage of these terms are due to dirty OCR, but have not yet found > a good way to eliminate a significant amount of dirty OCR. Fun problem ;) > I assume that these cause a few extra steps in the binary search of the tii > file in memory but we probably won't notice that performance impact since our > bottleneck is disk I/O for reading long postings lists for frequently > occurring terms. Yeah, and added RAM usage holding the tii in memory. In theory, one could make a flex codec that instead of indexing every Nth (128 by default) term, indexes in a non-regular manner, eg at any term > X in frequency (though nobody has built such a codec!). In theory this'd be good for cases like yours... because you don't need every 128 when many of them are very low count. Ie the added time to scan to a rare term is usually a non-issue since iterating that rare term's docs will be so fast. But added cost when scanning to a common term "hurts" because it'll take so long to walk that common term's docs. > Am I correct in assuming that even if we have a very large number of garbage > terms in our prx file, the overall size of the file does not significantly > affect the number of disk seeks or amount of data to be read since Lucene can > seek to the beginning of the postings for any particular term? Right, frq and prx. >>> I would love to get ahold of your terms dict :) I'd have a field day >>>testing Lucene against it... I'm very curious how the flex improvements >>>affect your usage. > > Sometime in the next month or so we will get our new test server and after I > get the backup of testing jobs under control, I'd love to do some testing > with flex and our data. I'd love to see the results of this :) EG basic stats like before/after size of the tis/tii files, RAM used on opening the reader, startup up time, search performance... Mike --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org