Same experience here as Tom. Disk I/O becomes bottleneck with large indexes (or multiple shards per server) with less memory. Frequent updates to indexes can make the I/O bottleneck worse.
Peter On Mon, Feb 15, 2010 at 2:17 PM, Tom Burton-West <tburtonw...@gmail.com>wrote: > > Hi Chris, > > In our experience with large indexes (about 200-300GB) , we found most of > our bottlenecks involved disk I/O. We found that if our experimental > indexes were too small, that much of the index could fit in cache, and so > our test results were not applicable to our larger indexes. On the other > hand, once we started building our test indexes so they were significantly > larger than the amount of memory available for OS disk caching, we could > see > results that extrapolated out to the large index. > > Tom Burton-West > www.hathitrust.org > > > ryguasu wrote: > > > > I'd like to try some experiments to see if I can improve search > > performance by changing analysis (e.g. adding/removing word bigrams or > > commongrams), or by changing how I map my source records into Lucene > > documents. The problem is that my index currently is about 1TB in size > > and takes about 2-3 weeks to build, so if I have to rebuild the entire > > index in order to test each potential improvement, then I'm going to > > be waiting around a lot. > > > > One option is to test potential performance improvements by building > > indexes not for the full dataset, but rather for, say, a 1% sample of > > the full dataset. (That is, I'll just index 1% of the source records.) > > I would build one small control index, and then n small test indexes, > > one for each intervention I wish to try. The hope would be that, if an > > indexing intervention significantly improves performance for the small > > indexes, then it would also significantly improve performance of the > > full dataset. (Similarly, you'd hope that if an intervention *didn't* > > significantly improve performance on the small indexes, then it would > > *not* significantly improve performance of the full dataset.) This > > would allow me to quickly accept and reject interventions (as least > > provisionally), and only try applying the most obviously promising > > ones to the full dataset. > > > > Any thoughts on how naive this is? Does it sound more like a way to > > save time, or like a way to waste time misleading myself? > > > > Cheers, > > Chris > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > -- > View this message in context: > http://old.nabble.com/Can-you-use-reduced-sized-test-indexes-to-predict-performance-gains--for-a-larger-index--tp27571524p27598602.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >