2008/10/23 Michael McCandless <[EMAIL PROTECTED]>: > > Mark Miller wrote: > >> Glen Newton wrote: >>> >>> 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: >>> >>>> It sounds like you might have some thread synchronization issues outside >>>> of >>>> Lucene. To simplify things a bit, you might try just using one >>>> IndexWriter. >>>> If I remember right, the IndexWriter is now pretty efficient, and there >>>> isn't much need to index to smaller indexes and then merge. There is a >>>> lot >>>> of juggling to get wrong with that approach. >>>> >>> >>> While I agree it is easier to have a single IndexWriter, if you have >>> multiple cores you will get significant speed-ups with multiple >>> IndexWriters, even with the impact of merging at the end. >>> #IndexWriters = # physical cores is an reasonable rule of thumb. >>> >>> General speed-up estimate: # cores * 0.6 - 0.8 over single IndexWriter >>> YMMV >>> >>> When I get around to it, I'll re-run my tests varying the # of >>> IndexWriters & post. >>> >>> -Glen >>> >> Hey Mr McCandless, whats up with that? Can IndexWriter be made to be as >> efficient as using Multiple Writers? Where do you suppose the hold up is? >> Number of threads doing merges? Sync contention? I hate the idea of multiple >> IndexWriter/Readers being more efficient than a single instance. In an ideal >> Lucene world, a single instance would hide the complexity and use the number >> of threads needed to match multiple instance performance. > > Honestly this surprises me: I would expect a single IndexWriter with > multiple threads to be as fast (or faster, considering the extra merge time > at the end) than multiple IndexWriters. > > IndexWriter's concurrency has improved alot lately, with > ConcurrentMergeScheduler. The only serious operation that is not concurrent > is flushing the RAM buffer as a new segment; but in a well tuned indexing > process (large RAM buffer) the time spent there should be quite small, > especially with a fast IO system. > > Actually, addIndexes is also not concurrent in that if multiple threads call > it, only one can run at once. But normally you would call it with all the > indices you want to add, and then the merging is concurrent. > > Glen, in your single IndexWriter test, is it possible there was accidental > thread contention during document preparation or analysis?
I don't think there is. I've been refining this for quite a while, and have done a lot of analysis and hand-checking of the threading stuff. I do use multiple threads for document creation: this is where much of the speed-up happens (at least in my case where I have a large indexed field for the full-text of an article: the parsing becomes a significant part of the process). > I do agree that we should strive to have enough concurrency in IndexWriter > and IndexReader so that you don't get any real benefit by using separate > instances. Eg in 2.4.0 you can now open read-only IndexReaders, and on Unix > you can use NIOFSDirectory, both of which should go a long ways towards > fixing IndexReader's concurrency issue. My original tests were in the Spring with 2.3.1. I am planning on doing the new tests with 2.4 for indexing, as well as re-doing my concurrent query tests[1] and concurrent multiple reader tests[2] using the features you describe. I am sure the results will be quite different... BTW the files I am indexing were originally PDFs, but were batch converted to text and stored compressed on the filesystem, so except for GUnzipping them there is no other overhead. [1]http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html [2]http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html -glen > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- - --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]