Re: 2.3.2 Indexing Performance

2008-10-01 Thread Michael McCandless
Awesome! Thanks for following up.
Mike
Gary Moore wrote:
> Finally got back to this. The great bulk of the time is spent parsing/tokenizing. So, using 10 threads parsing/analyzing the 4.5M docs and feeding them to an IndexWriter took 106 minutes including a final optimization. The ind

Re: 2.3.2 Indexing Performance

2008-10-01 Thread Gary Moore
Finally got back to this. The great bulk of the time is spent parsing/tokenizing. So, using 10 threads parsing/analyzing the 4.5M docs and feeding them to an IndexWriter took 106 minutes including a final optimization. The index is 5.6 GB. I'm tempted to try multiple indexing threads but
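For readers following the thread, the setup described here (several threads parsing/analyzing, one shared writer) can be sketched roughly as below. This is only an illustration, not Gary's code: `DocSink` is a hypothetical stand-in for the shared IndexWriter (Lucene's IndexWriter can safely be called from multiple threads), and `tokenize` is a made-up placeholder for the MARC/XML parsing and analysis step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFeedSketch {
    /** Hypothetical stand-in for the shared writer; must be thread-safe. */
    public interface DocSink {
        void add(List<String> tokens);
    }

    /** Placeholder parse/tokenize step: lowercases and splits on whitespace. */
    public static List<String> tokenize(String rawRecord) {
        List<String> tokens = new ArrayList<>();
        for (String t : rawRecord.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Parses records on a fixed pool of worker threads; every worker feeds the same sink. */
    public static void feed(List<String> rawRecords, int threads, DocSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawRecords) {
            // Each task does the expensive parsing, then hands the result to the shared sink.
            pool.execute(() -> sink.add(tokenize(raw)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With a real index, the sink would wrap a single IndexWriter and build a Document from the parsed fields; the pool size (10 in the message above) is something to tune against the number of cores.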

Re: 2.3.2 Indexing Performance

2008-08-08 Thread Michael McCandless
Thanks for the data point! This is expected -- a lot of work went into increasing IndexWriter's throughput in 2.3. Actually, I'd expect even more speedup, if indeed Lucene is the bottleneck in your app. You could test how much time just creating/parsing & tokenizing the docs (from whatev
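One way to run the kind of test suggested here is to time the same loop twice: once with a sink that discards every parsed document, and once with the real indexing step. A minimal sketch follows; `timeRun`, `parse`, and `sink` are hypothetical names for illustration and are not part of Lucene.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class StageTimerSketch {
    /**
     * Parses every record, hands each result to the sink,
     * and returns the elapsed wall-clock time in nanoseconds.
     */
    public static <D> long timeRun(List<String> records,
                                   Function<String, D> parse,
                                   Consumer<D> sink) {
        long start = System.nanoTime();
        for (String record : records) {
            sink.accept(parse.apply(record));
        }
        return System.nanoTime() - start;
    }
}
```

Calling `timeRun` with a no-op sink (`doc -> {}`) gives roughly the parse/tokenize cost alone; calling it again with a sink that adds each document to the index gives the total, and the difference is an estimate of the time spent in indexing itself.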

2.3.2 Indexing Performance

2008-08-08 Thread Gary Moore
Parsing and indexing 4.5 million MARC/XML bibliographic records was requiring ~14 hrs. using 2.2. The same job using 2.3 takes ~5 hrs. on the same platform -- a quad-processor Sun V440 w/ 8 GB memory. I'm using the PerFieldAnalyzerWrapper (StandardAnalyzer and SnowballAnalyzer). I'm impress