Finally got back to this. The great bulk of the time is spent
parsing/tokenizing. So, using 10 threads parsing/analyzing the 4.5M
docs and feeding them to an IndexWriter took 106 minutes including a
final optimization. The index is 5.6 GB. I'm tempted to try multiple
indexing threads but my guess is it won't buy that much since the async
writer more than kept up with the thread queue.
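For anyone wanting to try the same layout, here is a rough sketch of the shape described above: several parser threads feeding one writer thread through a bounded queue. It is not the actual code from this job. `ParsePipeline`, `run`, `parse` and `sink` are made-up names, and `sink` only stands in for whatever hands the parsed document to the IndexWriter.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: N parser threads, one writer thread, one bounded queue.
final class ParsePipeline {

    /**
     * Parses every raw record on parserThreads threads and hands each parsed
     * document to sink from a single writer thread. Returns the number of
     * documents the writer consumed. parse must not return null.
     */
    static <R, D> int run(List<R> rawRecords, int parserThreads, int queueCapacity,
                          Function<R, D> parse, Consumer<D> sink) {
        BlockingQueue<D> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger nextIndex = new AtomicInteger(0);
        AtomicInteger written = new AtomicInteger(0);
        AtomicBoolean parsersDone = new AtomicBoolean(false);

        // Single writer: drains the queue until the parsers are done and it is empty.
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    D doc = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (doc != null) {
                        sink.accept(doc);
                        written.incrementAndGet();
                    } else if (parsersDone.get() && queue.isEmpty()) {
                        break;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // Parsers: each thread claims the next unparsed record by index.
        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        for (int t = 0; t < parserThreads; t++) {
            parsers.execute(() -> {
                try {
                    int i;
                    while ((i = nextIndex.getAndIncrement()) < rawRecords.size()) {
                        // put() blocks when the queue is full, i.e. when the writer falls behind.
                        queue.put(parse.apply(rawRecords.get(i)));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        parsers.shutdown();

        try {
            parsers.awaitTermination(1, TimeUnit.DAYS);
            parsersDone.set(true);
            writer.join();
        } catch (InterruptedException e) {
            parsers.shutdownNow();
            writer.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pipeline", e);
        }
        return written.get();
    }
}
```

If the queue sits near empty for most of the run, the writer is keeping up and the parser threads are the limit. If it sits full, the writer side is the one to look at.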
Now, I'm even more impressed with 2.3!
-Gary
Michael McCandless wrote:
Thanks for the data point!
This is expected -- a lot of work went into increasing IndexWriter's
throughput in 2.3.
Actually, I'd expect even more speedup, if indeed Lucene is the
bottleneck in your app. You could test how much time just
creating/parsing & tokenizing the docs (from whatever is holding them)
takes, to see. Also you might eke out more performance by following the
suggestions here:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
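One rough way to run the timing test suggested above is to time the two stages separately in a single thread. This is only a sketch with made-up names (`StageTimer`, `time`, `parseShare`); `parse` stands in for your own record parsing and `index` for the call that adds the document to the index, and neither is a Lucene API.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: split wall-clock time between parsing and indexing.
final class StageTimer {

    /** Returns {parseNanos, indexNanos} accumulated over all records. */
    static <R, D> long[] time(List<R> rawRecords, Function<R, D> parse, Consumer<D> index) {
        long parseNanos = 0;
        long indexNanos = 0;
        for (R raw : rawRecords) {
            long t0 = System.nanoTime();
            D doc = parse.apply(raw);   // stage 1: build the document from the raw record
            long t1 = System.nanoTime();
            index.accept(doc);          // stage 2: hand the document to the indexer
            long t2 = System.nanoTime();
            parseNanos += t1 - t0;
            indexNanos += t2 - t1;
        }
        return new long[] {parseNanos, indexNanos};
    }

    /** Fraction of the measured time spent in the parse stage, in [0, 1]. */
    static double parseShare(long[] nanos) {
        long total = nanos[0] + nanos[1];
        return total == 0 ? 0.0 : (double) nanos[0] / total;
    }
}
```

Note that analysis (tokenizing) happens inside the indexing call, so to separate tokenizing from the rest you would have to time the analyzer on its own as well. A sample of a few thousand records is probably enough to see which side dominates.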
Since you've got 4 CPUs and lots of RAM you should definitely use
multiple indexing threads with a large RAM buffer.
Mike
Gary Moore wrote:
Parsing and indexing 4.5 million MARC/XML bibliographic records was
requiring ~14 hrs. using 2.2. The same job using 2.3 takes ~5 hrs.
on the same platform -- a quad processor Sun V440 w/8GB memory.
I'm using the PerFieldAnalyzerWrapper (StandardAnalyzer and
SnowballAnalyzer).
I'm impressed! Is this typical?