It looks like the algorithm attachment was missing from my last email, so I have pasted its text here. Thanks in advance for any help.
#Start of the wikipedia-default.alg file
merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000

docs.file=temp/enwiki-20070527-pages-articles.xml

content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource

query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 200000
        CloseIndex
    }

    NewRound

} : 3

RepSumByName
RepSumByPrefRound MAddDocs
#End of wikipedia-default.alg file

Thanks,
Sean

From: Sean Tong [mailto:st...@jamasoftware.com]
Sent: Sunday, December 11, 2011 11:54 PM
To: java-user@lucene.apache.org
Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

Hi,

We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I have been running the benchmark tests that come with Lucene. To my surprise, I found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the Wikipedia data.

Attached is the algorithm for the tests. The tests used the default Lucene settings for flush memory size and merge factor. The tasks ran with 512M of memory. The test machine is a 64-bit Windows 7 machine with an Intel Core i7.

The command:

%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task

Here are the test results:

Lucene 2.4.1

[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
[java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
[java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688

Lucene 2.9.4

[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
[java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
[java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600

Lucene 3.5.0

[java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
[java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
[java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
[java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
[java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512

Indexing with 2.4.1 is about 2.3x as fast as with 3.5.0. Did I miss any settings or configurations?

Thanks,
Sean
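
As a cross-check on the 2.3x figure, the rec/s values from the three reports can be averaged per version and compared. This is only a sketch: the numbers are copied by hand from the MAddDocs_200000 rows above, and the names REC_PER_SEC, mean, and speed_ratio are made up for illustration and are not part of the Lucene benchmark package.

```python
# rec/s values copied by hand from the MAddDocs_200000 rows of the three
# reports above (rounds 0, 1, 2). All names here are illustrative only.
REC_PER_SEC = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0": [676.48, 626.13, 687.68],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def speed_ratio(baseline, candidate, data=REC_PER_SEC):
    """Mean rec/s of `baseline` divided by mean rec/s of `candidate`."""
    return mean(data[baseline]) / mean(data[candidate])


if __name__ == "__main__":
    for version, rates in REC_PER_SEC.items():
        print(f"Lucene {version}: mean {mean(rates):.1f} rec/s")
    print(f"2.4.1 vs 3.5.0: {speed_ratio('2.4.1', '3.5.0'):.2f}x")
    print(f"2.4.1 vs 2.9.4: {speed_ratio('2.4.1', '2.9.4'):.2f}x")
```

On these numbers the averaged 2.4.1 to 3.5.0 ratio comes out near 2.5x, with per-round ratios between roughly 2.3x and 2.8x, so the 2.3x quoted above sits at the low end of the range.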