Hey, can you try using ClassicAnalyzer instead of StandardAnalyzer in 3.5? In 3.5 the StandardAnalyzer is a different implementation than the one in 2.9 and 2.4. Or rerun the 2.4 benchmarks with a WhitespaceAnalyzer, just for the comparison.
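Something like the following change to the analyzer line of your .alg file is what I have in mind. Treat it as a sketch: the fully qualified class names are my assumption about where these analyzers live in the 3.5 and 2.4 jars, so please check them against the jars you actually run.

```
# 3.5.0 run: use the pre-3.1 tokenizer behaviour instead of the new StandardAnalyzer
# (class name assumed; verify it exists in your lucene-core 3.5.0 jar)
analyzer=org.apache.lucene.analysis.standard.ClassicAnalyzer

# Alternative for the comparison run: a trivial analyzer, so that
# analysis cost mostly drops out of the measurement
# (class name assumed; verify it against the jar for the version under test)
#analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
```

Everything else in the .alg file can stay as it is, and the run command is the same one you already used (ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task). If the gap between versions shrinks a lot with either setting, the slowdown is mostly in analysis rather than in the indexer itself.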
simon

On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <st...@jamasoftware.com> wrote:
> Looks like the attachment for the algorithm is missing from my last email. I
> have pasted the text here. Thanks in advance for any help.
>
> #Start of the wikipedia-default.alg file
>
> merge.factor=mrg:10:10:10
> max.field.length=2147483647
> #max.buffered=buf:10:10:100:100
> ram.flush.mb=flush:16:16:16
>
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> log.step=5000
>
> docs.file=temp/enwiki-20070527-pages-articles.xml
>
> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=false
> # -------------------------------------------------------------------------------------
>
> { "Rounds"
>
>     ResetSystemErase
>
>     { "Populate"
>         CreateIndex
>         { "MAddDocs" AddDoc > : 200000
>         CloseIndex
>     }
>
>     NewRound
>
> } : 3
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
> #End of wikipedia-default.alg file
>
> Thanks,
>
> Sean
>
>
> From: Sean Tong [mailto:st...@jamasoftware.com]
> Sent: Sunday, December 11, 2011 11:54 PM
> To: java-user@lucene.apache.org
> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> Hi,
>
> We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0.
> I have been running the benchmark tests that come with Lucene. To my surprise,
> I found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the
> Wikipedia data.
>
> Attached is the algorithm for the tests. The tests used the default Lucene
> settings for flush memory size and merge factor. 512M memory was used for
> the tasks. The test machine is a 64-bit Windows 7 machine with Intel Core i7.
>
> The command:
> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>
> Here are the test results:
>
> Lucene 2.4.1
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000   1,609.1      124.29   89,218,496  241,631,232
> [java] MAddDocs_200000      1  16.00   10       1      200000   1,746.4      114.52  102,365,864  241,762,304
> [java] MAddDocs_200000      2  16.00   10       1      200000   1,566.8      127.65   69,428,144  174,194,688
>
> Lucene 2.9.4
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000  1,046.49      191.12   82,676,152  139,657,216
> [java] MAddDocs_200000      1  16.00   10       1      200000  1,165.35      171.62  119,364,128  156,762,112
> [java] MAddDocs_200000      2  16.00   10       1      200000  1,245.86      160.53   50,361,760  137,625,600
>
> Lucene 3.5.0
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation        round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
> [java] MAddDocs_200000      0  16.00   10       1      200000    676.48      295.65   70,917,592  129,695,744
> [java] MAddDocs_200000      1  16.00   10       1      200000    626.13      319.42   50,329,552   94,240,768
> [java] MAddDocs_200000      2  16.00   10       1      200000    687.68      290.83   57,732,640   92,864,512
>
>
> The indexing speed using 2.4.1 is 2.3x the speed using 3.5.0. Did I
> miss any settings or configurations?
>
> Thanks,
>
> Sean
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org