hey,

what I wonder in general is whether the benchmarks are comparable. The benchmark code has changed a lot since 2.4, so there might be additional fields and/or different settings for what gets indexed and how. Could you check with Luke whether the indexes have the same fields and whether the settings are the same or similar, and report back? I also wonder if it now uses update instead of add, i.e. buffers and applies deletes etc.

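One way to do the settings half of that check is to diff the explicit key=value lines of the two .alg files. A rough sketch in Python, not part of the benchmark package: `parse_alg_properties` and `diff_settings` are made-up helper names, the first excerpt is taken from the wikipedia-default.alg quoted further down, and the second is an assumed variant with only the analyzer swapped to ClassicAnalyzer. It only sees settings written in the file, so defaults that changed inside the benchmark code between versions will not show up.

```python
def parse_alg_properties(text):
    """Collect key=value settings from the header of a benchmark .alg file."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("{"):
            break  # the task sequence starts here; no more properties
        if "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def diff_settings(old, new):
    """Map each differing key to (old_value, new_value); None marks a missing key."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Excerpt of the .alg file quoted in this thread.
ALG_STANDARD = """
merge.factor=mrg:10:10:10
ram.flush.mb=flush:16:16:16
compound=true
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.stored=true
doc.term.vector=false
{ "Rounds"
"""

# Assumed variant for illustration: same file with the analyzer swapped.
ALG_CLASSIC = ALG_STANDARD.replace("StandardAnalyzer", "ClassicAnalyzer")

for key, (old, new) in diff_settings(
    parse_alg_properties(ALG_STANDARD), parse_alg_properties(ALG_CLASSIC)
).items():
    print(f"{key}: {old} -> {new}")
```

The field comparison still needs a look at the actual indexes (e.g. with Luke), since the .alg file does not list the fields the content source produces.
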
simon

On Mon, Dec 12, 2011 at 10:03 PM, Sean Tong <st...@jamasoftware.com> wrote:
> Thanks Simon for your response.
>
> I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:
>
> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
> [java] MAddDocs_200000 0 16.00 10 1 200000 715.76 279.42 48,828,144 128,057,344
> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 679.04 - - 294.53 - 68,321,424 - 85,721,088
> [java] MAddDocs_200000 2 16.00 10 1 200000 761.95 262.49 63,139,256 91,881,472
>
> The performance is slightly better than the one using StandardAnalyzer, but
> this is still much worse than the performance with 2.4.1.
>
> Sean
>
> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
> Sent: Monday, December 12, 2011 12:20 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>
> hey,
>
> can you try to use the ClassicAnalyzer instead of StandardAnalyzer in
> 3.5, since in 3.5 the StandardAnalyzer is a different implementation than in
> 2.9 and 2.4, or rerun the 2.4 benchmarks with a WhitespaceAnalyzer just for
> the comparison.
>
> simon
>
> On Mon, Dec 12, 2011 at 7:08 PM, Sean Tong <st...@jamasoftware.com> wrote:
>> Looks like the attachment for the algorithm is missing from last email. I
>> have pasted the text here. Thanks in advance for any help.
>>
>> #Start of the wikipedia-default.alg file
>>
>> merge.factor=mrg:10:10:10
>> max.field.length=2147483647
>> #max.buffered=buf:10:10:100:100
>> ram.flush.mb=flush:16:16:16
>>
>> compound=true
>>
>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>> directory=FSDirectory
>>
>> doc.stored=true
>> doc.tokenized=true
>> doc.term.vector=false
>> log.step=5000
>>
>> docs.file=temp/enwiki-20070527-pages-articles.xml
>>
>> content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
>>
>> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>>
>> # task at this depth or less would print when they start
>> task.max.depth.log=2
>>
>> log.queries=false
>> # -------------------------------------------------------------------------------------
>>
>> { "Rounds"
>>
>>     ResetSystemErase
>>
>>     { "Populate"
>>         CreateIndex
>>         { "MAddDocs" AddDoc > : 200000
>>         CloseIndex
>>     }
>>
>>     NewRound
>>
>> } : 3
>>
>> RepSumByName
>> RepSumByPrefRound MAddDocs
>>
>> #End of wikipedia-default.alg file
>>
>> Thanks,
>>
>> Sean
>>
>>
>> From: Sean Tong [mailto:st...@jamasoftware.com]
>> Sent: Sunday, December 11, 2011 11:54 PM
>> To: java-user@lucene.apache.org
>> Subject: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?
>>
>> Hi,
>>
>> We plan to upgrade the Lucene library in our application from 2.4.1 to
>> 3.5.0. I have been running the benchmark tests that come with Lucene. To my
>> surprise, I found that indexing in 3.5.0 is significantly slower than in
>> 2.4.1 for the Wikipedia data.
>>
>> Attached is the algorithm for the tests. The tests used default Lucene
>> settings for flush memory size and merge factor. 512M memory was used for
>> the tasks. The test machine is a 64-bit Windows 7 machine with Intel Core
>> i7.
>>
>> The command:
>> %ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task
>>
>> Here are the test results:
>>
>> Lucene 2.4.1
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 1,609.1 124.29 89,218,496 241,631,232
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 1,746.4 - - 114.52 - 102,365,864 - 241,762,304
>> [java] MAddDocs_200000 2 16.00 10 1 200000 1,566.8 127.65 69,428,144 174,194,688
>>
>>
>> Lucene 2.9.4
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 1,046.49 191.12 82,676,152 139,657,216
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - 1,165.35 - - 171.62 - 119,364,128 - 156,762,112
>> [java] MAddDocs_200000 2 16.00 10 1 200000 1,245.86 160.53 50,361,760 137,625,600
>>
>> Lucene 3.5.0
>>
>> [java] ------------> Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
>> [java] Operation round flush mrg runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
>> [java] MAddDocs_200000 0 16.00 10 1 200000 676.48 295.65 70,917,592 129,695,744
>> [java] MAddDocs_200000 - 1 16.00 10 - - 1 - - 200000 - - 626.13 - - 319.42 - 50,329,552 - 94,240,768
>> [java] MAddDocs_200000 2 16.00 10 1 200000 687.68 290.83 57,732,640 92,864,512
>>
>>
>> The indexing speed using 2.4.1 is 2.3x of the speed using 3.5.0. Did I
>> miss any settings or configurations?
>>
>> Thanks,
>>
>> Sean
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org