Hi Mike,

Any more comments on this issue?

Thanks and best regards, Lisheng

-----Original Message-----
From: Zhang, Lisheng [mailto:lisheng.zh...@broadvision.com]
Sent: Friday, August 02, 2013 7:55 AM
To: java-user@lucene.apache.org
Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

I should have mentioned the commands I used to test:

1/ Index:

java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false

The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried
different values for the -luceneDir parameter (by default Lucene chose MMap).

2/ Search:

java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true

I also tried different values for the -luceneDir parameter.

Thanks and best regards, Lisheng

-----Original Message-----
From: Zhang, Lisheng
Sent: Thursday, August 01, 2013 11:16 AM
To: 'java-user@lucene.apache.org'
Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Hi Mike,

First, I really appreciate your help (for a non-commercial product)!!

1/ I attached the source code of my test (you can see I used StandardAnalyzer).
Also, from the CheckIndex report below, the unique terms are identical (token
counts are slightly different). The stored field is just the ID (1 - 8, one for
each document). The indexed files are 8 typical files in 8 different
languages (the English one is "Animal Farm" by George Orwell). I do not mind
sending the text files in case you are interested.

The query I issued is a trivial one (did not even use filter, like querying
"boxer" to get "Animal Farm")

2/ CheckIndex output:

/// 361:

root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index36

Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
  1 of 1: name=_0 docCount=8
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=1.156
    diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]

No problems were detected with this index.

/// 430

root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /home/cvsupport/lzhang/index43

Segments file=segments_1 numSegments=1 version=4.3 format=
  1 of 1: name=_0 docCount=8
    codec=Lucene42
    compound=false
    numFiles=13
    size (MB)=1.742
    diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: fields..............OK [2 fields]
    test: field norms.........OK [1 fields]
    test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
    test: stored fields.......OK [8 total field count; avg 1 fields per doc]
    test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 0 docvalues fields]

No problems were detected with this index.

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Thursday, August 01, 2013 10:45 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng <lisheng.zh...@broadvision.com> wrote:
>
> Hi Mike,
>
> I retested and results are the same:
>
> 1/ I did not use sort (so FieldCache should not enter picture?)

No grouping or joining either (they will use FieldCache, if it's not
against a doc values field). What sort of queries are you running?

> 2/ I created indexed data from scratch separately for 361 and 43
> based on same text (text files), and I ran test from command
> line separately against each index folder, so seems a pretty
> fair test.

OK.

> 3/ Each test I created searcher from scratch (to measure creation
> time).
> I did not include JVM start time in each case. The
> tests are in same box.

OK.

> From indexed data it seems that 43 generated a lot more data in
> folder, below I listed (ls -ltr) result

This is very odd: the 4.3 index is quite a bit larger than the 3.x
index. Are you certain the two indexed the same content in the same
way? Which analyzer are you using? Maybe run CheckIndex against each
index and post the output?

> (always pass in LUCENE_43
> version, so lucene 42 codec should be used, why lucene41?).

This is fine: the Lucene42 codec uses Lucene41PostingsFormat.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org