RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?

Zhang, Lisheng Fri, 09 Aug 2013 09:57:26 -0700

Hi Mike,

Thanks very much for your insightful comments; I will try to test more.


Best regards, Lisheng

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, August 09, 2013 9:46 AM
To: Lucene Users
Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?


Hi, sorry, I don't have enough time to drill deeper here (run your
benchmark), but some quick ideas:

Only 8 documents is really a tiny index; try testing on many more documents?

Also, I would run more than just 2 rounds; better to run tens of rounds
and watch for the time per round to "stabilize" as HotSpot finishes
compiling the hot code paths...
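
A minimal sketch of that kind of measurement loop, in case it helps. It
assumes the per-round work can be wrapped in a Runnable; RoundTimer,
timeRounds and meanOfLast are hypothetical names (not part of Lucene or of
the test program discussed in this thread), and the workload in main is only
a placeholder for one index or search round.

```java
/** Runs a task for many rounds and records the wall-clock time of each round. */
final class RoundTimer {

    /** Returns the elapsed nanoseconds of each round, in execution order. */
    static long[] timeRounds(int rounds, Runnable task) {
        long[] nanos = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Mean of the last `tail` rounds, so the early warm-up rounds are ignored. */
    static double meanOfLast(long[] nanos, int tail) {
        if (nanos.length == 0 || tail <= 0) {
            throw new IllegalArgumentException("need at least one round");
        }
        int from = Math.max(0, nanos.length - tail);
        long sum = 0;
        for (int i = from; i < nanos.length; i++) {
            sum += nanos[i];
        }
        return (double) sum / (nanos.length - from);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace with one index or search round.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 100_000) {
                throw new IllegalStateException("unexpected length");
            }
        };

        long[] nanos = timeRounds(30, work);
        for (int i = 0; i < nanos.length; i++) {
            System.out.printf("round %d: %.3f ms%n", i + 1, nanos[i] / 1e6);
        }
        System.out.printf("mean of last 10 rounds: %.3f ms%n", meanOfLast(nanos, 10) / 1e6);
    }
}
```

Printing every round makes it easy to see where the times flatten out;
comparing only the mean of the later rounds keeps JIT warm-up out of the
3.6 vs 4.3 comparison.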

It's curious that your CheckIndex output is so similar yet the index
sizes are so different; I wonder whether that still holds if you build
a larger index.  It could be that the block compression in 4.x is less
space efficient when there are only a tiny number of documents.
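
To repeat the size comparison on a larger index, something like the sketch
below could be run against both index folders. It assumes the index directory
is flat (files directly in the folder, subdirectories ignored); IndexSize and
totalBytes are hypothetical names, not a Lucene API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sums on-disk file sizes so two index folders can be compared. */
final class IndexSize {

    /** Total size in bytes of the regular files directly inside dir. */
    static long totalBytes(Path dir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    total += Files.size(f);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is an index folder, e.g. the 3.6 and the 4.3 index.
        for (String arg : args) {
            long bytes = totalBytes(Paths.get(arg));
            System.out.printf("%s: %d bytes (%.3f MB)%n", arg, bytes, bytes / (1024.0 * 1024.0));
        }
    }
}
```

Usage would be along the lines of
java IndexSize /home/cvsupport/lzhang/index36 /home/cvsupport/lzhang/index43
after indexing a larger document set with each version.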

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 9, 2013 at 11:55 AM, Zhang, Lisheng
<lisheng.zh...@broadvision.com> wrote:
> Hi Mike,
>
> Any more comments on this issue?
>
> Thanks and best regards, Lisheng
>
> -----Original Message-----
> From: Zhang, Lisheng [mailto:lisheng.zh...@broadvision.com]
> Sent: Friday, August 02, 2013 7:55 AM
> To: java-user@lucene.apache.org
> Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
>
>
> I should have mentioned the commands I used to test:
>
> 1/ Index:
> java TestReal36 index -fileFolder /home/cvsupport/lzhang/files -optimize false
> java TestReal43 index -fileFolder /home/cvsupport/lzhang/files -optimize false
>
> The fileFolder contains 8 *.txt files with UTF-8 encoding. I also tried
> different values for the -luceneDir parameter (by default Lucene chose MMap).
>
> 2/ Search:
> java TestReal36 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
> java TestReal43 search -query boxer,snowball -limit 2 -searchRound 2 -reuseSearcher true
>
> I also tried different values for the -luceneDir parameter.
>
> Thanks and best regards, Lisheng
>
> -----Original Message-----
> From: Zhang, Lisheng
> Sent: Thursday, August 01, 2013 11:16 AM
> To: 'java-user@lucene.apache.org'
> Subject: RE: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
>
>
> Hi Mike,
>
> First, I really appreciate your help (for a non-commercial product)!
>
> 1/ I attached the source code of my test (you can see I used
>    StandardAnalyzer). Also, from the CheckIndex report below, the unique
>    terms are identical (token counts are slightly different). The stored
>    field is just the ID (1 - 8, one for each document). The indexed files
>    are 8 typical files in 8 different languages (the English one is
>    "Animal Farm" by George Orwell). I do not mind sending the text files
>    in case you are interested.
>
>    The query I issued is a trivial one (I did not even use a filter; for
>    example, querying "boxer" to get "Animal Farm").
>
> 2/ CheckIndex output:
>
> /// 361:
> root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr36# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index36
>
> NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @ /home/cvsupport/lzhang/index36
>
> Segments file=segments_1 numSegments=1 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+]
>   1 of 1: name=_0 docCount=8
>     compound=false
>     hasProx=true
>     numFiles=11
>     size (MB)=1.156
>     diagnostics = {os.version=3.2.0-49-virtual, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=flush, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [2 fields]
>     test: field norms.........OK [1 fields]
>     test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335024 tokens]
>     test: stored fields.......OK [8 total field count; avg 1 fields per doc]
>     test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
>
> No problems were detected with this index.
>
> /// 430
> root@ec2usevmsstgamq:/home/cvsupport/lzhang/code/solr43# java org.apache.lucene.index.CheckIndex /home/cvsupport/lzhang/index43
>
> NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @ /home/cvsupport/lzhang/index43
>
> Segments file=segments_1 numSegments=1 version=4.3 format=
>   1 of 1: name=_0 docCount=8
>     codec=Lucene42
>     compound=false
>     numFiles=13
>     size (MB)=1.742
>     diagnostics = {timestamp=1375311061843, os=Linux, os.version=3.2.0-49-virtual, source=flush, lucene.version=4.3.0 1477023 - simonw - 2013-04-29 14:55:14, os.arch=amd64, java.version=1.7.0_25, java.vendor=Oracle Corporation}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [2 fields]
>     test: field norms.........OK [1 fields]
>     test: terms, freq, prox...OK [38611 terms; 42557 terms/docs pairs; 335016 tokens]
>     test: stored fields.......OK [8 total field count; avg 1 fields per doc]
>     test: term vectors........OK [8 total vector count; avg 1 term/freq vector fields per doc]
>     test: docvalues...........OK [0 total doc count; 0 docvalues fields]
>
> No problems were detected with this index.
>
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Thursday, August 01, 2013 10:45 AM
> To: Lucene Users
> Subject: Re: lucene 4.3 seems to be much slower in indexing than lucene 3.6?
>
>
> On Wed, Jul 31, 2013 at 7:17 PM, Zhang, Lisheng
> <lisheng.zh...@broadvision.com> wrote:
>>
>> Hi Mike,
>>
>> I retested and results are the same:
>>
>> 1/ I did not use sort (so FieldCache should not enter picture?)
>
> No grouping or joining either (they will use FieldCache, if it's not
> against a doc values field).
>
> What sort of queries are you running?
>
>> 2/ I created the indexed data from scratch separately for 361 and 43
>>    based on the same text (text files), and I ran the test from the
>>    command line separately against each index folder, so it seems a
>>    pretty fair test.
>
> OK.
>
>> 3/ In each test I created the searcher from scratch (to measure creation
>>    time). I did not include JVM start time in either case. The
>>    tests ran on the same box.
>
> OK.
>
>> From the indexed data it seems that 43 generated a lot more data in
>> the folder; below I listed the (ls -ltr) result
>
> This is very odd: the 4.3 index is quite a bit larger than the 3.x
> index.  Are you certain the two indexed the same content in the same
> way?  Which analyzer are you using?  Maybe run CheckIndex against each
> index and post the output?
>
>> (always pass in LUCENE_43
>> version, so Lucene 42 codec should be used, why lucene41?).
>
> This is fine: the Lucene42 codec uses Lucene41PostingsFormat.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
