Lucene performance bottlenecks

Andrzej Bialecki Fri, 02 Dec 2005 03:54:08 -0800

Hi,

I'm doing some performance profiling of a Nutch installation, working with relatively large individual indexes (10 mln docs), and I'm puzzled with the results.


Here's the listing of the index:
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f0
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f1
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f2
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f3
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f4
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f5
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f6
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f7
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f8
-rw-r--r--  1 andrzej andrzej  2494445020 Dec  2 04:58 _0.fdt
-rw-r--r--  1 andrzej andrzej    78424800 Dec  2 04:58 _0.fdx
-rw-r--r--  1 andrzej andrzej          92 Dec  2 04:55 _0.fnm
-rw-r--r--  1 andrzej andrzej  7436259508 Dec  2 05:24 _0.frq
-rw-r--r--  1 andrzej andrzej 12885589796 Dec  2 05:24 _0.prx
-rw-r--r--  1 andrzej andrzej     3483642 Dec  2 05:24 _0.tii
-rw-r--r--  1 andrzej andrzej   280376933 Dec  2 05:24 _0.tis
-rw-r--r--  1 andrzej andrzej           4 Dec  2 05:25 deletable
-rw-r--r--  1 andrzej andrzej          27 Dec  2 05:25 segments
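
As a rough cross-check of the document count against the listing above, here is a minimal sketch. It assumes the Lucene 1.x file formats, where each norms file (.f0 to .f8) stores one byte per document and .fdx stores one 8-byte pointer per document. The class and method names are made up for illustration.

```java
public class IndexSizeSketch {
    // Assumption: a norms file holds exactly one byte per document.
    static long docsFromNorms(long normsFileBytes) {
        return normsFileBytes;
    }

    // Assumption: the stored-fields index holds one 8-byte pointer per document.
    static long docsFromFdx(long fdxFileBytes) {
        return fdxFileBytes / 8;
    }

    public static void main(String[] args) {
        System.out.println(docsFromNorms(9803100L)); // size of _0.f0
        System.out.println(docsFromFdx(78424800L));  // size of _0.fdx
    }
}
```

Under those assumptions both files point to the same count, 9,803,100 documents, which matches the "10 mln docs" figure.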


I run it on an AMD Opteron 246, 2 GHz, 4 GB RAM; java -version says:

Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)

I run it with a heap of 1.5-2.5 GB, which doesn't make any difference (see below). I'm using the latest SVN code (from yesterday) + performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.

The performance is less than impressive, response times being more than 1 sec. Nutch produces complex queries for phrases, so the user query "term1 term2" gets rewritten like this:

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0
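
The shape of that rewrite can be sketched as a plain string builder. This is only an illustration of the pattern visible in the query above, not Nutch's actual query filter code. The class name, the method names and the per-field tables are assumptions read off the example (boost 1.0 is simply omitted, and only anchor uses a slop of 4).

```java
public class NutchQuerySketch {
    // Per-field settings read off the example query; hypothetical tables, not Nutch config.
    static final String[] FIELDS = {"url", "anchor", "content", "title", "host"};
    static final float[] BOOSTS = {4.0f, 2.0f, 1.0f, 1.5f, 2.0f};
    static final int[] SLOPS = {Integer.MAX_VALUE, 4, Integer.MAX_VALUE,
                                Integer.MAX_VALUE, Integer.MAX_VALUE};

    // A boost of 1.0 is the default and is not printed.
    static String boost(float b) {
        return b == 1.0f ? "" : "^" + b;
    }

    static String rewrite(String... terms) {
        StringBuilder sb = new StringBuilder();
        // One required clause per term, matching the term in any field.
        for (String term : terms) {
            sb.append("+(");
            for (int i = 0; i < FIELDS.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(FIELDS[i]).append(':').append(term).append(boost(BOOSTS[i]));
            }
            sb.append(") ");
        }
        // One optional sloppy phrase clause per field.
        String phrase = String.join(" ", terms);
        for (int i = 0; i < FIELDS.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(FIELDS[i]).append(":\"").append(phrase).append("\"~")
              .append(SLOPS[i]).append(boost(BOOSTS[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("term1", "term2"));
    }
}
```

For two terms this yields 10 term clauses plus 5 sloppy phrase clauses, so a single user query drives many scorers over the same posting lists.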

For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up).

Initially I thought the process was I/O or heap/GC bound, this is a large index after all, but the profiler shows it's purely CPU bound. I tracked the bottleneck to the scorers (see my previous email on this), but also to IndexInput.readVInt. What's even more curious, most of the heap is unused - I had the impression that Lucene tries to read as much of the index as it can into memory in order to speed up the access, but apparently that's not the case. The heap consumption was always in the order of 100-200MB, no matter how large a heap I set (and I tried values between 1-4GB).
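
For readers unfamiliar with readVInt: it decodes Lucene's variable-length integer encoding, where each byte carries 7 bits of the value, low-order bits first, and the high bit marks that another byte follows. The sketch below shows that decoding loop. It reads from a java.nio.ByteBuffer instead of Lucene's IndexInput, and the class name and the writeVInt helper are assumptions added for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class VIntSketch {
    // Decode one variable-length int: 7 payload bits per byte, low-order first.
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        // The high bit set means another byte follows.
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Matching encoder, included only to produce test input.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = writeVInt(300);
        System.out.println(encoded.length);                     // 2
        System.out.println(readVInt(ByteBuffer.wrap(encoded))); // 300
    }
}
```

A loop like this runs at least once for every integer a scorer reads from the posting data, so a term matching more than 10% of roughly 10 million documents implies over a million such calls per query. That would be consistent with the CPU-bound profile described above.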


For those interested in profiler info, look here:

http://www.getopt.org/lucene/20051202/

Here's an example of elapsed times [ms] for IndexSearcher.search, and for getting the first 100 docs using Hits.doc(i):


19. Complex search1:
search: 1309
hits.doc: 4
19. Complex search2:
search: 2492
hits.doc: 5
19. Simple search:
search: 392
hits.doc: 5
20. Complex search1:
search: 1307
hits.doc: 5
20. Complex search2:
search: 2499
hits.doc: 5
20. Simple search:
search: 391
hits.doc: 5


I would appreciate any suggestions on how to proceed with this...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


