Hi,

I'm doing some performance profiling of a Nutch installation, working with relatively large individual indexes (10 million docs), and I'm puzzled by the results.

Here's the listing of the index:
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f0
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f1
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f2
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f3
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f4
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f5
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f6
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f7
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f8
-rw-r--r--  1 andrzej andrzej  2494445020 Dec  2 04:58 _0.fdt
-rw-r--r--  1 andrzej andrzej    78424800 Dec  2 04:58 _0.fdx
-rw-r--r--  1 andrzej andrzej          92 Dec  2 04:55 _0.fnm
-rw-r--r--  1 andrzej andrzej  7436259508 Dec  2 05:24 _0.frq
-rw-r--r--  1 andrzej andrzej 12885589796 Dec  2 05:24 _0.prx
-rw-r--r--  1 andrzej andrzej     3483642 Dec  2 05:24 _0.tii
-rw-r--r--  1 andrzej andrzej   280376933 Dec  2 05:24 _0.tis
-rw-r--r--  1 andrzej andrzej           4 Dec  2 05:25 deletable
-rw-r--r--  1 andrzej andrzej          27 Dec  2 05:25 segments
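
As a sanity check on the listing above, the exact document count can be recovered from the file sizes. This is a small sketch, assuming the pre-2.0 Lucene file format in which each norms file (_0.f0 to _0.f8) stores one byte per document and the field index (_0.fdx) stores one 8-byte pointer per document. The class and method names are illustrative only and are not part of Lucene or Nutch. Under those assumptions both files give 9,803,100 documents, which matches the "10 million docs" figure.

```java
class IndexSizeCheck {
    // Assumption: a norms file (.fN) holds exactly one byte per document.
    static long docsFromNorms(long normsBytes) {
        return normsBytes;
    }

    // Assumption: the field index (.fdx) holds one 8-byte pointer per document.
    static long docsFromFdx(long fdxBytes) {
        return fdxBytes / 8;
    }

    // Average number of bytes a file contributes per document.
    static double avgBytesPerDoc(long fileBytes, long docs) {
        return (double) fileBytes / docs;
    }

    public static void main(String[] args) {
        // Sizes copied from the directory listing in this message.
        long norms = 9803100L;
        long fdx = 78424800L;
        long frq = 7436259508L;
        long prx = 12885589796L;

        long docs = docsFromFdx(fdx);
        System.out.println("docs from .fdx:   " + docs);
        System.out.println("docs from norms:  " + docsFromNorms(norms));
        System.out.println(".frq bytes/doc:   " + avgBytesPerDoc(frq, docs));
        System.out.println(".prx bytes/doc:   " + avgBytesPerDoc(prx, docs));
    }
}
```

The per-document averages for _0.frq and _0.prx give a rough idea of how much postings data a query may have to walk through.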


I run it on an AMD Opteron 246, 2 GHz, 4 GB RAM; java -version says:

Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)

I run it with a heap of 1.5-2.5 GB, which doesn't make any difference (see below). I'm using the latest SVN code (from yesterday) + performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.

The performance is less than impressive, with response times of more than 1 sec. Nutch produces complex queries for phrases, so the user query "term1 term2" gets rewritten like this:

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0

For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up).
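
For anyone who wants to reproduce "repeatable, after warm-up" numbers, here is a minimal sketch of the kind of harness that produces them. It is not the code used for the measurements in this message. The class name, the method name, and the stand-in workload are all hypothetical; in a real test the Runnable would wrap the IndexSearcher.search() call.

```java
class WarmTiming {
    /**
     * Runs the task `warmup` times without timing it, then runs it `runs`
     * more times and returns the elapsed wall-clock time of each run in ms.
     */
    static long[] timeMillis(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // lets the JIT compile hot paths and the OS cache fill
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload; replace with the real search call.
        long[] sink = new long[1];
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i % 7;
            }
            sink[0] = sum; // keep the result so the loop is not optimized away
        };

        long[] times = timeMillis(task, 5, 3);
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + i + ": " + times[i] + " ms");
        }
    }
}
```

Discarding the warm-up runs matters here because the first few searches also pay for JIT compilation and cold file-system caches.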

Initially I thought the process was I/O or heap/GC bound (this is a large index, after all), but the profiler shows it's purely CPU bound. I tracked the bottleneck to the scorers (see my previous email on this), but also to IndexInput.readVInt. What's even more curious, most of the heap is unused. I had the impression that Lucene tries to read as much of the index as it can into memory in order to speed up access, but apparently that's not the case. Heap consumption was always on the order of 100-200 MB, no matter how large a heap I set (and I tried values between 1 and 4 GB).
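
To make the readVInt hot spot concrete, here is a minimal sketch of the variable-length integer encoding described in the Lucene file format documentation: 7 data bits per byte, low-order group first, with the high bit set when more bytes follow. This is an illustration written against plain java.io streams, not Lucene's actual source; the class name VIntSketch and the encode helper are hypothetical. Document numbers, frequencies and positions in the .frq and .prx files are stored this way, so a term that matches a large share of the documents needs at least one such decode per posting.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class VIntSketch {
    // Encodes a value as a VInt: 7 bits per byte, low-order bits first,
    // high bit set on every byte except the last.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes one VInt from the stream. Note the per-byte branch and shift:
    // this is the work that is repeated for every posting that is read.
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        if (b < 0) {
            throw new IllegalStateException("unexpected end of input");
        }
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            if (b < 0) {
                throw new IllegalStateException("unexpected end of input");
            }
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 300, 16384, Integer.MAX_VALUE};
        for (int v : samples) {
            byte[] bytes = encode(v);
            int decoded = readVInt(new ByteArrayInputStream(bytes));
            System.out.println(v + " -> " + bytes.length + " byte(s) -> " + decoded);
        }
    }
}
```

Because each value must be decoded byte by byte before the next one can even be located, this kind of work is CPU bound rather than I/O bound once the file data is in the OS cache, which would be consistent with the profiler results.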

For those interested in profiler info, look here:

http://www.getopt.org/lucene/20051202/

Here's an example of elapsed times [ms] for IndexSearcher.search, and for getting the first 100 docs using Hits.doc(i):

19. Complex search1:
search: 1309
hits.doc: 4
19. Complex search2:
search: 2492
hits.doc: 5
19. Simple search:
search: 392
hits.doc: 5
20. Complex search1:
search: 1307
hits.doc: 5
20. Complex search2:
search: 2499
hits.doc: 5
20. Simple search:
search: 391
hits.doc: 5


I would appreciate any suggestions on how to proceed with this...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


