Hi,
I'm doing some performance profiling of a Nutch installation, working
with relatively large individual indexes (10 million docs), and I'm puzzled
by the results.
Here's the listing of the index:
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f0
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f1
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f2
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f3
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f4
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:24 _0.f5
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:25 _0.f6
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:25 _0.f7
-rw-r--r-- 1 andrzej andrzej 9803100 Dec 2 05:25 _0.f8
-rw-r--r-- 1 andrzej andrzej 2494445020 Dec 2 04:58 _0.fdt
-rw-r--r-- 1 andrzej andrzej 78424800 Dec 2 04:58 _0.fdx
-rw-r--r-- 1 andrzej andrzej 92 Dec 2 04:55 _0.fnm
-rw-r--r-- 1 andrzej andrzej 7436259508 Dec 2 05:24 _0.frq
-rw-r--r-- 1 andrzej andrzej 12885589796 Dec 2 05:24 _0.prx
-rw-r--r-- 1 andrzej andrzej 3483642 Dec 2 05:24 _0.tii
-rw-r--r-- 1 andrzej andrzej 280376933 Dec 2 05:24 _0.tis
-rw-r--r-- 1 andrzej andrzej 4 Dec 2 05:25 deletable
-rw-r--r-- 1 andrzej andrzej 27 Dec 2 05:25 segments
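
As a sanity check on the document count, the file sizes above can be read back into a number of documents. This is only a minimal sketch, assuming the file layout of this Lucene generation: one 8-byte pointer per document in .fdx and one byte per document in each norms file (.f0 to .f8), with no file headers. The class and method names are made up for illustration.

```java
// Sketch only: derive the document count from the file sizes in the listing.
// Assumption: .fdx holds one 8-byte pointer per document and each norms file
// (.f0 ... .f8) holds one byte per document, with no file headers.
final class IndexSizeCheck {

    // One 8-byte offset into .fdt per document.
    static long docsFromFdx(long fdxBytes) {
        return fdxBytes / 8;
    }

    // One norm byte per document per indexed field.
    static long docsFromNorms(long normBytes) {
        return normBytes;
    }

    public static void main(String[] args) {
        long fdxBytes = 78424800L;   // size of _0.fdx from the listing
        long normBytes = 9803100L;   // size of each of _0.f0 ... _0.f8
        System.out.println("docs according to .fdx:  " + docsFromFdx(fdxBytes));
        System.out.println("docs according to norms: " + docsFromNorms(normBytes));
    }
}
```

Under these assumptions both files agree on 9803100 documents, which matches the "10 million" figure.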
I run it on an AMD Opteron 246, 2 GHz, 4 GB RAM; java -version says:
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)
I run it with a heap of 1.5-2.5 GB, which doesn't make any difference
(see below). I'm using the latest SVN code (from yesterday) +
performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.
The performance is less than impressive, response times being more than
1 sec. Nutch produces complex queries for phrases, so the user query
"term1 term2" gets rewritten like this:
+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5
host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2
title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0
anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1
term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0
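
To make the shape of this rewrite explicit, here is a rough sketch that rebuilds the query string above from a list of terms. It is not the Nutch code; it is plain string building with no Lucene dependency. The field names, boosts and slops are copied from the example, while the class name, the method names, and the rule that phrase clauses are added only for two or more terms are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: reproduces the textual form of the rewritten query shown above.
final class QueryExpansionSketch {

    // Fields, boosts and phrase slops as they appear in the example query.
    static final String[] FIELDS = {"url", "anchor", "content", "title", "host"};
    static final float[] BOOSTS = {4.0f, 2.0f, 1.0f, 1.5f, 2.0f};
    static final int[] SLOPS = {
        Integer.MAX_VALUE, 4, Integer.MAX_VALUE, Integer.MAX_VALUE, Integer.MAX_VALUE
    };

    // A boost of 1.0 is left out, as in the example (content:term1).
    static String boost(float b) {
        return b == 1.0f ? "" : "^" + b;
    }

    static String expand(String... terms) {
        List<String> clauses = new ArrayList<>();
        // One required clause per term, matching the term in any field.
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (int i = 0; i < FIELDS.length; i++) {
                perField.add(FIELDS[i] + ":" + term + boost(BOOSTS[i]));
            }
            clauses.add("+(" + String.join(" ", perField) + ")");
        }
        // Optional sloppy phrase clauses, one per field (assumed: only for 2+ terms).
        if (terms.length > 1) {
            String phrase = "\"" + String.join(" ", terms) + "\"";
            for (int i = 0; i < FIELDS.length; i++) {
                clauses.add(FIELDS[i] + ":" + phrase + "~" + SLOPS[i] + boost(BOOSTS[i]));
            }
        }
        return String.join(" ", clauses);
    }

    // Number of leaf queries (term or phrase) a scorer has to drive.
    static int leafCount(int termCount) {
        int phraseClauses = termCount > 1 ? FIELDS.length : 0;
        return termCount * FIELDS.length + phraseClauses;
    }

    public static void main(String[] args) {
        System.out.println(expand("term1", "term2"));
        System.out.println("leaf queries: " + leafCount(2));
    }
}
```

For the two-term example this comes to 10 term clauses plus 5 phrase clauses, so a single user query fans out into 15 leaf queries.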
For a simple TermQuery, if the DF(term) is above 10%, the response time
from IndexSearcher.search() is around 400ms (repeatable, after warm-up).
For such complex phrase queries the response time is around 1 sec or
more (again, after warm-up).
Initially I thought the process was I/O or heap/GC bound (this is a large
index after all), but the profiler shows it's purely CPU bound. I tracked
the bottleneck to the scorers (see my previous email on this), but also
to IndexInput.readVInt. What's even more curious, most of the heap is
unused. I had the impression that Lucene tries to read as much of the
index as it can into memory in order to speed up access, but
apparently that's not the case. Heap consumption was always on the
order of 100-200 MB, no matter how large a heap I set (and I tried values
between 1 and 4 GB).
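
For readers who haven't looked at readVInt: the postings files store most integers as VInts, with seven payload bits per byte and the high bit set when more bytes follow. The sketch below shows that encoding on a plain byte array. It is an illustration rather than the Lucene source; the class name and the array-based signatures are my own. It does suggest why the method shows up in a CPU profile: every document delta, frequency and position visited by a scorer goes through a loop like this, so a term with DF above 10% of this index means roughly a million or more decodes per query.

```java
// Sketch only: variable-length int encoding (7 bits per byte, high bit = "more").
final class VIntSketch {

    // Decodes one VInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encodes a non-negative int at buf[pos...] and returns the next free position.
    static int writeVInt(int value, byte[] buf, int pos) {
        while ((value & ~0x7F) != 0) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[pos++] = (byte) value;
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5];
        int end = writeVInt(300, buf, 0);
        int decoded = readVInt(buf, new int[] {0});
        System.out.println("bytes used: " + end + ", decoded: " + decoded);
    }
}
```

Small values (below 128) take a single byte, which keeps delta-encoded postings compact, at the cost of a branch per byte when reading them back.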
For those interested in profiler info, look here:
http://www.getopt.org/lucene/20051202/
Here's an example of elapsed times [ms] for IndexSearcher.search, and
for getting the first 100 docs using Hits.doc(i):
19. Complex search1:
    search: 1309
    hits.doc: 4
19. Complex search2:
    search: 2492
    hits.doc: 5
19. Simple search:
    search: 392
    hits.doc: 5
20. Complex search1:
    search: 1307
    hits.doc: 5
20. Complex search2:
    search: 2499
    hits.doc: 5
20. Simple search:
    search: 391
    hits.doc: 5
I would appreciate any suggestions on how to proceed with this...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com