(Moving the discussion to nutch-dev, please drop the cc: when responding)
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> It's nice to have these couple of percent... however, it doesn't solve
>> the main problem; I need a 50 percent or greater increase... :-) and I
>> suspect this can be achieved only by some radical changes in the way
>> Nutch uses Lucene. It seems the default query structure is too
>> complex to get decent performance.
>
> That would certainly help.
>
> For what it's worth, the Internet Archive has ~10M page Nutch indexes
> that perform adequately. See:
>
> http://websearch.archive.org/katrina/
Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when
the response time for any query is well below 1 second; otherwise the
service seems sluggish. Response times over 3 seconds are normally not
acceptable. This is just for a single concurrent query. The number of
concurrent queries will be a function of the number of concurrent users
and of the search response time, until it hits the limit on the number
of threads on the search servers. At that point, the time it takes to
return results gives us the maximum queries-per-second estimate, as in
the back-of-envelope sketch below.
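To make that estimate concrete (the thread count and latency here are
hypothetical numbers of my own, not measurements):

    // Back-of-envelope sketch: a search server with N handler threads
    // and a mean response time of t seconds saturates at about N / t
    // queries per second. Both inputs below are hypothetical.
    public class QpsEstimate {
        public static void main(String[] args) {
            int threads = 50;          // hypothetical thread limit per server
            double meanResponse = 1.5; // hypothetical mean latency, seconds
            System.out.printf("max sustained QPS ~= %.1f%n",
                              threads / meanResponse);
        }
    }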
There is a total of 8,435,793 pages in that index. Here's a short list
of queries, with the number of matching pages and the average time (I
made just a couple of tests, no stress-loading ;-) ):
* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds
> The performance is about what you report, but it is quite usable.
> (Please don't stress-test this server!) We recently built a ~100M
> page Nutch index at the Internet Archive that is surprisingly usable
> on a single CPU. (This is not yet publicly accessible.)
What I found out is that "usable" depends a lot on how you test it and
on what your minimum expectation is. There are some high-frequency
terms (by which I mean terms with a document frequency around 25%) that
consistently cause a dramatic slowdown. Multi-term queries may take
even longer, because of the way Nutch expands them into sloppy phrases
(see the sketch below), so even for such a relatively small index (from
the POV of the whole Internet!) the response time may drag out to
several seconds (try "com").
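To make the expansion concrete, here is roughly the shape of the Lucene
query Nutch builds for a two-term query like "hurricane katrina". This
is my simplified, single-field sketch against the Lucene 1.9/2.x-era
API; the real query-basic plugin also spans the url, anchor, title and
host fields with per-field boosts, and the slop and boost values here
are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class NutchStyleQuery {
        public static void main(String[] args) {
            // Each term must match in the content field...
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("content", "hurricane")),
                  BooleanClause.Occur.MUST);
            q.add(new TermQuery(new Term("content", "katrina")),
                  BooleanClause.Occur.MUST);

            // ...and an optional sloppy phrase rewards documents where
            // the terms appear near each other.
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("content", "hurricane"));
            phrase.add(new Term("content", "katrina"));
            phrase.setSlop(20);     // hypothetical slop value
            phrase.setBoost(1.0f);  // hypothetical phrase boost
            q.add(phrase, BooleanClause.Occur.SHOULD);

            System.out.println(q);  // prints the expanded query
        }
    }

The sloppy PhraseQuery is the expensive part: scoring it means reading
positions for every posting of every term, which is exactly where
high-frequency terms like "com" blow up.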
> Perhaps your traffic will be much higher than the Internet Archive's,
> or you may have contractual obligations that specify a certain average
> query performance, but, if not, ~10M pages is quite searchable using
> Nutch on a single CPU.
I'm not concerned about the traffic - I believe distributed search can
handle a lot of traffic if need be. What I'm concerned about is the
maximum response time from individual search servers, because the
front-end response time is determined by the longest response time from
any of the (active) search servers - and the toy model below shows how
quickly those per-server tails compound. Response times over 1 second
from a 10 million page collection are IMHO not adequate, because the
service will appear slow. Response times of several seconds would mean
that users say goodbye and never return... ;-)
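(The probabilities below are toy numbers of my own, not measurements:
if each server independently answers within 1 second with probability
p, a front end that waits for all S servers does so with probability
p^S.)

    // Toy model of front-end tail latency: the front end is only as
    // fast as its slowest search server. All inputs are hypothetical.
    public class TailLatency {
        public static void main(String[] args) {
            double p = 0.95; // hypothetical P(one server answers < 1s)
            for (int servers : new int[] {1, 10, 50, 100}) {
                System.out.printf("%3d servers: P(front end < 1s) = %.3f%n",
                                  servers, Math.pow(p, servers));
            }
        }
    }

With these numbers, 10 servers already drop below 60%, and 100 servers
almost never answer within 1 second.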
If 10 million documents is too much for a single server to meet such a
performance target, then the total number of servers required for
Internet-wide collections of billions of pages explodes - e.g. 2
billion pages at 10 million per server already means 200 search
servers, before any replication to handle load...
So I think it's time to re-think the query structure and scoring
mechanisms in order to simplify the Lucene queries generated by Nutch -
or to find some other tricks...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com