(Moving the discussion to nutch-dev, please drop the cc: when responding)

Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> It's nice to have these couple of percent... however, it doesn't solve the main problem; I need an increase of 50 percent or more... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too complex to get decent performance.
>
> That would certainly help.
>
> For what it's worth, the Internet Archive has ~10M page Nutch indexes that perform adequately. See:
>
> http://websearch.archive.org/katrina/


Hmm... Please define what "adequate" means. :-) IMHO, "adequate" means that for any query the response time stays well below 1 second; otherwise the service feels sluggish, and response times over 3 seconds are normally not acceptable. That's for a single concurrent query - the number of concurrent queries grows with the number of concurrent users until it hits the limit of the number of threads on the search servers. From there, the time it takes to return results gives us the maximum concurrent queries-per-second estimate.
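
As a back-of-the-envelope check (the thread count below is just an example, not a measurement): a server with a fixed pool of handler threads can sustain at most roughly threads divided by response time, i.e.

  max qps ~= threads / avg. response time
  e.g. 50 threads / 1.75 s per query ~= 28 qps per server

Beyond that, queries queue up and response times degrade even further.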

There are a total of 8,435,793 pages in that index. Here's a short list of queries, with the number of matching pages and the average time (I ran just a couple of tests, no stress-loading ;-) ):

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds


> The performance is about what you report, but it is quite usable. (Please don't stress-test this server!) We recently built a ~100M page Nutch index at the Internet Archive that is surprisingly usable on a single CPU. (This is not yet publicly accessible.)


What I found out is that "usable" depends a lot on how you test it and what your minimum expectation is. There are some high-frequency terms (by which I mean terms with a document frequency around 25%) that will consistently cause a dramatic slowdown. Multi-term queries, because of the way Nutch expands them into sloppy phrases, may take even more time, so even for such a relatively small index (from the POV of the whole Internet!) the response time may drag out to several seconds (try "com").
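
To make the expansion concrete, here is a rough sketch against the Lucene 1.x API - the field list, slop and boost values are illustrative stand-ins, not the exact ones Nutch uses (those live in the Nutch query filter plugins):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// Rough sketch: what a user query like "hurricane katrina" expands into.
public class ExpansionSketch {
  static BooleanQuery expand(String[] terms) {
    String[] fields = { "url", "anchor", "content", "title", "host" };
    BooleanQuery bq = new BooleanQuery();
    for (int f = 0; f < fields.length; f++) {
      // one optional term clause per term per field...
      for (int t = 0; t < terms.length; t++) {
        bq.add(new TermQuery(new Term(fields[f], terms[t])), false, false);
      }
      // ...plus a sloppy phrase over all the terms in that field
      PhraseQuery pq = new PhraseQuery();
      for (int t = 0; t < terms.length; t++) {
        pq.add(new Term(fields[f], terms[t]));
      }
      pq.setSlop(20);      // illustrative slop value
      pq.setBoost(2.0f);   // illustrative boost value
      bq.add(pq, false, false);
    }
    // for 2 terms this is already 5 fields x (2 terms + 1 phrase) = 15 clauses
    return bq;
  }
}

The sloppy phrases are the expensive part: each one has to read the full position lists of every term involved, which is exactly what hurts for a term like "com" that occurs in a quarter of all documents.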


> Perhaps your traffic will be much higher than the Internet Archive's, or you have contractual obligations that specify certain average query performance, but, if not, ~10M pages is quite searchable using Nutch on a single CPU.


I'm not concerned about the traffic - I believe distributed search can handle a lot of traffic if need be. What I'm concerned about is the maximum response time of individual search servers, because the front-end response time is determined by the longest response time from any of the (active) search servers. Response times over 1 sec. from a 10 mln page collection are IMHO not adequate, because the service will appear slow. Response times of several seconds would mean that users say goodbye and never return... ;-)
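
In other words, the front-end sees roughly

  t_frontend ~= max(t_server_1, ..., t_server_N) + merge overhead

and the max makes tail latencies dominate: assuming (purely for illustration) that each server independently exceeds 1 second on 10% of queries, then with 20 servers about 1 - 0.9^20 ~= 88% of front-end queries would exceed 1 second.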

If 10 mln docs is too much for a single server to meet such a performance target, then this explodes the total number of servers required to handle Internet-wide collections of billions of pages...

So, I think it's time to re-think the query structure and scoring mechanisms, in order to simplify the Lucene queries generated by Nutch - or to do some other tricks...
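
One possible trick, sketched below (untested - the helper is hypothetical, not existing Nutch code, and the 10% document-frequency cutoff is an arbitrary value I picked for illustration): only add the expensive sloppy-phrase clause when every term in it is reasonably rare:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Untested sketch: decide per field whether the sloppy-phrase clause is
// worth adding, based on the document frequency of the terms involved.
public class PhraseCutoffSketch {
  static boolean worthPhrase(IndexReader reader, String field, String[] terms)
      throws IOException {
    int maxDoc = reader.maxDoc();
    for (int i = 0; i < terms.length; i++) {
      // a term in more than ~10% of all docs makes the phrase scan expensive
      if (reader.docFreq(new Term(field, terms[i])) > maxDoc / 10) {
        return false;
      }
    }
    return true;
  }
}

Queries like "com" would then degrade to plain term queries instead of dragging position lists through a quarter of the index; whether the scoring stays acceptable is exactly the part that needs re-thinking.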

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


