(Moving the discussion to nutch-dev, please drop the cc: when responding)
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> It's nice to have these couple of percent... however, it doesn't solve
>> the main problem; I need a 50 percent or greater increase... :-) and I
>> suspect this can be achieved only by some radical changes in the way
>> Nutch uses Lucene. It seems the default query structure is too
>> complex to get decent performance.
>
> That would certainly help.
>
> For what it's worth, the Internet Archive has ~10M page Nutch indexes
> that perform adequately. See:
>
> http://websearch.archive.org/katrina/
Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when
the response time for any query is well below 1 second; otherwise the
service seems sluggish. Response times over 3 seconds are normally not
acceptable. This is just for a single concurrent query. The number of
concurrent queries will be a function of the number of concurrent users
and of the search response time, until it hits the limit on the number
of threads on the search servers. At that point, the time it takes to
return results gives us the maximum queries-per-second estimate, as in
the back-of-envelope sketch below.
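To make that estimate concrete (the thread count and latency here are
hypothetical numbers of my own, not measurements):

    // Back-of-envelope sketch: a search server with N handler threads
    // and a mean response time of t seconds saturates at about N / t
    // queries per second. Both inputs below are hypothetical.
    public class QpsEstimate {
        public static void main(String[] args) {
            int threads = 50;          // hypothetical thread limit per server
            double meanResponse = 1.5; // hypothetical mean latency, seconds
            System.out.printf("max sustained QPS ~= %.1f%n",
                              threads / meanResponse);
        }
    }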
There is a total of 8,435,793 pages in that index. Here's a short list
of queries, with the number of matching pages and the average time (I
made just a couple of tests, no stress-loading ;-) ):
* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds
> The performance is about what you report, but it is quite usable.
> (Please don't stress-test this server!) We recently built a ~100M
> page Nutch index at the Internet Archive that is surprisingly usable
> on a single CPU. (This is not yet publicly accessible.)
What I found out is that "usable" depends a lot on how you test it and
on what your minimum expectation is. There are some high-frequency
terms (by which I mean terms with a document frequency around 25%) that
consistently cause a dramatic slowdown. Multi-term queries may take
even longer, because of the way Nutch expands them into sloppy phrases
(see the sketch below), so even for such a relatively small index (from
the POV of the whole Internet!) the response time may drag out to
several seconds (try "com").
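To make the expansion concrete, here is roughly the shape of the Lucene
query Nutch builds for a two-term query like "hurricane katrina". This
is my simplified, single-field sketch against the Lucene 1.9/2.x-era
API; the real query-basic plugin also spans the url, anchor, title and
host fields with per-field boosts, and the slop and boost values here
are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class NutchStyleQuery {
        public static void main(String[] args) {
            // Each term must match in the content field...
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("content", "hurricane")),
                  BooleanClause.Occur.MUST);
            q.add(new TermQuery(new Term("content", "katrina")),
                  BooleanClause.Occur.MUST);

            // ...and an optional sloppy phrase rewards documents where
            // the terms appear near each other.
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("content", "hurricane"));
            phrase.add(new Term("content", "katrina"));
            phrase.setSlop(20);     // hypothetical slop value
            phrase.setBoost(1.0f);  // hypothetical phrase boost
            q.add(phrase, BooleanClause.Occur.SHOULD);

            System.out.println(q);  // prints the expanded query
        }
    }

The sloppy PhraseQuery is the expensive part: scoring it means reading
positions for every posting of every term, which is exactly where
high-frequency terms like "com" blow up.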
> Perhaps your traffic will be much higher than the Internet Archive's,
> or you may have contractual obligations that specify a certain average
> query performance, but, if not, ~10M pages is quite searchable using
> Nutch on a single CPU.
I'm not concerned about the traffic - I believe distributed search can
handle a lot of traffic if need be. What I'm concerned about is the
maximum response time from individual search servers, because the
front-end response time is determined by the longest response time from
any of the (active) search servers - and the toy model below shows how
quickly those per-server tails compound. Response times over 1 second
from a 10 million page collection are IMHO not adequate, because the
service will appear slow. Response times of several seconds would mean
that users say goodbye and never return... ;-)
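(The probabilities below are toy numbers of my own, not measurements:
if each server independently answers within 1 second with probability
p, a front end that waits for all S servers does so with probability
p^S.)

    // Toy model of front-end tail latency: the front end is only as
    // fast as its slowest search server. All inputs are hypothetical.
    public class TailLatency {
        public static void main(String[] args) {
            double p = 0.95; // hypothetical P(one server answers < 1s)
            for (int servers : new int[] {1, 10, 50, 100}) {
                System.out.printf("%3d servers: P(front end < 1s) = %.3f%n",
                                  servers, Math.pow(p, servers));
            }
        }
    }

With these numbers, 10 servers already drop below 60%, and 100 servers
almost never answer within 1 second.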
If 10 million documents is too much for a single server to meet such a
performance target, then the total number of servers required for
Internet-wide collections of billions of pages explodes - e.g. 2
billion pages at 10 million per server already means 200 search
servers, before any replication to handle load...
So I think it's time to re-think the query structure and scoring
mechanisms in order to simplify the Lucene queries generated by Nutch -
or to find some other tricks...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com