Paul Elschot wrote:
Querying the host field like this in a web page index can be dangerous
business. For example, when term1 is "wikipedia" and term2 is "org",
the query will match at least all pages from wikipedia.org.
Note that if you search for wikipedia.org in Nutch, it is interpreted
as an implicit phrase query and is expanded differently, as:
+(url:"wikipedia org"^4.0
anchor:"wikipedia org"^2.0
content:"wikipedia org"
title:"wikipedia org"^1.5
host:"wikipedia org"^2.0)
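To make the expansion above concrete, here is a minimal sketch in Python. It is not Nutch's query code: the function name expand_phrase and the FIELD_BOOSTS table are hypothetical, and the boosts are simply copied from the expansion shown above. It assumes the query is split on non-alphanumeric characters, which is how wikipedia.org becomes the phrase "wikipedia org".

```python
import re

# Hypothetical boost table; the values are copied from the expansion above.
FIELD_BOOSTS = [
    ("url", 4.0),
    ("anchor", 2.0),
    ("content", 1.0),
    ("title", 1.5),
    ("host", 2.0),
]


def expand_phrase(query):
    """Expand a raw query into one required group of per-field phrase clauses."""
    # Assumption: splitting on non-alphanumerics turns "wikipedia.org"
    # into the two terms "wikipedia" and "org".
    terms = [t for t in re.split(r"[^0-9A-Za-z]+", query.lower()) if t]
    phrase = " ".join(terms)
    clauses = []
    for field, boost in FIELD_BOOSTS:
        clause = f'{field}:"{phrase}"'
        if boost != 1.0:
            # A boost of 1.0 is the default, so it is left implicit.
            clause += f"^{boost}"
        clauses.append(clause)
    return "+(" + "\n".join(clauses) + ")"


print(expand_phrase("wikipedia.org"))
```

Because every clause is a phrase, a page only matches when "wikipedia" is immediately followed by "org" in some field, rather than whenever both terms occur somewhere.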
Note also that Nutch can use N-grams for common terms. One can thus
configure things so that the query would instead be expanded as:
+(url:"wikipedia-org"^4.0
anchor:"wikipedia org"^2.0
content:"wikipedia org"
title:"wikipedia org"^1.5
host:"wikipedia-org"^2.0)
where wikipedia-org is a bigram term indexed when org occurs in the url
or host field.
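The bigram idea can be sketched roughly as follows. This is only an illustration in Python, not Nutch's actual analyzer: the names COMMON_TERMS, index_terms, and query_terms are hypothetical, and it assumes that "org" has been configured as a common term for the url and host fields.

```python
# Assumption: "org" is configured as a common term for the url and host fields.
COMMON_TERMS = {"org"}


def index_terms(tokens, common=COMMON_TERMS):
    """Index-time sketch: keep every token and add a bigram next to common terms."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common or nxt in common):
            # Emit the joined bigram alongside the plain terms.
            out.append(f"{tok}-{nxt}")
    return out


def query_terms(tokens, common=COMMON_TERMS):
    """Query-time sketch: replace a pair containing a common term with its bigram."""
    out = []
    i = 0
    while i < len(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tokens[i] in common or nxt in common):
            out.append(f"{tokens[i]}-{nxt}")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(index_terms(["en", "wikipedia", "org"]))
print(query_terms(["wikipedia", "org"]))
```

The point of the bigram is that the host clause becomes a lookup of the single, relatively rare term wikipedia-org instead of a phrase match that has to walk the very long posting list for org.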
Doug