You might also be interested in https://issues.apache.org/jira/browse/ LUCENE-755 (aka the Payloads patch) which will enable storing information at the token level and allow for plugging in more scoring options related to it.

There has been a variety of discussions over on java-dev related to topics that would enable things like PageRank, etc. at a lower level in Lucene. Search java-dev for "Flexible Indexing", "Payloads", etc.

You will also find additional Lucene scoring information at http:// lucene.apache.org/java/docs/scoring.html

-Grant

On Jan 22, 2007, at 4:44 PM, Nicolas Lalevée wrote:

Le Lundi 22 Janvier 2007 20:44, EDMOND KEMOKAI a écrit :
Hmm..doesn't lucene scoring determine how relevant a document is to your
query? That is what PageRank and HITS do as well, I believe. Page and
document are the same, if you want to index a page you'll obviously try to convert it into a document. PageRank does link analysis to determine how relevant that page is as it relates to the query you entered, does lucene have something similar? How does lucene determine between two documents which one should score higher if they both contain a certain term? Google
uses PageRank to make that determination, how does lucene do it?

The thing is that Google can use both its PageRank and Lucene.

The PageRank caculation doesn't depend on the query. It's value is quite
absolute. It depends of how the other pages refer to it.

And the Lucene scoring is calculating how a document (document is the generic Lucene term, beacause Lucene can be used for other things that web search) match the query. Searching for "foo bar", a document with no "foo", no "bar" in it it scored at 0, even more not scored at all. A document with one "foo", but no "bar" will be scored to 0.5, with one "foo" and one "bar", separated with 15 words, will be scored 0.8, and a document with "foo bar", scored to
1.0.

Then you can combine the both, as Google is probably doing. Same exemple : - a page with no "foo", neither "bar", whatever is its page rank, won't be
scored.
- a page with a "foo" and a "bar", with a low page rank, let say 0.5
- a page with a "foo" and a "bar", with a high page rank, let say 0.9

If you want to know more about Lucene scoring and Page rank, you should go see
Nutch and its mailing list : http://lucene.apache.org/nutch/

Nicolas


On 1/22/07, Nicolas Lalevée <[EMAIL PROTECTED]> wrote:
Le Lundi 22 Janvier 2007 19:33, EDMOND KEMOKAI a écrit:
Hi All
This is a question for those familiar with lucene document scoring. How
does it compare with googles PageRank or HITS, or are they very

different?

I have being looking at the PageRank algorithm but I'll need to

brush-off

my math skills before delving into it:)

In fact Lucene is just a search engine. Then you can use the search
engine to
search in web pages, like Nutch is using Lucene. And Google is more like Nutch : a web crawler plus a web-search engine. So when you are taking
about
page raking, it has nothing to do with Lucene scoring. Lucene scoring is
how
about the result entry match your query. Page raking is more about how relevant is the web page. So for a document, the Lucene scoring depends
on the query, and the page raking is quite absolute.

Nicolas

-------------------------------------------------------------------- -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to