Klaus wrote:
I have tried to study the Lucene scoring in the default similarity. Can anyone explain to me how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in Lucene. Why is this better than the normal cosine measure?
It degenerates to the normal cosine measure. So it's the cosine measure with a few bells and whistles.
The tf(), idf(), lengthNorm() and queryNorm() are directly from the cosine measure, although lengthNorm()'s default implementation uses an approximation.
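To make that correspondence concrete, here is a rough Python sketch rather than Lucene's actual code. The function names are hypothetical, and the formulas are assumptions modeled on what I understand the classic defaults to be: square-root tf, a smoothed log idf, and a length norm of 1/sqrt(number of terms). With square-root tf, the squared tf values of a document sum to its token count, so that cheap length norm is the exact Euclidean norm with the idf weights left out, which is one way to read "an approximation".

```python
import math
from collections import Counter

def tf(freq):
    # Assumed: damp raw term frequency with a square root.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Assumed: smoothed log inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def weights(tokens, doc_freqs, num_docs):
    # tf-idf weight for every distinct term in a token list.
    counts = Counter(tokens)
    return {t: tf(c) * idf(doc_freqs.get(t, 0), num_docs)
            for t, c in counts.items()}

def dot(q, d):
    # Sum of weight products over the terms both vectors share.
    return sum(w * d[t] for t, w in q.items() if t in d)

def query_norm(q):
    # 1 / Euclidean length of the query vector.
    return 1.0 / math.sqrt(sum(w * w for w in q.values()))

def cosine(query, doc, doc_freqs, num_docs):
    # Textbook cosine: dot product divided by both vector lengths.
    q = weights(query, doc_freqs, num_docs)
    d = weights(doc, doc_freqs, num_docs)
    exact_length_norm = 1.0 / math.sqrt(sum(w * w for w in d.values()))
    return query_norm(q) * dot(q, d) * exact_length_norm

def approx_score(query, doc, doc_freqs, num_docs):
    # Same factors, but the document length is approximated by the
    # token count alone, so it can be stored without knowing idf.
    q = weights(query, doc_freqs, num_docs)
    d = weights(doc, doc_freqs, num_docs)
    length_norm = 1.0 / math.sqrt(len(doc))
    return query_norm(q) * dot(q, d) * length_norm
```

When every term has the same idf of 1, cosine() and approx_score() agree; once idf varies across terms, only the per-document normalization differs between the two.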
So the non-standard bits are getBoost(), which permits incorporation of a priori document weights, like PageRank, and coord(), which makes ORs more AND-like. Cosine is OR-based, but, for short queries over large collections, AND tends to give better results than OR.
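A toy Python illustration of those two extras, again a sketch and not Lucene's code. The score() helper and the per-term numbers are made up for the example, and I am assuming coord is simply the fraction of query terms a document matches and that the boost is a plain multiplier.

```python
def coord(overlap, max_overlap):
    # Assumed: fraction of the query's terms found in the document.
    return overlap / max_overlap

def score(term_scores, query_terms, doc_boost=1.0):
    # term_scores holds hypothetical, already-computed per-term
    # contributions (the cosine part) for a single document.
    matched = [t for t in query_terms if t in term_scores]
    base = sum(term_scores[t] for t in matched)
    return doc_boost * coord(len(matched), len(query_terms)) * base

query = ["lucene", "scoring"]
doc_a = {"lucene": 0.9}                   # one term, strong match
doc_b = {"lucene": 0.4, "scoring": 0.4}   # both terms, weaker matches
```

Under a pure OR sum, doc_a (0.9) beats doc_b (0.8). With coord, doc_a is halved because it matches only one of the two terms, so doc_b comes out on top, which is the AND-like behavior. Passing doc_boost=2.0 for doc_a lifts it back above doc_b, the way an a priori weight would.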
Doug