Sorry, the image wasn't sent: http://wwwhome.cs.utwente.nl/~trieschn/bm25.PNG
> -----Original Message-----
> From: Trieschnigg, R.B. (Dolf) [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 10:54
> To: java-user@lucene.apache.org
> Subject: RE: BM25 Similarity implementation
>
> > > I would like to implement the Okapi BM25 weighting function using
> > > my own Similarity implementation. Unfortunately, BM25 requires the
> > > document length in the score calculation, which is not provided by
> > > the Scorer.
> >
> > How do you want to measure document length? If the number of tokens
> > is an acceptable measure, then the norm contains 1/sqrt(numTokens)
> > by default. You can modify your Similarity.lengthNorm()
> > implementation to not perform the sqrt, or square the norm.
>
> I assume the number of tokens will be a good estimate.
>
> I've included an image with the algorithm (my ASCII art isn't that
> good). Legend of the figure:
> - k1, k3 and b are constants
> - tf is the within-document term frequency
> - df is the document frequency
> - N is the collection size
> - r is the number of relevant documents containing a particular term
>   (without relevance information, assumed to be 0)
> - R is the number of items known to be relevant to a specific topic
>   (without relevance information, assumed to be 0)
>
> As far as I understand, Lucene multiplies the squared weight by the
> result of Similarity.lengthNorm(), but BM25 requires the document
> length for the calculation of the document term weighting (as far as
> I know, it is not possible to extract the influence of the
> normalization as a constant multiplier).
>
> Am I missing something here?
>
> Dolf
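For readers who cannot retrieve the image: the legend above matches the
classic Okapi BM25 formulation of Robertson et al. A reconstruction is
sketched below; dl (document length), avdl (average document length in
the collection) and qtf (within-query term frequency) do not appear in
the legend and are assumed here.

    % Okapi BM25 with the Robertson/Sparck Jones relevance weight.
    % Assumed symbols not in the legend: dl = document length,
    % avdl = average document length, qtf = within-query term frequency.
    \[
      w^{(1)} = \log \frac{(r + 0.5)\,/\,(R - r + 0.5)}
                          {(df - r + 0.5)\,/\,(N - df - R + r + 0.5)}
    \]
    \[
      \mathit{score}(D, Q) = \sum_{t \in Q}
          w^{(1)} \cdot \frac{(k_1 + 1)\, tf}{K + tf}
                  \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
      \qquad
      K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)
    \]

Note that dl enters through K, inside the denominator of the tf factor,
so the length normalization acts per term and cannot be pulled out as a
constant multiplier of the document score, which is exactly the problem
raised above.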
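One way to make the length recoverable, building on the lengthNorm()
suggestion quoted above, is to override it so the stored norm is
1/numTokens rather than the default 1/sqrt(numTokens). A minimal sketch
against the Lucene 1.9/2.x-era Similarity API discussed in this thread
(the class name is made up for illustration):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: store 1/numTokens as the field norm instead of the default
    // 1/sqrt(numTokens), so the document length can be recovered at
    // search time. Norms are encoded in a single byte, so the recovered
    // length is only approximate.
    public class LengthPreservingSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            // Guard against empty fields; default is 1/sqrt(numTokens).
            return numTokens > 0 ? 1.0f / numTokens : 0.0f;
        }
    }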
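At search time the length can then be read back from the norms array.
Another hedged sketch: DocLength and docLength() are invented names, and
the estimate assumes the similarity above was used at index time and
that no index-time boosts were applied (boosts are multiplied into the
norm and would distort the estimate):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Similarity;

    public class DocLength {
        // Recover the approximate token count of a document from its
        // stored norm, assuming LengthPreservingSimilarity was used at
        // index time and no field or document boosts were applied.
        public static float docLength(IndexReader reader, String field,
                int doc) throws IOException {
            byte[] norms = reader.norms(field);             // one byte per doc
            float norm = Similarity.decodeNorm(norms[doc]); // ~ 1/numTokens
            return 1.0f / norm;                             // ~ numTokens
        }
    }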