Sorry, the image wasn't sent: http://wwwhome.cs.utwente.nl/~trieschn/bm25.PNG
> -----Original Message-----
> From: Trieschnigg, R.B. (Dolf) [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 10:54
> To: java-user@lucene.apache.org
> Subject: RE: BM25 Similarity implementation
>
> > > I would like to implement the Okapi BM25 weighting function using
> > > my own Similarity implementation. Unfortunately, BM25 requires the
> > > document length in the score calculation, which is not provided by
> > > the Scorer.
> >
> > How do you want to measure document length? If the number of tokens
> > is an acceptable measure, then the norm contains 1/sqrt(numTokens)
> > by default. You can modify your Similarity.lengthNorm()
> > implementation to not perform the sqrt, or square the norm.
>
> I assume the number of tokens will be a good estimate.
>
> I've included an image with the algorithm (my ASCII art isn't that
> good). Legend of the figure:
> - k1, k3 and b are constants
> - tf is the within-document term frequency
> - df is the document frequency
> - N is the collection size
> - r is the number of relevant documents containing a particular term
>   (without relevance information, assumed to be 0)
> - R is the number of items known to be relevant to a specific topic
>   (without relevance information, assumed to be 0)
>
> As far as I understand, Lucene multiplies the squared weight by the
> result of Similarity.lengthNorm(), but BM25 requires the document
> length for the calculation of the document term weighting (as far as
> I know, it is not possible to extract the influence of the
> normalization as a constant multiplier).
>
> Am I missing something here?
>
> Dolf
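For readers who cannot retrieve the image: the legend above matches the
classic Okapi BM25 formulation of Robertson et al. A reconstruction is
sketched below; dl (document length), avdl (average document length in
the collection) and qtf (within-query term frequency) do not appear in
the legend and are assumed here.

    % Okapi BM25 with the Robertson/Sparck Jones relevance weight.
    % Assumed symbols not in the legend: dl = document length,
    % avdl = average document length, qtf = within-query term frequency.
    \[
      w^{(1)} = \log \frac{(r + 0.5)\,/\,(R - r + 0.5)}
                          {(df - r + 0.5)\,/\,(N - df - R + r + 0.5)}
    \]
    \[
      \mathit{score}(D, Q) = \sum_{t \in Q}
          w^{(1)} \cdot \frac{(k_1 + 1)\, tf}{K + tf}
                  \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
      \qquad
      K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)
    \]

Note that dl enters through K, inside the denominator of the tf factor,
so the length normalization acts per term and cannot be pulled out as a
constant multiplier of the document score, which is exactly the problem
raised above.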
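One way to make the length recoverable, building on the lengthNorm()
suggestion quoted above, is to override it so the stored norm is
1/numTokens rather than the default 1/sqrt(numTokens). A minimal sketch
against the Lucene 1.9/2.x-era Similarity API discussed in this thread
(the class name is made up for illustration):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: store 1/numTokens as the field norm instead of the default
    // 1/sqrt(numTokens), so the document length can be recovered at
    // search time. Norms are encoded in a single byte, so the recovered
    // length is only approximate.
    public class LengthPreservingSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            // Guard against empty fields; default is 1/sqrt(numTokens).
            return numTokens > 0 ? 1.0f / numTokens : 0.0f;
        }
    }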
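At search time the length can then be read back from the norms array.
Another hedged sketch: DocLength and docLength() are invented names, and
the estimate assumes the similarity above was used at index time and
that no index-time boosts were applied (boosts are multiplied into the
norm and would distort the estimate):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Similarity;

    public class DocLength {
        // Recover the approximate token count of a document from its
        // stored norm, assuming LengthPreservingSimilarity was used at
        // index time and no field or document boosts were applied.
        public static float docLength(IndexReader reader, String field,
                int doc) throws IOException {
            byte[] norms = reader.norms(field);             // one byte per doc
            float norm = Similarity.decodeNorm(norms[doc]); // ~ 1/numTokens
            return 1.0f / norm;                             // ~ numTokens
        }
    }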