-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, February 18, 2010 3:09 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

Yuval, don't we still need this 'document-level IDF' for BM25f?

- Yes, we do need 'document-level IDF' for BM25f.
- Joaquin tried to bypass this by using the IDF of the field having the
  longest average length instead of the document's IDF.
- This introduces some bias into the scoring formula, but maybe it is not
  too large...

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yuv...@answers.com> wrote:

> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal with
> documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think I
> saw some reference to this when Grant described the new features in
> Lucene 2.9. We need to store two numbers (say number of documents
> containing the field and average length) to support incremental indexing.
> b. Easy integration of BM25F similarity - default parameter values, working
> with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We
> could do this incrementally, throwing an "UnsupportedOperationException"
> in the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
> default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
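
To make the IDF question in this thread concrete, here is a minimal sketch in plain Python. It is not Lucene code and not Joaquin's patch. It shows (1) the two per-field numbers from item (a), document count and average length, updated incrementally, (2) a BM25F term score that needs a single IDF per term, and (3) the workaround described above, which takes the IDF from the field with the longest average length. All function names, the field names, the per-field boosts and b values, k1 = 1.2, and the "+1 inside the log" IDF variant are assumptions chosen for illustration.

```python
import math


def update_field_stats(stats, field_length):
    """Fold one more document into a field's (doc_count, avg_length) pair.

    Storing these two numbers is enough to keep the average current
    during incremental indexing, without re-reading older documents.
    """
    count, avg = stats
    new_count = count + 1
    new_avg = avg + (field_length - avg) / new_count
    return (new_count, new_avg)


def bm25_idf(num_docs, doc_freq):
    """A common BM25 IDF variant (assumed here); never negative."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def longest_field_idf(num_docs, df_by_field, avg_len_by_field):
    """The workaround from the thread: use the IDF of the field with the
    longest average length instead of a document-level IDF."""
    longest = max(avg_len_by_field, key=avg_len_by_field.get)
    return bm25_idf(num_docs, df_by_field[longest])


def bm25f_term_score(tf_by_field, len_by_field, avg_len_by_field,
                     boosts, b_by_field, idf, k1=1.2):
    """BM25F score of one term in one document.

    Per-field term frequencies are length-normalized, boosted and summed
    first; saturation and the single IDF are applied once to the sum.
    """
    weighted_tf = 0.0
    for field, tf in tf_by_field.items():
        avg_len = avg_len_by_field[field]
        if avg_len <= 0:
            continue  # no length statistics for this field yet
        b = b_by_field[field]
        norm = 1.0 - b + b * (len_by_field[field] / avg_len)
        weighted_tf += boosts[field] * tf / norm
    return idf * weighted_tf / (k1 + weighted_tf)


if __name__ == "__main__":
    # Hypothetical two-field index with 100 documents.
    avg_len = {"title": 4.0, "body": 120.0}
    df_by_field = {"title": 5, "body": 20}
    doc_level_df = 22  # documents containing the term in any field

    exact = bm25_idf(100, doc_level_df)
    approx = longest_field_idf(100, df_by_field, avg_len)
    print("document-level IDF:", exact)
    print("longest-field IDF: ", approx)

    score = bm25f_term_score(
        tf_by_field={"title": 1, "body": 2},
        len_by_field={"title": 4, "body": 120},
        avg_len_by_field=avg_len,
        boosts={"title": 2.0, "body": 1.0},
        b_by_field={"title": 0.75, "body": 0.75},
        idf=exact,
    )
    print("BM25F term score:  ", score)
```

In this sketch, a term's document frequency in any single field can never exceed its document-level frequency, so the longest-field IDF is never smaller than the document-level IDF. That is one way to see the bias mentioned above: it grows with the number of documents that contain the term only outside the longest field.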