Yuval, don't we still need this 'document-level IDF' for BM25f?

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yuv...@answers.com> wrote:
> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal
> with documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think
>    I saw some reference to this when Grant described the new features in
>    Lucene 2.9. We need to store two numbers (say number of documents
>    containing the field and average length) to support incremental
>    indexing.
> b. Easy integration of BM25F similarity - default parameter values,
>    working with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc.
>    (We could do this incrementally, throwing an
>    "UnsupportedOperationException" in the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
>    default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, February 17, 2010 6:47 PM
> To: java-user@lucene.apache.org
> Subject: Re: BM25 Scoring Patch
>
> I tend to agree with you Marvin, you are right, the different scoring
> mechanisms need different information available and this is the problem.
>
> although last I checked, one hard part of BM25 rotates around fields
> versus documents... e.g. BM25's IDF calculation.
>
> but maybe this is just an extreme form of your example :)
>
> On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey
> <mar...@rectangular.com> wrote:
>
> > On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > > yet if we don't do the hard work up front to make it easy to plug in
> > > things like BM25, then no one will implement additional scoring
> > > formulas for Lucene, we currently make it terribly difficult to do
> > > this.
> >
> > FWIW... Similarity and posting format spec are so closely tied that I'm
> > considering linking them in Lucy.
> >
> >     Schema schema = new Schema();
> >     FullTextType bm25Type = new FullTextType(new BM25Similarity());
> >     schema.specField("content", bm25Type);
> >     schema.specField("title", bm25Type);
> >     StringType matchType = new StringType(new MatchSimilarity());
> >     schema.specField("category", matchType);
> >
> > That way, custom scoring implementations can guarantee that they always
> > have the posting information they need available to make their
> > similarity judgments. Similarity also becomes a more generalized
> > notion, with the TF/IDF-specific functionality moving into a subclass.
> >
> > Maybe something similar could be made to work in Lucene. Dunno how
> > McCandless has things set up for spec'ing codecs on the flex branch.
> >
> > Marvin Humphrey
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
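
For illustration only, here is a rough sketch of what item (a) in Yuval's list might look like: keeping two numbers per field (the count of documents containing the field and the average field length) so the average can be updated as documents are added and combined when segments merge. The class name FieldLengthStats and all of its methods are hypothetical; nothing here is an existing Lucene or Lucy API, and it says nothing about how the numbers would be persisted in the index.

```java
/**
 * Hypothetical per-field length statistics for BM25F (not a Lucene class).
 * Holds the two numbers mentioned in the thread: how many documents
 * contain the field, and the average field length over those documents.
 */
final class FieldLengthStats {
    private long docCount;        // documents that contain this field
    private double averageLength; // mean field length, in tokens

    FieldLengthStats() {
        this(0L, 0.0);
    }

    FieldLengthStats(long docCount, double averageLength) {
        if (docCount < 0 || averageLength < 0) {
            throw new IllegalArgumentException("statistics must be non-negative");
        }
        this.docCount = docCount;
        this.averageLength = averageLength;
    }

    /** Folds one newly indexed document into the running average. */
    void addDocument(int fieldLength) {
        if (fieldLength < 0) {
            throw new IllegalArgumentException("fieldLength must be non-negative");
        }
        docCount++;
        // Incremental mean: no need to revisit previously indexed documents.
        averageLength += (fieldLength - averageLength) / docCount;
    }

    /** Combines the statistics of two segments, weighting each by its count. */
    static FieldLengthStats merge(FieldLengthStats a, FieldLengthStats b) {
        long total = a.docCount + b.docCount;
        if (total == 0) {
            return new FieldLengthStats();
        }
        double weighted = a.averageLength * a.docCount
                        + b.averageLength * b.docCount;
        return new FieldLengthStats(total, weighted / total);
    }

    long docCount() {
        return docCount;
    }

    double averageLength() {
        return averageLength;
    }
}
```

Storing the count next to the average is what makes both operations possible: the average alone cannot be updated or merged without knowing how many documents it summarizes. Deletions are not handled in this sketch.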