Yuval, don't we still need this 'document-level IDF' for BM25F?
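
To make the question concrete, here is a minimal sketch, assuming the usual simple BM25F formulation: per-field term frequencies are length-normalized, weighted, and summed into one pseudo-frequency, but the IDF factor still comes from the number of documents containing the term in any field. The class, method, and parameter names are hypothetical and not Lucene API; the IDF shown is one common non-negative variant, chosen here as an assumption.

```java
public class Bm25fSketch {

    /**
     * Document-level IDF. docFreq is assumed to count documents that contain
     * the term in ANY field, which is exactly the statistic a purely
     * per-field index does not hand us directly.
     */
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Scores one term against one document.
     *
     * @param tf      term frequency in each field of this document
     * @param len     length of each field in this document
     * @param avgLen  average length of each field over the collection
     * @param boost   per-field weights (hypothetical tuning parameters)
     * @param b       per-field length-normalization parameters in [0, 1]
     * @param k1      saturation parameter
     */
    static double score(double[] tf, double[] len, double[] avgLen,
                        double[] boost, double[] b, double k1,
                        long docFreq, long numDocs) {
        double pseudoTf = 0.0;
        for (int f = 0; f < tf.length; f++) {
            // Per-field length normalization, then weighted sum across fields.
            double norm = 1.0 - b[f] + b[f] * (len[f] / avgLen[f]);
            pseudoTf += boost[f] * tf[f] / norm;
        }
        // Saturation is applied once, after combining the fields.
        return idf(docFreq, numDocs) * pseudoTf / (k1 + pseudoTf);
    }
}
```

Note that only the term-frequency side is field-based; `docFreq` and `numDocs` are per-document statistics.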

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yuv...@answers.com> wrote:

> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal with
> documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think I
>    saw some reference to this when Grant described the new features in
>    Lucene 2.9. We need to store two numbers (say number of documents
>    containing the field and average length) to support incremental
>    indexing.
> b. Easy integration of BM25F similarity - default parameter values, working
>    with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We
>    could do this incrementally, throwing an "UnsupportedOperationException"
>    in the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
>    default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
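
On point (a), a minimal sketch of how the two stored numbers could support incremental indexing, assuming nothing about how Lucene would actually persist them. `FieldLengthStats` and its methods are hypothetical names, and deletions are not handled here.

```java
public class FieldLengthStats {
    private long docCount;     // number of documents containing the field
    private double avgLength;  // average field length over those documents

    /** Folds one more document's field length into the running average. */
    public void add(int fieldLength) {
        docCount++;
        avgLength += (fieldLength - avgLength) / docCount;
    }

    /** Combines stats from another batch, e.g. when two segments are merged. */
    public void merge(FieldLengthStats other) {
        long total = docCount + other.docCount;
        if (total == 0) {
            return; // nothing indexed on either side
        }
        avgLength = (avgLength * docCount + other.avgLength * other.docCount) / total;
        docCount = total;
    }

    public long getDocCount() {
        return docCount;
    }

    public double getAvgLength() {
        return avgLength;
    }
}
```

The count is what makes the average updatable: with the average alone, there is no way to weight a new document or a merged batch correctly.
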
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, February 17, 2010 6:47 PM
> To: java-user@lucene.apache.org
> Subject: Re: BM25 Scoring Patch
>
> I tend to agree with you Marvin, you are right, the different scoring
> mechanisms need different information available and this is the problem.
>
> although last I checked, one hard part of BM25 revolves around fields versus
> documents... e.g. BM25's IDF calculation.
>
> but maybe this is just an extreme form of your example :)
>
> On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <mar...@rectangular.com> wrote:
>
> > On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > > yet if we don't do the hard work up front to make it easy to plug in
> > > things like BM25, then no one will implement additional scoring
> > > formulas for Lucene, we currently make it terribly difficult to do this.
> >
> > FWIW... Similarity and posting format spec are so closely tied that I'm
> > considering linking them in Lucy.
> >
> >  Schema schema = new Schema();
> >  FullTextType bm25Type = new FullTextType(new BM25Similarity());
> >  schema.specField("content", bm25Type);
> >  schema.specField("title", bm25Type);
> >  StringType matchType = new StringType(new MatchSimilarity());
> >  schema.specField("category", matchType);
> >
> > That way, custom scoring implementations can guarantee that they always
> > have the posting information they need available to make their similarity
> > judgments.  Similarity also becomes a more generalized notion, with the
> > TF/IDF-specific functionality moving into a subclass.
> >
> > Maybe something similar could be made to work in Lucene.  Dunno how
> > McCandless has things set up for spec'ing codecs on the flex branch.
> >
> > Marvin Humphrey
> >
> >
> >
> >
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
