RE: BM25 Scoring Patch

Yuval Feinstein Thu, 18 Feb 2010 00:45:56 -0800

We could solve this by saying we only incorporate BM25F into Lucene.
This is a field-based scoring method, so it saves us the need to deal with 
documents.
Building on Joaquin's work, the extra parts needed IMO are:
a. Support for storing average length per field during indexing. I think I saw 
some reference to this
when Grant described the new features in Lucene 2.9. We need to store two 
numbers (say
number of documents containing the field and average length) to support 
incremental indexing.
b. Easy integration of BM25F similarity - default parameter values, working 
with regular Lucene class hierarchy.
c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We could 
do this incrementally,
throwing an "UnsupportedOperationException" in the meantime).
d. Some work on run-time efficiency, to be near the efficiency of the default 
scoring.
I could do some of this work myself, but guidance from a Lucene scoring guru 
would be a great help.
Thanks,
Yuval


-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Wednesday, February 17, 2010 6:47 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

I tend to agree with you Marvin, you are right, the different scoring
mechanisms need different information available and this is the problem.

although last I checked, one hard part of BM25 rotates around fields versus
documents... e.g. BM25's IDF calculation.

but maybe this is just an extreme form of your example :)

On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <mar...@rectangular.com>wrote:

> On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > yet if we don't do the hard work up front to make it easy to plug in
> things
> > like BM25, then no one will implement additional scoring formulas for
> > Lucene, we currently make it terribly difficult to do this.
>
> FWIW... Similarity and posting format spec are so closely tied that I'm
> considering linking them in Lucy.
>
>  Schema schema = new Schema();
>  FullTextType bm25Type = new FullTextType(new BM25Similarity());
>  schema.specField("content", bm25Type);
>  schema.specField("title", bm25Type);
>  StringType matchType = new StringType(new MatchSimilarity());
>  schema.specField("category", matchType);
>
> That way, custom scoring implementations can guarantee that they always
> have
> the posting information they need available to make their similarity
> judgments.  Similarity also becomes a more generalized notion, with the
> TF/IDF-specific functionality moving into a subclass.
>
> Maybe something similar could be made to work in Lucene.  Dunno how
> McCandless
> has things set up for spec'ing codecs on the flex branch.
>
> Marvin Humphrey
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com

RE: BM25 Scoring Patch

Reply via email to