Re: BM25 Scoring Patch

Robert Muir Tue, 16 Feb 2010 08:19:00 -0800

Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
default Similarity. Perhaps this is just another one?

Again, while I have not worked with this particular collection, I looked at
the statistics and noted that it's composed of several 'sub-collections': for
example, the PAT documents on disk 3 have an average doc length of 3543, but
the AP documents on disk 1 have an avg doc length of 353.

I have found on other collections that any advantages of BM25's document
length normalization fall apart when 'average document length' doesn't make
a whole lot of sense (cases like this).
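
To make that concrete, here is a rough sketch of the textbook BM25
term-frequency weight with its length normalization. It is not taken from the
LUCENE-2091 patch, and the class and method names are made up for
illustration. It plugs in the k1 = 2, b = 0.5 and 798.30 average length from
Ivan's parser code, and treats the two sub-collection averages above as if
they were the lengths of two individual documents:

```java
public class Bm25LengthNormSketch {

    // Textbook BM25 length normalization: 1 - b + b * (docLen / avgDocLen).
    // Equals 1.0 for a document of exactly average length.
    public static double lengthNorm(double docLen, double avgDocLen, double b) {
        return 1.0 - b + b * (docLen / avgDocLen);
    }

    // Textbook BM25 term-frequency component (IDF left out on purpose):
    // tf * (k1 + 1) / (tf + k1 * norm).
    public static double tfWeight(double tf, double k1, double norm) {
        return tf * (k1 + 1.0) / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        double avgDocLen = 798.30; // single collection-wide average from the parser code
        double b = 0.5;
        double k1 = 2.0;
        double tf = 1.0;           // same term frequency in both documents

        double apLen = 353.0;      // typical AP document length (disk 1)
        double patLen = 3543.0;    // typical PAT document length (disk 3)

        double apWeight = tfWeight(tf, k1, lengthNorm(apLen, avgDocLen, b));
        double patWeight = tfWeight(tf, k1, lengthNorm(patLen, avgDocLen, b));

        System.out.printf("AP-sized doc weight:  %.4f%n", apWeight);
        System.out.printf("PAT-sized doc weight: %.4f%n", patWeight);
    }
}
```

With a single average sitting between the two sub-collections, a typical AP
document is always treated as "short" and a typical PAT document as "long",
so the same term frequency is weighted quite differently depending only on
which sub-collection the document came from.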

For this same reason, I've only found a few collections where BM25's doc
length normalization is really significantly better than Lucene's.

In my opinion, the results on a particular test collection or two have perhaps
been taken too far and have created a myth that BM25 is always superior to
Lucene's scoring... this is not true!

On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:

> I applied the Lucene patch mentioned in
> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
> on the TREC-3 collection using topics 151-200.  I am getting worse results
> compared to Lucene's DefaultSimilarity.  I suspect I am not using it
> correctly.  I have single-field documents.  This is the process I use:
>
> 1. During the indexing, I am setting the similarity to BM25 as such:
>
> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>                Version.LUCENE_CURRENT), true,
>                IndexWriter.MaxFieldLength.UNLIMITED);
> writer.setSimilarity(new BM25Similarity());
>
> 2. During the Precision/Recall measurements, I am using a
> SimpleBM25QQParser extension I added to the benchmark:
>
> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>
>
> 3. Here is the parser code (I set an avg doc length here):
>
> public Query parse(QualityQuery qq) throws ParseException {
>    BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
>    BM25Parameters.setB(0.5f); // tried default values
>    BM25Parameters.setK1(2f);
>    return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>            new StandardAnalyzer(Version.LUCENE_CURRENT));
> }
>
> 4. The searcher is using BM25 similarity:
>
> Searcher searcher = new IndexSearcher(dir, true);
> searcher.setSimilarity(sim);
>
> Am I missing some steps?  Does anyone have experience with this code?
>
> Thanks,
>
> Ivan

-- 
Robert Muir
rcm...@gmail.com
