Re: BM25 Scoring Patch

Joaquin Perez Iglesias Tue, 16 Feb 2010 08:26:43 -0800

Hi Ivan,

You shouldn't set the BM25Similarity for indexing or searching.
Please try removing the lines:
   writer.setSimilarity(new BM25Similarity());
   searcher.setSimilarity(sim);


Please let us/me know if you improve your results with these changes.


Robert Muir escribió:

Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
default Similarity. Perhaps this is just another one?

Again while I have not worked with this particular collection, I looked at
the statistics and noted that its composed of several 'sub-collections': for
example the PAT documents on disk 3 have an average doc length of 3543, but
the AP documents on disk 1 have an avg doc length of 353.

I have found on other collections that any advantages of BM25's document
length normalization fall apart when 'average document length' doesn't make
a whole lot of sense (cases like this).

For this same reason, I've only found a few collections where BM25's doc
length normalization is really significantly better than Lucene's.

In my opinion, the results on a particular test collection or 2 have perhaps
been taken too far and created a myth that BM25 is always superior to
Lucene's scoring... this is not true!

On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:

I applied the Lucene patch mentioned in
https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
on TREC-3 collection using topics 151-200.  I am not getting worse results
comparing to Lucene DefaultSimilarity.  I suspect, I am not using it
correctly.  I have single field documents.  This is the process I use:

1. During the indexing, I am setting the similarity to BM25 as such:

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
               Version.LUCENE_CURRENT), true,
               IndexWriter.MaxFieldLength.UNLIMITED);
writer.setSimilarity(new BM25Similarity());

2. During the Precision/Recall measurements, I am using a
SimpleBM25QQParser extension I added to the benchmark:

QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");


3. Here is the parser code (I set an avg doc length here):

public Query parse(QualityQuery qq) throws ParseException {
   BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
   BM25Parameters.setB(0.5f);//tried default values
   BM25Parameters.setK1(2f);
   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField, new
StandardAnalyzer(Version.LUCENE_CURRENT));
}

4. The searcher is using BM25 similarity:

Searcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(sim);

Am I missing some steps?  Does anyone have experience with this code?

Thanks,

Ivan




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 89 19
Fax    +34 91 398 65 35
Office  2.11
Email: joaquin.pe...@lsi.uned.es
web:   http://nlp.uned.es/~jperezi/
-----------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: BM25 Scoring Patch

Reply via email to