Hi Ivan,
You shouldn't set the BM25Similarity for indexing or searching.
Please try removing the lines:
writer.setSimilarity(new BM25Similarity());
searcher.setSimilarity(sim);
Please let us/me know if you improve your results with these changes.
Robert Muir wrote:
Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
default Similarity. Perhaps this is just another one?
Again while I have not worked with this particular collection, I looked at
the statistics and noted that it's composed of several 'sub-collections': for
example the PAT documents on disk 3 have an average doc length of 3543, but
the AP documents on disk 1 have an avg doc length of 353.
I have found on other collections that any advantages of BM25's document
length normalization fall apart when 'average document length' doesn't make
a whole lot of sense (cases like this).
For this same reason, I've only found a few collections where BM25's doc
length normalization is really significantly better than Lucene's.
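To make the length-normalization point concrete, here is a minimal sketch of the
standard BM25 term-frequency component. It is not code from the LUCENE-2091
patch: the class name Bm25LengthDemo, the method tfWeight, and the term
frequency of 3 are hypothetical choices for illustration. The average length
(798.30), b (0.5) and k1 (2) are the values Ivan quotes in his message, and 353
and 3543 are the sub-collection averages mentioned above.

```java
// Hypothetical sketch for illustration; not part of the LUCENE-2091 patch.
final class Bm25LengthDemo {

    /** Standard BM25 term-frequency component with document length normalization. */
    static double tfWeight(double tf, double docLen, double avgDocLen, double k1, double b) {
        // Documents longer than the collection average get lengthNorm > 1,
        // which lowers the weight of the same term frequency.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double avgDocLen = 798.30; // collection-wide average from Ivan's parser code
        double k1 = 2.0;
        double b = 0.5;
        double tf = 3.0;           // assumed term frequency, illustrative only

        // Typical AP length, the global average, and typical PAT length.
        double[] docLens = {353, 798.30, 3543};
        for (double len : docLens) {
            System.out.printf("docLen=%.0f weight=%.4f%n",
                    len, tfWeight(tf, len, avgDocLen, k1, b));
        }
    }
}
```

With a single global average, a PAT document of typical length for its own
sub-collection is still treated as a very long document, and an AP document of
typical length as a short one, so the same term frequency is weighted quite
differently in the two.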
In my opinion, the results on a particular test collection or 2 have perhaps
been taken too far and created a myth that BM25 is always superior to
Lucene's scoring... this is not true!
On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:
I applied the Lucene patch mentioned in
https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
on the TREC-3 collection using topics 151-200. I am now getting worse results
compared to Lucene's DefaultSimilarity. I suspect I am not using it
correctly. I have single-field documents. This is the process I use:
1. During the indexing, I am setting the similarity to BM25 as such:
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setSimilarity(new BM25Similarity());
2. During the Precision/Recall measurements, I am using a
SimpleBM25QQParser extension I added to the benchmark:
QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
3. Here is the parser code (I set an avg doc length here):
    public Query parse(QualityQuery qq) throws ParseException {
        BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
        BM25Parameters.setB(0.5f); // tried default values
        BM25Parameters.setK1(2f);
        return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    }
4. The searcher is using BM25 similarity:
Searcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(sim);
Am I missing some steps? Does anyone have experience with this code?
Thanks,
Ivan
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 89 19
Fax +34 91 398 65 35
Office 2.11
Email: joaquin.pe...@lsi.uned.es
web: http://nlp.uned.es/~jperezi/
-----------------------------------------------------------