Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's default Similarity. Perhaps this is just another one?
Again, while I have not worked with this particular collection, I looked at the statistics and noted that it's composed of several 'sub-collections': for example, the PAT documents on disk 3 have an average doc length of 3543, but the AP documents on disk 1 have an average doc length of 353. I have found on other collections that any advantage of BM25's document length normalization falls apart when 'average document length' doesn't make a whole lot of sense (cases like this).

For this same reason, I've only found a few collections where BM25's doc length normalization is really significantly better than Lucene's.

In my opinion, the results on a particular test collection or two have perhaps been taken too far and created a myth that BM25 is always superior to Lucene's scoring... this is not true!

On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:

> I applied the Lucene patch mentioned in
> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
> on TREC-3 collection using topics 151-200. I am not getting worse results
> comparing to Lucene DefaultSimilarity. I suspect, I am not using it
> correctly. I have single field documents. This is the process I use:
>
> 1. During the indexing, I am setting the similarity to BM25 as such:
>
> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>         Version.LUCENE_CURRENT), true,
>         IndexWriter.MaxFieldLength.UNLIMITED);
> writer.setSimilarity(new BM25Similarity());
>
> 2. During the Precision/Recall measurements, I am using a
> SimpleBM25QQParser extension I added to the benchmark:
>
> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>
> 3. Here is the parser code (I set an avg doc length here):
>
> public Query parse(QualityQuery qq) throws ParseException {
>     BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
>     BM25Parameters.setB(0.5f);//tried default values
>     BM25Parameters.setK1(2f);
>     return query = new BM25BooleanQuery(qq.getValue(qqName), indexField, new
>         StandardAnalyzer(Version.LUCENE_CURRENT));
> }
>
> 4. The searcher is using BM25 similarity:
>
> Searcher searcher = new IndexSearcher(dir, true);
> searcher.setSimilarity(sim);
>
> Am I missing some steps? Does anyone have experience with this code?
>
> Thanks,
>
> Ivan

--
Robert Muir
rcm...@gmail.com
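
P.S. To make the point about average document length a bit more concrete, here is a minimal sketch of the textbook Okapi BM25 term-frequency component. It is not the code from the LUCENE-2091 patch; the class name Bm25LengthNorm and the method tfComponent are made up for illustration. The parameters k1 = 2, b = 0.5 and the average length 798.30 are the values from the quoted parser code, and 353 and 3543 are the AP and PAT averages mentioned above.

```java
// Illustrative only: textbook BM25 term-frequency component,
// not the implementation in the LUCENE-2091 patch.
final class Bm25LengthNorm {

    // tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgLen))
    static double tfComponent(double tf, double docLen, double avgLen,
                              double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;        // from the quoted parser code
        double b = 0.5;         // from the quoted parser code
        double avgLen = 798.30; // collection-wide average from the quoted code

        // Typical lengths of the two sub-collections mentioned in the reply.
        double apLen = 353.0;   // AP documents, disk 1
        double patLen = 3543.0; // PAT documents, disk 3

        // Same term frequency, very different contribution, purely because
        // each document is compared against a single global average.
        System.out.println("AP-like doc:  " + tfComponent(1, apLen, avgLen, k1, b));
        System.out.println("PAT-like doc: " + tfComponent(1, patLen, avgLen, k1, b));
        System.out.println("avg-len doc:  " + tfComponent(1, avgLen, avgLen, k1, b));
    }
}
```

With a single global average, a typical AP document is always treated as "short" and a typical PAT document as "long", even though each is perfectly ordinary within its own sub-collection.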