Joaquin, Robert, I followed Joaquin's recommendation and removed the call to set similarity to BM25 explicitly (indexer, searcher). The results showed 55% improvement for the MAP score (0.141->0.219) over default similarity.
Joaquin, how would setting the similarity to BM25 explicitly make the score worse? Thank you, Ivan --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote: > From: Robert Muir <rcm...@gmail.com> > Subject: Re: BM25 Scoring Patch > To: java-user@lucene.apache.org > Date: Tuesday, February 16, 2010, 11:36 AM > yes Ivan, if possible please report > back any findings you can on the > experiments you are doing! > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias > < > joaquin.pe...@lsi.uned.es> > wrote: > > > Hi Ivan, > > > > You shouldn't set the BM25Similarity for indexing or > searching. > > Please try removing the lines: > > writer.setSimilarity(new > BM25Similarity()); > > searcher.setSimilarity(sim); > > > > Please let us/me know if you improve your results with > these changes. > > > > > > Robert Muir escribió: > > > > Hi Ivan, I've seen many cases where BM25 > performs worse than Lucene's > >> default Similarity. Perhaps this is just another > one? > >> > >> Again while I have not worked with this particular > collection, I looked at > >> the statistics and noted that its composed of > several 'sub-collections': > >> for > >> example the PAT documents on disk 3 have an > average doc length of 3543, > >> but > >> the AP documents on disk 1 have an avg doc length > of 353. > >> > >> I have found on other collections that any > advantages of BM25's document > >> length normalization fall apart when 'average > document length' doesn't > >> make > >> a whole lot of sense (cases like this). > >> > >> For this same reason, I've only found a few > collections where BM25's doc > >> length normalization is really significantly > better than Lucene's. > >> > >> In my opinion, the results on a particular test > collection or 2 have > >> perhaps > >> been taken too far and created a myth that BM25 is > always superior to > >> Lucene's scoring... this is not true! > >> > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov > <iprov...@yahoo.com> > >> wrote: > >> > >> I applied the Lucene patch mentioned in > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and > ran the MAP > >>> numbers > >>> on TREC-3 collection using topics > 151-200. I am not getting worse > >>> results > >>> comparing to Lucene DefaultSimilarity. I > suspect, I am not using it > >>> correctly. I have single field > documents. This is the process I use: > >>> > >>> 1. During the indexing, I am setting the > similarity to BM25 as such: > >>> > >>> IndexWriter writer = new IndexWriter(dir, new > StandardAnalyzer( > >>> > Version.LUCENE_CURRENT), true, > >>> > IndexWriter.MaxFieldLength.UNLIMITED); > >>> writer.setSimilarity(new BM25Similarity()); > >>> > >>> 2. During the Precision/Recall measurements, I > am using a > >>> SimpleBM25QQParser extension I added to the > benchmark: > >>> > >>> QualityQueryParser qqParser = new > SimpleBM25QQParser("title", "TEXT"); > >>> > >>> > >>> 3. Here is the parser code (I set an avg doc > length here): > >>> > >>> public Query parse(QualityQuery qq) throws > ParseException { > >>> BM25Parameters.setAverageLength(indexField, > 798.30f);//avg doc length > >>> BM25Parameters.setB(0.5f);//tried > default values > >>> BM25Parameters.setK1(2f); > >>> return query = new > BM25BooleanQuery(qq.getValue(qqName), indexField, > >>> new > >>> StandardAnalyzer(Version.LUCENE_CURRENT)); > >>> } > >>> > >>> 4. The searcher is using BM25 similarity: > >>> > >>> Searcher searcher = new IndexSearcher(dir, > true); > >>> searcher.setSimilarity(sim); > >>> > >>> Am I missing some steps? Does anyone > have experience with this code? > >>> > >>> Thanks, > >>> > >>> Ivan > >>> > >>> > >>> > >>> > >>> > --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org > >>> > >>> > >>> > >> > >> > > -- > > > ----------------------------------------------------------- > > Joaquín Pérez Iglesias > > Dpto. Lenguajes y Sistemas Informáticos > > E.T.S.I. Informática (UNED) > > Ciudad Universitaria > > C/ Juan del Rosal nº 16 > > 28040 Madrid - Spain > > Phone. +34 91 398 89 19 > > Fax +34 91 398 65 35 > > Office 2.11 > > Email: joaquin.pe...@lsi.uned.es > > web: http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/> > > > ----------------------------------------------------------- > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > -- > Robert Muir > rcm...@gmail.com > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org