Robert, Joaquin,

Sorry, I made an error reporting the results. The preliminary improvement is around 21% (it's a reduced collection). I will have to run another test to get the final numbers on the complete collection.
We are also planning to apply stemming. Right now we are trying to isolate each improvement experiment.

Thanks,

Ivan

--- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:

> From: Robert Muir <rcm...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 1:14 PM
>
> Ivan, just a little more food for thought to help you with this:
>
> I'm glad you got improved results, yet I stand by my original statement of
> 'be careful' interpreting too much from one collection.
>
> e.g. had you chosen TREC-4 instead of TREC-3, you would see different
> results, as vector-space with non-cosine doc length norm (LUCENE-2187)
> performed better than BM25 there:
> http://trec.nist.gov/pubs/trec4/overview.ps.gz
>
> In truth it's hard to 'reuse' a pooled test collection to compare methods
> that were not part of the pool:
> http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
>
> This might help explain why you see such a difference in MAP score!
>
> On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <iprov...@yahoo.com> wrote:
>
> > Joaquin, Robert,
> >
> > I followed Joaquin's recommendation and removed the call to set similarity
> > to BM25 explicitly (indexer, searcher). The results showed 55% improvement
> > for the MAP score (0.141->0.219) over default similarity.
> >
> > Joaquin, how would setting the similarity to BM25 explicitly make the score
> > worse?
> >
> > Thank you,
> >
> > Ivan
> >
> > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > From: Robert Muir <rcm...@gmail.com>
> > > Subject: Re: BM25 Scoring Patch
> > > To: java-user@lucene.apache.org
> > > Date: Tuesday, February 16, 2010, 11:36 AM
> > >
> > > Yes Ivan, if possible please report back any findings you can on the
> > > experiments you are doing!
> > >
> > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > > <joaquin.pe...@lsi.uned.es> wrote:
> > >
> > > > Hi Ivan,
> > > >
> > > > You shouldn't set the BM25Similarity for indexing or searching.
> > > > Please try removing the lines:
> > > >
> > > >     writer.setSimilarity(new BM25Similarity());
> > > >     searcher.setSimilarity(sim);
> > > >
> > > > Please let us/me know if you improve your results with these changes.
> > > >
> > > > Robert Muir wrote:
> > > >
> > > >> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
> > > >> default Similarity. Perhaps this is just another one?
> > > >>
> > > >> Again, while I have not worked with this particular collection, I looked at
> > > >> the statistics and noted that it's composed of several 'sub-collections':
> > > >> for example the PAT documents on disk 3 have an average doc length of 3543,
> > > >> but the AP documents on disk 1 have an avg doc length of 353.
> > > >>
> > > >> I have found on other collections that any advantages of BM25's document
> > > >> length normalization fall apart when 'average document length' doesn't
> > > >> make a whole lot of sense (cases like this).
> > > >>
> > > >> For this same reason, I've only found a few collections where BM25's doc
> > > >> length normalization is really significantly better than Lucene's.
> > > >>
> > > >> In my opinion, the results on a particular test collection or 2 have
> > > >> perhaps been taken too far and created a myth that BM25 is always superior
> > > >> to Lucene's scoring... this is not true!
> > > >>
> > > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:
> > > >>
> > > >>> I applied the Lucene patch mentioned in
> > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
> > > >>> on the TREC-3 collection using topics 151-200. I am now getting worse
> > > >>> results compared to Lucene DefaultSimilarity. I suspect I am not using it
> > > >>> correctly. I have single-field documents. This is the process I use:
> > > >>>
> > > >>> 1. During the indexing, I am setting the similarity to BM25 as such:
> > > >>>
> > > >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
> > > >>>     Version.LUCENE_CURRENT), true,
> > > >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> > > >>> writer.setSimilarity(new BM25Similarity());
> > > >>>
> > > >>> 2. During the Precision/Recall measurements, I am using a
> > > >>> SimpleBM25QQParser extension I added to the benchmark:
> > > >>>
> > > >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> > > >>>
> > > >>> 3. Here is the parser code (I set an avg doc length here):
> > > >>>
> > > >>> public Query parse(QualityQuery qq) throws ParseException {
> > > >>>   BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
> > > >>>   BM25Parameters.setB(0.5f);//tried default values
> > > >>>   BM25Parameters.setK1(2f);
> > > >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> > > >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> > > >>> }
> > > >>>
> > > >>> 4. The searcher is using BM25 similarity:
> > > >>>
> > > >>> Searcher searcher = new IndexSearcher(dir, true);
> > > >>> searcher.setSimilarity(sim);
> > > >>>
> > > >>> Am I missing some steps? Does anyone have experience with this code?
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Ivan
> > > >>>
> > > >>> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > > --
> > > > -----------------------------------------------------------
> > > > Joaquín Pérez Iglesias
> > > > Dpto. Lenguajes y Sistemas Informáticos
> > > > E.T.S.I. Informática (UNED)
> > > > Ciudad Universitaria
> > > > C/ Juan del Rosal nº 16
> > > > 28040 Madrid - Spain
> > > > Phone. +34 91 398 89 19
> > > > Fax +34 91 398 65 35
> > > > Office 2.11
> > > > Email: joaquin.pe...@lsi.uned.es
> > > > web: http://nlp.uned.es/~jperezi/
> > > > -----------------------------------------------------------
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
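
Robert's point about 'average document length' can be made concrete with a small calculation. The sketch below is not taken from the LUCENE-2091 patch and may not match what that patch computes. It assumes the textbook Okapi BM25 term-frequency component, tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl)), with the IDF factor left out, and the names Bm25LengthNorm and tfComponent are made up for illustration. The inputs are the k1=2, b=0.5 and average length 798.30 from Ivan's parser code, plus the two sub-collection averages Robert cites (353 for AP, 3543 for PAT).

```java
// Hypothetical helper for illustration only; not part of Lucene or the
// LUCENE-2091 patch. Assumes the textbook Okapi BM25 formula.
class Bm25LengthNorm {

    /**
     * Term-frequency component of textbook BM25 (IDF omitted):
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        // Length normalization: 1.0 for an average-length document,
        // below 1.0 for shorter documents, above 1.0 for longer ones.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Parameters taken from the parser code quoted in the thread.
        double k1 = 2.0;
        double b = 0.5;
        double avgDocLen = 798.30;

        // Sub-collection averages from the thread (AP, PAT) and the
        // collection-wide average itself for comparison.
        double[] docLens = {353.0, 798.30, 3543.0};
        for (double docLen : docLens) {
            System.out.printf("docLen=%.1f tf=1 -> %.4f%n",
                    docLen, tfComponent(1.0, docLen, avgDocLen, k1, b));
        }
    }
}
```

Under these assumptions a single term occurrence contributes roughly 1.23 in a typical AP-length document and roughly 0.47 in a typical PAT-length one, because neither sub-collection sits near the collection-wide average. That is the situation Robert describes, where the average stops being representative of any actual group of documents.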