By the end of the week, I will publish the results once we run the experiments on a full collection. Are you talking about the bias caused by using a sub-collection?
Thanks, Ivan --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote: > From: Robert Muir <rcm...@gmail.com> > Subject: Re: BM25 Scoring Patch > To: java-user@lucene.apache.org > Date: Tuesday, February 16, 2010, 2:11 PM > Ivan, ok. it would be cool if you can > list the map and bpref for the > different approaches you try (default lucene, lnb.ltc, > bm25), with or > without stemming. > > as you reported previously you got a 24% improvement with > lnb.btc (right?) I > am guessing that we won't be able to draw many conclusions > at all due to > bias. > > On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <iprov...@yahoo.com> > wrote: > > > Robert, Joaquin, > > > > Sorry, I made an error reporting the results. > The preliminary improvement > > is around 21% (it's a reduced collection). I > will have to run another test > > to get the final numbers on the complete collection. > > > > We are planning to also apply the stemming. > Right now we are trying to > > isolate each improvement experiment. > > > > Thanks, > > > > Ivan > > > > > > > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> > wrote: > > > > > From: Robert Muir <rcm...@gmail.com> > > > Subject: Re: BM25 Scoring Patch > > > To: java-user@lucene.apache.org > > > Date: Tuesday, February 16, 2010, 1:14 PM > > > Ivan just a little more food for > > > thought to help you with this: > > > > > > I'm glad you got improved results, yet I stand by > my > > > original statement of > > > 'be careful' interpreting too much from one > collection. > > > > > > eg. had you chosen TREC-4 instead of TREC-3, you > would see > > > different > > > results, as vector-space with non-cosine doc > length norm > > > (LUCENE-2187) > > > performed better than BM25 there: > > > http://trec.nist.gov/pubs/trec4/overview.ps.gz > > > > > > in truth its hard to 'reuse' a pooled test > collection to > > > compare methods > > > that were not part of the pool: > > > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf > > > > > > This might help explain why you see such a > difference in > > > MAP score! > > > > > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov > <iprov...@yahoo.com> > > > wrote: > > > > > > > Joaquin, Robert, > > > > > > > > I followed Joaquin's recommendation and > removed the > > > call to set similarity > > > > to BM25 explicitly (indexer, > searcher). The > > > results showed 55% improvement > > > > for the MAP score (0.141->0.219) over > default > > > similarity. > > > > > > > > Joaquin, how would setting the similarity to > BM25 > > > explicitly make the score > > > > worse? > > > > > > > > Thank you, > > > > > > > > Ivan > > > > > > > > > > > > > > > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> > > > wrote: > > > > > > > > > From: Robert Muir <rcm...@gmail.com> > > > > > Subject: Re: BM25 Scoring Patch > > > > > To: java-user@lucene.apache.org > > > > > Date: Tuesday, February 16, 2010, 11:36 > AM > > > > > yes Ivan, if possible please report > > > > > back any findings you can on the > > > > > experiments you are doing! > > > > > > > > > > On Tue, Feb 16, 2010 at 11:22 AM, > Joaquin Perez > > > Iglesias > > > > > < > > > > > joaquin.pe...@lsi.uned.es> > > > > > wrote: > > > > > > > > > > > Hi Ivan, > > > > > > > > > > > > You shouldn't set the > BM25Similarity for > > > indexing or > > > > > searching. > > > > > > Please try removing the lines: > > > > > > > writer.setSimilarity(new > > > > > BM25Similarity()); > > > > > > > > > > searcher.setSimilarity(sim); > > > > > > > > > > > > Please let us/me know if you > improve your > > > results with > > > > > these changes. > > > > > > > > > > > > > > > > > > Robert Muir escribió: > > > > > > > > > > > > Hi Ivan, I've seen many > cases where > > > BM25 > > > > > performs worse than Lucene's > > > > > >> default Similarity. Perhaps > this is just > > > another > > > > > one? > > > > > >> > > > > > >> Again while I have not worked > with this > > > particular > > > > > collection, I looked at > > > > > >> the statistics and noted that > its > > > composed of > > > > > several 'sub-collections': > > > > > >> for > > > > > >> example the PAT documents on > disk 3 have > > > an > > > > > average doc length of 3543, > > > > > >> but > > > > > >> the AP documents on disk 1 > have an avg > > > doc length > > > > > of 353. > > > > > >> > > > > > >> I have found on other > collections that > > > any > > > > > advantages of BM25's document > > > > > >> length normalization fall > apart when > > > 'average > > > > > document length' doesn't > > > > > >> make > > > > > >> a whole lot of sense (cases > like this). > > > > > >> > > > > > >> For this same reason, I've > only found a > > > few > > > > > collections where BM25's doc > > > > > >> length normalization is > really > > > significantly > > > > > better than Lucene's. > > > > > >> > > > > > >> In my opinion, the results on > a > > > particular test > > > > > collection or 2 have > > > > > >> perhaps > > > > > >> been taken too far and created > a myth > > > that BM25 is > > > > > always superior to > > > > > >> Lucene's scoring... this is > not true! > > > > > >> > > > > > >> On Tue, Feb 16, 2010 at 9:46 > AM, Ivan > > > Provalov > > > > > <iprov...@yahoo.com> > > > > > >> wrote: > > > > > >> > > > > > >> I applied the Lucene > patch > > > mentioned in > > > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and > > > > > ran the MAP > > > > > >>> numbers > > > > > >>> on TREC-3 collection using > topics > > > > > 151-200. I am not getting worse > > > > > >>> results > > > > > >>> comparing to Lucene > > > DefaultSimilarity. I > > > > > suspect, I am not using it > > > > > >>> correctly. I have > single > > > field > > > > > documents. This is the process I > use: > > > > > >>> > > > > > >>> 1. During the indexing, I > am setting > > > the > > > > > similarity to BM25 as such: > > > > > >>> > > > > > >>> IndexWriter writer = new > > > IndexWriter(dir, new > > > > > StandardAnalyzer( > > > > > >>> > > > > > Version.LUCENE_CURRENT), > true, > > > > > >>> > > > > > > > > IndexWriter.MaxFieldLength.UNLIMITED); > > > > > >>> writer.setSimilarity(new > > > BM25Similarity()); > > > > > >>> > > > > > >>> 2. During the > Precision/Recall > > > measurements, I > > > > > am using a > > > > > >>> SimpleBM25QQParser > extension I added > > > to the > > > > > benchmark: > > > > > >>> > > > > > >>> QualityQueryParser > qqParser = new > > > > > SimpleBM25QQParser("title", "TEXT"); > > > > > >>> > > > > > >>> > > > > > >>> 3. Here is the parser code > (I set an > > > avg doc > > > > > length here): > > > > > >>> > > > > > >>> public Query > parse(QualityQuery qq) > > > throws > > > > > ParseException { > > > > > > > > > >>> BM25Parameters.setAverageLength(indexField, > > > > > 798.30f);//avg doc length > > > > > > > > > >>> BM25Parameters.setB(0.5f);//tried > > > > > default values > > > > > > > > > >>> BM25Parameters.setK1(2f); > > > > > >>> return > query = new > > > > > BM25BooleanQuery(qq.getValue(qqName), > > > indexField, > > > > > >>> new > > > > > >>> > > > StandardAnalyzer(Version.LUCENE_CURRENT)); > > > > > >>> } > > > > > >>> > > > > > >>> 4. The searcher is using > BM25 > > > similarity: > > > > > >>> > > > > > >>> Searcher searcher = new > > > IndexSearcher(dir, > > > > > true); > > > > > >>> > searcher.setSimilarity(sim); > > > > > >>> > > > > > >>> Am I missing some > steps? Does > > > anyone > > > > > have experience with this code? > > > > > >>> > > > > > >>> Thanks, > > > > > >>> > > > > > >>> Ivan > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > > > > > --------------------------------------------------------------------- > > > > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > > >>> For additional commands, > e-mail: > > java-user-h...@lucene.apache.org > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >> > > > > > >> > > > > > > -- > > > > > > > > > > > > > > > ----------------------------------------------------------- > > > > > > Joaquín Pérez Iglesias > > > > > > Dpto. Lenguajes y Sistemas > Informáticos > > > > > > E.T.S.I. Informática (UNED) > > > > > > Ciudad Universitaria > > > > > > C/ Juan del Rosal nº 16 > > > > > > 28040 Madrid - Spain > > > > > > Phone. +34 91 398 89 19 > > > > > > Fax +34 91 398 65 35 > > > > > > Office 2.11 > > > > > > Email: joaquin.pe...@lsi.uned.es > > > > > > web: http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/>< > > http://nlp.uned.es/%7Ejperezi/> < > > > > http://nlp.uned.es/%7Ejperezi/> > > > > > > > > > > > > > > > ----------------------------------------------------------- > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > > > For additional commands, e-mail: > java-user-h...@lucene.apache.org > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Robert Muir > > > > > rcm...@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > > > > > > > -- > > > Robert Muir > > > rcm...@gmail.com > > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > -- > Robert Muir > rcm...@gmail.com > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org