cool! i saw you were using StandardAnalyzer too, maybe you want to try using stemming also (as this analyzer does not do stemming)... usually helps.
On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <iprov...@yahoo.com> wrote: > Joaquin, Robert, > > I followed Joaquin's recommendation and removed the call to set similarity > to BM25 explicitly (indexer, searcher). The results showed 55% improvement > for the MAP score (0.141->0.219) over default similarity. > > Joaquin, how would setting the similarity to BM25 explicitly make the score > worse? > > Thank you, > > Ivan > > > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote: > > > From: Robert Muir <rcm...@gmail.com> > > Subject: Re: BM25 Scoring Patch > > To: java-user@lucene.apache.org > > Date: Tuesday, February 16, 2010, 11:36 AM > > yes Ivan, if possible please report > > back any findings you can on the > > experiments you are doing! > > > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias > > < > > joaquin.pe...@lsi.uned.es> > > wrote: > > > > > Hi Ivan, > > > > > > You shouldn't set the BM25Similarity for indexing or > > searching. > > > Please try removing the lines: > > > writer.setSimilarity(new > > BM25Similarity()); > > > searcher.setSimilarity(sim); > > > > > > Please let us/me know if you improve your results with > > these changes. > > > > > > > > > Robert Muir escribió: > > > > > > Hi Ivan, I've seen many cases where BM25 > > performs worse than Lucene's > > >> default Similarity. Perhaps this is just another > > one? > > >> > > >> Again while I have not worked with this particular > > collection, I looked at > > >> the statistics and noted that its composed of > > several 'sub-collections': > > >> for > > >> example the PAT documents on disk 3 have an > > average doc length of 3543, > > >> but > > >> the AP documents on disk 1 have an avg doc length > > of 353. > > >> > > >> I have found on other collections that any > > advantages of BM25's document > > >> length normalization fall apart when 'average > > document length' doesn't > > >> make > > >> a whole lot of sense (cases like this). > > >> > > >> For this same reason, I've only found a few > > collections where BM25's doc > > >> length normalization is really significantly > > better than Lucene's. > > >> > > >> In my opinion, the results on a particular test > > collection or 2 have > > >> perhaps > > >> been taken too far and created a myth that BM25 is > > always superior to > > >> Lucene's scoring... this is not true! > > >> > > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov > > <iprov...@yahoo.com> > > >> wrote: > > >> > > >> I applied the Lucene patch mentioned in > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and > > ran the MAP > > >>> numbers > > >>> on TREC-3 collection using topics > > 151-200. I am not getting worse > > >>> results > > >>> comparing to Lucene DefaultSimilarity. I > > suspect, I am not using it > > >>> correctly. I have single field > > documents. This is the process I use: > > >>> > > >>> 1. During the indexing, I am setting the > > similarity to BM25 as such: > > >>> > > >>> IndexWriter writer = new IndexWriter(dir, new > > StandardAnalyzer( > > >>> > > Version.LUCENE_CURRENT), true, > > >>> > > IndexWriter.MaxFieldLength.UNLIMITED); > > >>> writer.setSimilarity(new BM25Similarity()); > > >>> > > >>> 2. During the Precision/Recall measurements, I > > am using a > > >>> SimpleBM25QQParser extension I added to the > > benchmark: > > >>> > > >>> QualityQueryParser qqParser = new > > SimpleBM25QQParser("title", "TEXT"); > > >>> > > >>> > > >>> 3. Here is the parser code (I set an avg doc > > length here): > > >>> > > >>> public Query parse(QualityQuery qq) throws > > ParseException { > > >>> BM25Parameters.setAverageLength(indexField, > > 798.30f);//avg doc length > > >>> BM25Parameters.setB(0.5f);//tried > > default values > > >>> BM25Parameters.setK1(2f); > > >>> return query = new > > BM25BooleanQuery(qq.getValue(qqName), indexField, > > >>> new > > >>> StandardAnalyzer(Version.LUCENE_CURRENT)); > > >>> } > > >>> > > >>> 4. The searcher is using BM25 similarity: > > >>> > > >>> Searcher searcher = new IndexSearcher(dir, > > true); > > >>> searcher.setSimilarity(sim); > > >>> > > >>> Am I missing some steps? Does anyone > > have experience with this code? > > >>> > > >>> Thanks, > > >>> > > >>> Ivan > > >>> > > >>> > > >>> > > >>> > > >>> > > --------------------------------------------------------------------- > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org > > >>> > > >>> > > >>> > > >> > > >> > > > -- > > > > > ----------------------------------------------------------- > > > Joaquín Pérez Iglesias > > > Dpto. Lenguajes y Sistemas Informáticos > > > E.T.S.I. Informática (UNED) > > > Ciudad Universitaria > > > C/ Juan del Rosal nº 16 > > > 28040 Madrid - Spain > > > Phone. +34 91 398 89 19 > > > Fax +34 91 398 65 35 > > > Office 2.11 > > > Email: joaquin.pe...@lsi.uned.es > > > web: http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/> < > http://nlp.uned.es/%7Ejperezi/> > > > > > ----------------------------------------------------------- > > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > > -- > > Robert Muir > > rcm...@gmail.com > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com