Re: BM25 Scoring Patch

Robert Muir Tue, 16 Feb 2010 09:23:48 -0800

cool! i saw you were using StandardAnalyzer too, maybe you want to try using
stemming also (as this analyzer does not do stemming)... usually helps.


On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <iprov...@yahoo.com> wrote:

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55% improvement
> for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the score
> worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
>
> > From: Robert Muir <rcm...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 11:36 AM
> > yes Ivan, if possible please report
> > back any findings you can on the
> > experiments you are doing!
> >
> > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > <
> > joaquin.pe...@lsi.uned.es>
> > wrote:
> >
> > > Hi Ivan,
> > >
> > > You shouldn't set the BM25Similarity for indexing or
> > searching.
> > > Please try removing the lines:
> > >   writer.setSimilarity(new
> > BM25Similarity());
> > >   searcher.setSimilarity(sim);
> > >
> > > Please let us/me know if you improve your results with
> > these changes.
> > >
> > >
> > > Robert Muir escribió:
> > >
> > >  Hi Ivan, I've seen many cases where BM25
> > performs worse than Lucene's
> > >> default Similarity. Perhaps this is just another
> > one?
> > >>
> > >> Again while I have not worked with this particular
> > collection, I looked at
> > >> the statistics and noted that its composed of
> > several 'sub-collections':
> > >> for
> > >> example the PAT documents on disk 3 have an
> > average doc length of 3543,
> > >> but
> > >> the AP documents on disk 1 have an avg doc length
> > of 353.
> > >>
> > >> I have found on other collections that any
> > advantages of BM25's document
> > >> length normalization fall apart when 'average
> > document length' doesn't
> > >> make
> > >> a whole lot of sense (cases like this).
> > >>
> > >> For this same reason, I've only found a few
> > collections where BM25's doc
> > >> length normalization is really significantly
> > better than Lucene's.
> > >>
> > >> In my opinion, the results on a particular test
> > collection or 2 have
> > >> perhaps
> > >> been taken too far and created a myth that BM25 is
> > always superior to
> > >> Lucene's scoring... this is not true!
> > >>
> > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > <iprov...@yahoo.com>
> > >> wrote:
> > >>
> > >>  I applied the Lucene patch mentioned in
> > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > ran the MAP
> > >>> numbers
> > >>> on TREC-3 collection using topics
> > 151-200.  I am not getting worse
> > >>> results
> > >>> comparing to Lucene DefaultSimilarity.  I
> > suspect, I am not using it
> > >>> correctly.  I have single field
> > documents.  This is the process I use:
> > >>>
> > >>> 1. During the indexing, I am setting the
> > similarity to BM25 as such:
> > >>>
> > >>> IndexWriter writer = new IndexWriter(dir, new
> > StandardAnalyzer(
> > >>>
> >    Version.LUCENE_CURRENT), true,
> > >>>
> >    IndexWriter.MaxFieldLength.UNLIMITED);
> > >>> writer.setSimilarity(new BM25Similarity());
> > >>>
> > >>> 2. During the Precision/Recall measurements, I
> > am using a
> > >>> SimpleBM25QQParser extension I added to the
> > benchmark:
> > >>>
> > >>> QualityQueryParser qqParser = new
> > SimpleBM25QQParser("title", "TEXT");
> > >>>
> > >>>
> > >>> 3. Here is the parser code (I set an avg doc
> > length here):
> > >>>
> > >>> public Query parse(QualityQuery qq) throws
> > ParseException {
> > >>>   BM25Parameters.setAverageLength(indexField,
> > 798.30f);//avg doc length
> > >>>   BM25Parameters.setB(0.5f);//tried
> > default values
> > >>>   BM25Parameters.setK1(2f);
> > >>>   return query = new
> > BM25BooleanQuery(qq.getValue(qqName), indexField,
> > >>> new
> > >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> > >>> }
> > >>>
> > >>> 4. The searcher is using BM25 similarity:
> > >>>
> > >>> Searcher searcher = new IndexSearcher(dir,
> > true);
> > >>> searcher.setSimilarity(sim);
> > >>>
> > >>> Am I missing some steps?  Does anyone
> > have experience with this code?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Ivan
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > --
> > >
> > -----------------------------------------------------------
> > > Joaquín Pérez Iglesias
> > > Dpto. Lenguajes y Sistemas Informáticos
> > > E.T.S.I. Informática (UNED)
> > > Ciudad Universitaria
> > > C/ Juan del Rosal nº 16
> > > 28040 Madrid - Spain
> > > Phone. +34 91 398 89 19
> > > Fax    +34 91 398 65 35
> > > Office  2.11
> > > Email: joaquin.pe...@lsi.uned.es
> > > web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/> <
> http://nlp.uned.es/%7Ejperezi/>
> > >
> > -----------------------------------------------------------
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com

Re: BM25 Scoring Patch

Reply via email to