Re: BM25 Scoring Patch

Ivan Provalov Tue, 16 Feb 2010 09:16:01 -0800

Joaquin, Robert,

I followed Joaquin's recommendation and removed the call to set similarity to 
BM25 explicitly (indexer, searcher).  The results showed 55% improvement for 
the MAP score (0.141->0.219) over default similarity.


Joaquin, how would setting the similarity to BM25 explicitly make the score 
worse?

Thank you,

Ivan



--- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:

> From: Robert Muir <rcm...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 11:36 AM
> yes Ivan, if possible please report
> back any findings you can on the
> experiments you are doing!
> 
> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> <
> joaquin.pe...@lsi.uned.es>
> wrote:
> 
> > Hi Ivan,
> >
> > You shouldn't set the BM25Similarity for indexing or
> searching.
> > Please try removing the lines:
> >   writer.setSimilarity(new
> BM25Similarity());
> >   searcher.setSimilarity(sim);
> >
> > Please let us/me know if you improve your results with
> these changes.
> >
> >
> > Robert Muir escribió:
> >
> >  Hi Ivan, I've seen many cases where BM25
> performs worse than Lucene's
> >> default Similarity. Perhaps this is just another
> one?
> >>
> >> Again while I have not worked with this particular
> collection, I looked at
> >> the statistics and noted that its composed of
> several 'sub-collections':
> >> for
> >> example the PAT documents on disk 3 have an
> average doc length of 3543,
> >> but
> >> the AP documents on disk 1 have an avg doc length
> of 353.
> >>
> >> I have found on other collections that any
> advantages of BM25's document
> >> length normalization fall apart when 'average
> document length' doesn't
> >> make
> >> a whole lot of sense (cases like this).
> >>
> >> For this same reason, I've only found a few
> collections where BM25's doc
> >> length normalization is really significantly
> better than Lucene's.
> >>
> >> In my opinion, the results on a particular test
> collection or 2 have
> >> perhaps
> >> been taken too far and created a myth that BM25 is
> always superior to
> >> Lucene's scoring... this is not true!
> >>
> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> <iprov...@yahoo.com>
> >> wrote:
> >>
> >>  I applied the Lucene patch mentioned in
> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> ran the MAP
> >>> numbers
> >>> on TREC-3 collection using topics
> 151-200.  I am not getting worse
> >>> results
> >>> comparing to Lucene DefaultSimilarity.  I
> suspect, I am not using it
> >>> correctly.  I have single field
> documents.  This is the process I use:
> >>>
> >>> 1. During the indexing, I am setting the
> similarity to BM25 as such:
> >>>
> >>> IndexWriter writer = new IndexWriter(dir, new
> StandardAnalyzer(
> >>>           
>    Version.LUCENE_CURRENT), true,
> >>>           
>    IndexWriter.MaxFieldLength.UNLIMITED);
> >>> writer.setSimilarity(new BM25Similarity());
> >>>
> >>> 2. During the Precision/Recall measurements, I
> am using a
> >>> SimpleBM25QQParser extension I added to the
> benchmark:
> >>>
> >>> QualityQueryParser qqParser = new
> SimpleBM25QQParser("title", "TEXT");
> >>>
> >>>
> >>> 3. Here is the parser code (I set an avg doc
> length here):
> >>>
> >>> public Query parse(QualityQuery qq) throws
> ParseException {
> >>>   BM25Parameters.setAverageLength(indexField,
> 798.30f);//avg doc length
> >>>   BM25Parameters.setB(0.5f);//tried
> default values
> >>>   BM25Parameters.setK1(2f);
> >>>   return query = new
> BM25BooleanQuery(qq.getValue(qqName), indexField,
> >>> new
> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> >>> }
> >>>
> >>> 4. The searcher is using BM25 similarity:
> >>>
> >>> Searcher searcher = new IndexSearcher(dir,
> true);
> >>> searcher.setSimilarity(sim);
> >>>
> >>> Am I missing some steps?  Does anyone
> have experience with this code?
> >>>
> >>> Thanks,
> >>>
> >>> Ivan
> >>>
> >>>
> >>>
> >>>
> >>>
> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >>>
> >>>
> >>
> >>
> > --
> >
> -----------------------------------------------------------
> > Joaquín Pérez Iglesias
> > Dpto. Lenguajes y Sistemas Informáticos
> > E.T.S.I. Informática (UNED)
> > Ciudad Universitaria
> > C/ Juan del Rosal nº 16
> > 28040 Madrid - Spain
> > Phone. +34 91 398 89 19
> > Fax    +34 91 398 65 35
> > Office  2.11
> > Email: joaquin.pe...@lsi.uned.es
> > web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
> >
> -----------------------------------------------------------
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: BM25 Scoring Patch

Reply via email to