Re: BM25 Scoring Patch

Ivan Provalov Tue, 16 Feb 2010 11:17:23 -0800

I will publish the results by the end of the week, once we have run the
experiments on the full collection.  Are you talking about the bias caused by
using a sub-collection?
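
On the separate document-length point raised earlier in the thread (quoted
below), here is a rough sketch of how BM25 length normalization reacts when a
single average length is applied to sub-collections with very different
lengths. It assumes the textbook BM25 term-frequency formula with the IDF
factor left out, which is not necessarily what the LUCENE-2091 patch computes.
The class and method names are made up for illustration. The values k1 = 2,
b = 0.5, the 798.30 average, and the 353 / 3543 sub-collection averages are the
ones quoted below.

```java
public class Bm25LengthNorm {
    /** Textbook BM25 length-normalization factor: 1 - b + b * (docLen / avgDocLen). */
    public static double lengthNorm(double docLen, double avgDocLen, double b) {
        return 1.0 - b + b * (docLen / avgDocLen);
    }

    /** Textbook BM25 term-frequency component (IDF factor omitted). */
    public static double tfComponent(double tf, double docLen, double avgDocLen,
                                     double k1, double b) {
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm(docLen, avgDocLen, b));
    }

    public static void main(String[] args) {
        // Parameters quoted in the thread; assumed here to apply to one field.
        double k1 = 2.0, b = 0.5, avgDocLen = 798.30;
        // AP average, whole-collection average, PAT average (from the thread).
        double[] docLens = {353, 798.30, 3543};
        for (double docLen : docLens) {
            System.out.printf("docLen=%.0f norm=%.3f tfComponent(tf=3)=%.3f%n",
                    docLen, lengthNorm(docLen, avgDocLen, b),
                    tfComponent(3, docLen, avgDocLen, k1, b));
        }
    }
}
```

With these inputs the normalization factor is below 1 for an AP-length document
and well above 1 for a PAT-length one, so the same term frequency contributes
noticeably less for the longer document.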


Thanks,

Ivan
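
P.S. For anyone checking the percentages quoted below: they are relative
changes in MAP over the baseline. A minimal sketch of that arithmetic, using
the 0.141 and 0.219 values from the earlier message (the class and method names
are only illustrative):

```java
public class MapImprovement {
    /** Relative change from baseline to candidate, expressed in percent. */
    public static double relativeImprovementPercent(double baseline, double candidate) {
        if (baseline <= 0.0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // MAP with the default similarity vs. MAP of the BM25 run, as quoted in the thread.
        double pct = relativeImprovementPercent(0.141, 0.219);
        System.out.printf("relative MAP improvement: %.1f%%%n", pct);
    }
}
```

For 0.141 -> 0.219 this works out to roughly 55.3%, consistent with the 55%
reported below.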

--- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:

> From: Robert Muir <rcm...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 2:11 PM
> Ivan, ok. It would be cool if you could list the MAP and bpref for the
> different approaches you try (default Lucene, lnb.ltc, BM25), with or
> without stemming.
>
> As you reported previously, you got a 24% improvement with lnb.ltc
> (right?). I am guessing that we won't be able to draw many conclusions
> at all due to bias.
> 
> On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <iprov...@yahoo.com> wrote:
>
> > Robert, Joaquin,
> >
> > Sorry, I made an error reporting the results. The preliminary
> > improvement is around 21% (it's a reduced collection). I will have to
> > run another test to get the final numbers on the complete collection.
> >
> > We are also planning to apply stemming. Right now we are trying to
> > isolate each improvement experiment.
> >
> > Thanks,
> >
> > Ivan
> >
> > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > From: Robert Muir <rcm...@gmail.com>
> > > Subject: Re: BM25 Scoring Patch
> > > To: java-user@lucene.apache.org
> > > Date: Tuesday, February 16, 2010, 1:14 PM
> > > > Ivan, just a little more food for thought to help you with this:
> > > >
> > > > I'm glad you got improved results, yet I stand by my original
> > > > statement: 'be careful' about interpreting too much from one
> > > > collection.
> > > >
> > > > E.g., had you chosen TREC-4 instead of TREC-3, you would see
> > > > different results, as vector-space with non-cosine doc length norm
> > > > (LUCENE-2187) performed better than BM25 there:
> > > > http://trec.nist.gov/pubs/trec4/overview.ps.gz
> > > >
> > > > In truth, it's hard to 'reuse' a pooled test collection to compare
> > > > methods that were not part of the pool:
> > > > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> > > >
> > > > This might help explain why you see such a difference in MAP score!
> > > >
> > > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <iprov...@yahoo.com> wrote:
> > >
> > > > > Joaquin, Robert,
> > > > >
> > > > > I followed Joaquin's recommendation and removed the calls that set
> > > > > the similarity to BM25 explicitly (indexer, searcher). The results
> > > > > showed a 55% improvement in the MAP score (0.141 -> 0.219) over the
> > > > > default similarity.
> > > > >
> > > > > Joaquin, how would setting the similarity to BM25 explicitly make
> > > > > the score worse?
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Ivan
> > > > >
> > > > > --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
> > > >
> > > > > From: Robert Muir <rcm...@gmail.com>
> > > > > Subject: Re: BM25 Scoring Patch
> > > > > To: java-user@lucene.apache.org
> > > > > > Date: Tuesday, February 16, 2010, 11:36 AM
> > > > > > Yes Ivan, if possible please report back any findings you can on
> > > > > > the experiments you are doing!
> > > > > >
> > > > > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > > > > > <joaquin.pe...@lsi.uned.es> wrote:
> > > > >
> > > > > > > Hi Ivan,
> > > > > > >
> > > > > > > You shouldn't set the BM25Similarity for indexing or searching.
> > > > > > > Please try removing the lines:
> > > > > > >
> > > > > > >   writer.setSimilarity(new BM25Similarity());
> > > > > > >   searcher.setSimilarity(sim);
> > > > > > >
> > > > > > > Please let us/me know if you improve your results with these
> > > > > > > changes.
> > > > > > >
> > > > > > >
> > > > > > > Robert Muir wrote:
> > > > > >
> > > > > >  Hi Ivan, I've seen many
> cases where
> > > BM25
> > > > > performs worse than Lucene's
> > > > > >> default Similarity. Perhaps
> this is just
> > > another
> > > > > one?
> > > > > >>
> > > > > >> Again while I have not worked
> with this
> > > particular
> > > > > collection, I looked at
> > > > > >> the statistics and noted that
> its
> > > composed of
> > > > > several 'sub-collections':
> > > > > >> for
> > > > > >> example the PAT documents on
> disk 3 have
> > > an
> > > > > average doc length of 3543,
> > > > > >> but
> > > > > >> the AP documents on disk 1
> have an avg
> > > doc length
> > > > > of 353.
> > > > > >>
> > > > > >> I have found on other
> collections that
> > > any
> > > > > advantages of BM25's document
> > > > > >> length normalization fall
> apart when
> > > 'average
> > > > > document length' doesn't
> > > > > >> make
> > > > > >> a whole lot of sense (cases
> like this).
> > > > > >>
> > > > > >> For this same reason, I've
> only found a
> > > few
> > > > > collections where BM25's doc
> > > > > >> length normalization is
> really
> > > significantly
> > > > > better than Lucene's.
> > > > > >>
> > > > > >> In my opinion, the results on
> a
> > > particular test
> > > > > collection or 2 have
> > > > > >> perhaps
> > > > > >> been taken too far and created
> a myth
> > > that BM25 is
> > > > > always superior to
> > > > > >> Lucene's scoring... this is
> not true!
> > > > > >>
> > > > > >> On Tue, Feb 16, 2010 at 9:46
> AM, Ivan
> > > Provalov
> > > > > <iprov...@yahoo.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>  I applied the Lucene
> patch
> > > mentioned in
> > > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > > ran the MAP
> > > > > >>> numbers
> > > > > >>> on TREC-3 collection using
> topics
> > > > > 151-200.  I am not getting worse
> > > > > >>> results
> > > > > >>> comparing to Lucene
> > > DefaultSimilarity.  I
> > > > > suspect, I am not using it
> > > > > >>> correctly.  I have
> single
> > > field
> > > > > documents.  This is the process I
> use:
> > > > > >>>
> > > > > > >>> 1. During the indexing, I am setting the similarity to BM25 as such:
> > > > > > >>>
> > > > > > >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
> > > > > > >>>     Version.LUCENE_CURRENT), true,
> > > > > > >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > > >>> writer.setSimilarity(new BM25Similarity());
> > > > > > >>>
> > > > > > >>> 2. During the Precision/Recall measurements, I am using a
> > > > > > >>> SimpleBM25QQParser extension I added to the benchmark:
> > > > > > >>>
> > > > > > >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> > > > > > >>>
> > > > > > >>> 3. Here is the parser code (I set an avg doc length here):
> > > > > > >>>
> > > > > > >>> public Query parse(QualityQuery qq) throws ParseException {
> > > > > > >>>   BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
> > > > > > >>>   BM25Parameters.setB(0.5f);//tried default values
> > > > > > >>>   BM25Parameters.setK1(2f);
> > > > > > >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> > > > > > >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> 4. The searcher is using BM25 similarity:
> > > > > > >>>
> > > > > > >>> Searcher searcher = new IndexSearcher(dir, true);
> > > > > > >>> searcher.setSimilarity(sim);
> > > > > > >>>
> > > > > > >>> Am I missing some steps? Does anyone have experience with this code?
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>>
> > > > > > >>> Ivan
> > > > > > >>>
> > > > >
> > >
> ---------------------------------------------------------------------
> > > > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > >>> For additional commands,
> e-mail:
> > java-user-h...@lucene.apache.org
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > >
> > > > >
> > >
> -----------------------------------------------------------
> > > > > > Joaquín Pérez Iglesias
> > > > > > Dpto. Lenguajes y Sistemas
> Informáticos
> > > > > > E.T.S.I. Informática (UNED)
> > > > > > Ciudad Universitaria
> > > > > > C/ Juan del Rosal nº 16
> > > > > > 28040 Madrid - Spain
> > > > > > Phone. +34 91 398 89 19
> > > > > > Fax    +34 91 398 65 35
> > > > > > Office  2.11
> > > > > > Email: joaquin.pe...@lsi.uned.es
> > > > > > web:   http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/><
> > http://nlp.uned.es/%7Ejperezi/> <
> > > > http://nlp.uned.es/%7Ejperezi/>
> > > > > >
> > > > >
> > >
> -----------------------------------------------------------
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
> ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-h...@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcm...@gmail.com
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
> > >
> >
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
