Re: BM25 Scoring Patch

Robert Muir Tue, 16 Feb 2010 11:08:09 -0800

I don't think it's really a competition; ideally we should have the
flexibility to change the scoring model in Lucene.


I have found lots of cases where VSM improves on BM25, but then again I
don't work with TREC collections; I work with non-English collections.

It doesn't contradict years of research to say that VSM isn't a
state-of-the-art model; besides the TREC-4 results, there are CLEF results
(Finnish, Russian, etc.) where VSM models perform competitively with or
exceed BM25/DFR/etc.

It depends on the collection; there isn't a single 'best retrieval formula'.

Note: I have no bias against BM25, but it's definitely a myth to say there
is a single retrieval formula that is the 'best' across the board.
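
To make the collection dependence a bit more concrete, here is a rough
sketch of the textbook BM25 term-frequency component. It is not code from
the LUCENE-2091 patch; the class and method names are made up for
illustration. The numbers are the ones quoted further down in this thread
(k1 = 2, b = 0.5, configured average length 798.30, sub-collection averages
of 353 and 3543). It only shows how much the same term frequency is scaled
depending on where a document's length sits relative to the configured
average.

```java
// Hypothetical illustration class; not part of Lucene or the LUCENE-2091 patch.
public class Bm25LengthNormSketch {

    /**
     * Textbook BM25 term-frequency component (IDF omitted):
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     */
    public static double bm25Tf(double tf, double docLen, double avgDocLen,
                                double k1, double b) {
        // Length normalization: 1.0 when the document has exactly average length.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;            // value from the quoted parser code
        double b = 0.5;             // value from the quoted parser code
        double avgDocLen = 798.30;  // collection-wide average from the quoted parser code
        // Sub-collection average lengths quoted in the thread (AP, PAT).
        double[] docLens = {353.0, 3543.0};
        for (double docLen : docLens) {
            System.out.printf("docLen=%.0f tf=3 -> %.4f%n",
                    docLen, bm25Tf(3.0, docLen, avgDocLen, k1, b));
        }
    }
}
```

With one global average, a typical document from the long sub-collection is
penalized relative to a typical document from the short one for the same
term frequency. Whether that helps or hurts is an empirical question for
each collection.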


On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.pe...@lsi.uned.es> wrote:

> By the way,
>
> I don't want to start a VSM vs. BM25 flame war, but I feel I have to
> express my opinion as Robert has done. In my experience, I have never
> found a case where VSM significantly improves on BM25. Maybe you can find
> some cases under very specific collection characteristics (such as an
> average length of 300 vs. 3000) or with bad usage of BM25 (improper
> parameters) where it can happen.
>
> BM25 is not just a different way of doing length normalization; it is
> strongly grounded in the probabilistic framework, and it parametrises
> frequencies and length. It is probably the most successful ranking model
> of recent years in Information Retrieval.
>
> I have never read a paper where VSM improves on any of the state-of-the-art
> ranking models (Language Models, DFR, BM25, ...), although VSM with
> pivoted length normalisation can obtain good results. This can be verified
> by checking the last years of the TREC competition.
>
> Honestly, to say that it is a myth that BM25 improves on VSM contradicts
> the last 10 or 15 years of research in Information Retrieval, and I really
> believe that is not accurate.
>
> The good thing about Information Retrieval is that you can always run your
> own experiments, and you can draw on the experience of many years of
> research.
>
> PS: This opinion is based on experiments on TREC and CLEF collections.
> Obviously we could start a debate about the suitability of this type of
> experimentation (concept of relevance, pooling, relevance judgements), but
> that is a much more complex topic and I believe it is far from what we are
> dealing with here.
>
> PS2: Regarding TREC-4, Cornell used pivoted length normalisation and
> also applied pseudo-relevance feedback, which honestly makes the analysis
> of the results much more difficult. Obviously their results were part of
> the pool.
>
> Sorry for the huge mail :-))))
>
> > Hi Ivan,
> >
> > the problem is that unfortunately BM25
> > cannot be implemented by overriding
> > the Similarity interface. Therefore BM25Similarity
> > only computes the classic probabilistic IDF (which is
> > of interest only at search time).
> > If you set BM25Similarity at indexing time,
> > some basic stats are not stored
> > correctly in the segments (such as document lengths).
> >
> > When you use BM25BooleanQuery, this class
> > will automatically set the BM25Similarity for you,
> > so you don't need to do this explicitly.
> >
> > I tried to write this implementation with a focus on
> > not interfering with the typical use of Lucene (so no changes to
> > DefaultSimilarity).
> >
> >> Joaquin, Robert,
> >>
> >> I followed Joaquin's recommendation and removed the calls that set the
> >> similarity to BM25 explicitly (indexer, searcher).  The results showed a
> >> 55% improvement in the MAP score (0.141 -> 0.219) over the default
> >> similarity.
> >>
> >> Joaquin, how would setting the similarity to BM25 explicitly make the
> >> score worse?
> >>
> >> Thank you,
> >>
> >> Ivan
> >>
> >>
> >>
> >> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
> >>
> >>> From: Robert Muir <rcm...@gmail.com>
> >>> Subject: Re: BM25 Scoring Patch
> >>> To: java-user@lucene.apache.org
> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >>> Yes Ivan, if possible please report back any findings you can on the
> >>> experiments you are doing!
> >>>
> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >>> <joaquin.pe...@lsi.uned.es> wrote:
> >>>
> >>> > Hi Ivan,
> >>> >
> >>> > You shouldn't set the BM25Similarity for indexing or searching.
> >>> > Please try removing the lines:
> >>> >   writer.setSimilarity(new BM25Similarity());
> >>> >   searcher.setSimilarity(sim);
> >>> >
> >>> > Please let us/me know if you improve your results with these changes.
> >>> >
> >>> >
> >>> > Robert Muir wrote:
> >>> >
> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
> >>> >> default Similarity. Perhaps this is just another one?
> >>> >>
> >>> >> Again, while I have not worked with this particular collection, I
> >>> >> looked at the statistics and noted that it's composed of several
> >>> >> 'sub-collections': for example, the PAT documents on disk 3 have an
> >>> >> average doc length of 3543, but the AP documents on disk 1 have an avg
> >>> >> doc length of 353.
> >>> >>
> >>> >> I have found on other collections that any advantages of BM25's
> >>> >> document length normalization fall apart when 'average document
> >>> >> length' doesn't make a whole lot of sense (cases like this).
> >>> >>
> >>> >> For this same reason, I've only found a few collections where BM25's
> >>> >> doc length normalization is really significantly better than Lucene's.
> >>> >>
> >>> >> In my opinion, the results on a particular test collection or two have
> >>> >> perhaps been taken too far and created a myth that BM25 is always
> >>> >> superior to Lucene's scoring... this is not true!
> >>> >>
> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com>
> >>> >> wrote:
> >>> >>
> >>> >>> I applied the Lucene patch mentioned in
> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
> >>> >>> numbers on the TREC-3 collection using topics 151-200.  I am getting
> >>> >>> worse results compared to Lucene's DefaultSimilarity.  I suspect I am
> >>> >>> not using it correctly.  I have single-field documents.  This is the
> >>> >>> process I use:
> >>> >>>
> >>> >>> 1. During indexing, I set the similarity to BM25 as follows:
> >>> >>>
> >>> >>> IndexWriter writer = new IndexWriter(dir,
> >>> >>>     new StandardAnalyzer(Version.LUCENE_CURRENT), true,
> >>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
> >>> >>> writer.setSimilarity(new BM25Similarity());
> >>> >>>
> >>> >>> 2. During the Precision/Recall measurements, I use a
> >>> >>> SimpleBM25QQParser extension I added to the benchmark:
> >>> >>>
> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
> >>> >>>
> >>> >>> 3. Here is the parser code (I set an avg doc length here):
> >>> >>>
> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
> >>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
> >>> >>>   BM25Parameters.setB(0.5f); // tried default values
> >>> >>>   BM25Parameters.setK1(2f);
> >>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
> >>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
> >>> >>> }
> >>> >>>
> >>> >>> 4. The searcher is using BM25 similarity:
> >>> >>>
> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
> >>> >>> searcher.setSimilarity(sim);
> >>> >>>
> >>> >>> Am I missing some steps?  Does anyone have experience with this code?
> >>> >>>
> >>> >>> Thanks,
> >>> >>>
> >>> >>> Ivan
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> ---------------------------------------------------------------------
> >>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>
> >>> >>
> >>> > --
> >>> >
> >>> -----------------------------------------------------------
> >>> > Joaquín Pérez Iglesias
> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >>> > E.T.S.I. Informática (UNED)
> >>> > Ciudad Universitaria
> >>> > C/ Juan del Rosal nº 16
> >>> > 28040 Madrid - Spain
> >>> > Phone. +34 91 398 89 19
> >>> > Fax    +34 91 398 65 35
> >>> > Office  2.11
> >>> > Email: joaquin.pe...@lsi.uned.es
> >>> > web:   http://nlp.uned.es/~jperezi/
> >>> >
> >>> -----------------------------------------------------------
> >>> >
> >>> >
> >>>
> >>>
> >>> --
> >>> Robert Muir
> >>> rcm...@gmail.com
> >>>
> >>
> >
>


-- 
Robert Muir
rcm...@gmail.com
