Just some final comments (as I said, I'm not interested in flame wars):

It cannot be that there is no problem with pooling when I obtain better
results, but that it is biased otherwise.
The only important thing (in my opinion) is that it cannot be said that
BM25 is a myth.
Yes, you are right that there is no single ranking model that beats all
the rest, but there are models that generally show better performance in
more cases.

Regarding CLEF, I have had the same experience (VSM vs. BM25) on Spanish
and English (WebCLEF) and on Q&A (ResPubliQA).

Ivan, check the parameters (b and k1); you can probably improve your
results (that's the bad part of BM25: it has to be tuned per collection).
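
For example, a rough sketch of a parameter sweep using the patch's own API
as it appears later in this thread (the field name, average length, query
string, and grid values are only illustrative):

  // Sweep b and k1, re-running the quality benchmark for each pair.
  float avgDocLength = 798.30f;  // measure this on your own index
  float[] bValues  = { 0.25f, 0.5f, 0.75f, 1.0f }; // b usually lies in [0, 1]
  float[] k1Values = { 0.8f, 1.2f, 2.0f };         // a typical k1 range
  for (float b : bValues) {
    for (float k1 : k1Values) {
      BM25Parameters.setAverageLength("TEXT", avgDocLength);
      BM25Parameters.setB(b);
      BM25Parameters.setK1(k1);
      Query query = new BM25BooleanQuery("example topic title", "TEXT",
          new StandardAnalyzer(Version.LUCENE_CURRENT));
      // ... run your TREC topics with this query and record the MAP ...
    }
  }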

Finally, we are just speaking from personal experience, so obviously you
should use the best model for your data based on your own experiments; in
IR there are neither myths nor a single best ranking model. If any of us
is able to find the “best” ranking model, or to prove that some
state-of-the-art model is a myth, he should send those results to the
SIGIR conference.

Ivan, Robert, good luck with your experiments; as I said, the good part
of IR is that you can always run experiments on your own.

> I don't think it's really a competition; I think ideally we should have
> the flexibility to change the scoring model in Lucene.
>
> I have found lots of cases where VSM improves on BM25, but then again I
> don't work with TREC stuff, as I work with non-English collections.
>
> It doesn't contradict years of research to say that VSM is still a
> competitive model: besides the TREC-4 results, there are CLEF results
> (Finnish, Russian, etc.) where VSM models perform competitively with, or
> exceed, BM25/DFR/etc.
>
> It depends on the collection; there isn't a 'best retrieval formula'.
>
> Note: I have no bias against BM25, but it's definitely a myth to say
> there is a single retrieval formula that is the 'best' across the board.
>
>
> On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> joaquin.pe...@lsi.uned.es> wrote:
>
>> By the way,
>>
>> I don't want to start a flame war (VSM vs. BM25), but I really believe
>> that I have to express my opinion, as Robert has done. In my experience,
>> I have never found a case where VSM significantly improves on BM25.
>> Maybe you can find some cases with very specific collection
>> characteristics (such as an average length of 300 vs. 3000), or with bad
>> usage of BM25 (improper parameters), where that can happen.
>>
>> BM25 is not just a different way of doing length normalization; it is
>> strongly based on the probabilistic framework, and it parametrises term
>> frequencies and document length. It is probably the most successful
>> ranking model of recent years in Information Retrieval.
>>
>> I have never read a paper where VSM improves on any of the
>> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
>> although VSM with pivoted length normalisation can obtain nice results.
>> This can be verified by checking the last years of the TREC competition.
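>>
>> (For reference, pivoted normalisation replaces the flat cosine norm with
>> roughly (1 - s) + s * (dl / avgdl), with the slope s often set around
>> 0.2; that correction is what makes VSM competitive on such collections.)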
>>
>> Honestly, to say that it is a myth that BM25 improves on VSM contradicts
>> the last 10 or 15 years of research in Information Retrieval, and I
>> really believe that is not accurate.
>>
>> The good thing about Information Retrieval is that you can always run
>> your own experiments, and you can draw on the experience of many years
>> of research.
>>
>> PS: This opinion is based on experiments with TREC and CLEF collections.
>> Obviously we could start a debate about the suitability of this type of
>> experimentation (the concept of relevance, pooling, relevance
>> judgements), but that is a much more complex topic and I believe it is
>> far from what we are dealing with here.
>>
>> PS2: In relation to TREC-4, Cornell used pivoted length normalisation
>> and applied pseudo-relevance feedback, which honestly makes the analysis
>> of the results much more difficult. Obviously their results were part of
>> the pool.
>>
>> Sorry for the huge mail :-))))
>>
>> > Hi Ivan,
>> >
>> > the problem is that unfortunately BM25 cannot be implemented by
>> > overriding the Similarity interface, so BM25Similarity only computes
>> > the classic probabilistic IDF (which matters only at search time). If
>> > you set BM25Similarity at indexing time, some basic stats (such as
>> > document lengths) are not stored correctly in the segments.
>> >
>> > When you use BM25BooleanQuery, this class sets BM25Similarity for you
>> > automatically, so you don't need to do it explicitly.
>> >
>> > I tried to make this implementation with a focus on not interfering
>> > with the typical use of Lucene (so DefaultSimilarity is not changed).
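>> >
>> > In other words, the intended usage is roughly this (an untested
>> > sketch; the field name and average length are only examples):
>> >
>> >   // Index with a plain IndexWriter -- no setSimilarity() call.
>> >   IndexWriter writer = new IndexWriter(dir,
>> >       new StandardAnalyzer(Version.LUCENE_CURRENT), true,
>> >       IndexWriter.MaxFieldLength.UNLIMITED);
>> >   // ... add documents and close the writer ...
>> >
>> >   // At search time, configure the parameters and let BM25BooleanQuery
>> >   // install BM25Similarity internally.
>> >   BM25Parameters.setAverageLength("TEXT", 798.30f); // your avg doc length
>> >   BM25Parameters.setB(0.75f);  // common default
>> >   BM25Parameters.setK1(1.2f);  // common default
>> >   Query query = new BM25BooleanQuery("your query", "TEXT",
>> >       new StandardAnalyzer(Version.LUCENE_CURRENT));
>> >   Searcher searcher = new IndexSearcher(dir, true); // no setSimilarity()
>> >   TopDocs docs = searcher.search(query, 10);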
>> >
>> >> Joaquin, Robert,
>> >>
>> >> I followed Joaquin's recommendation and removed the calls that set the
>> >> similarity to BM25 explicitly (indexer, searcher). The results showed a
>> >> 55% improvement in MAP score (0.141 -> 0.219) over the default
>> >> similarity.
>> >>
>> >> Joaquin, how would setting the similarity to BM25 explicitly make the
>> >> score worse?
>> >>
>> >> Thank you,
>> >>
>> >> Ivan
>> >>
>> >>
>> >>
>> >> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
>> >>
>> >>> From: Robert Muir <rcm...@gmail.com>
>> >>> Subject: Re: BM25 Scoring Patch
>> >>> To: java-user@lucene.apache.org
>> >>> Date: Tuesday, February 16, 2010, 11:36 AM
>> >>> Yes Ivan, if possible please report back any findings you can on the
>> >>> experiments you are doing!
>> >>>
>> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> >>> <joaquin.pe...@lsi.uned.es> wrote:
>> >>>
>> >>> > Hi Ivan,
>> >>> >
>> >>> > You shouldn't set the BM25Similarity for indexing or searching.
>> >>> > Please try removing the lines:
>> >>> >
>> >>> >   writer.setSimilarity(new BM25Similarity());
>> >>> >   searcher.setSimilarity(sim);
>> >>> >
>> >>> > Please let us/me know if your results improve with these changes.
>> >>> >
>> >>> >
>> >>> > Robert Muir wrote:
>> >>> >
>> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
>> >>> >> Lucene's default Similarity. Perhaps this is just another one?
>> >>> >>
>> >>> >> Again, while I have not worked with this particular collection, I
>> >>> >> looked at the statistics and noted that it's composed of several
>> >>> >> 'sub-collections': for example, the PAT documents on disk 3 have
>> >>> >> an average doc length of 3543, but the AP documents on disk 1 have
>> >>> >> an avg doc length of 353.
>> >>> >>
>> >>> >> I have found on other collections that any advantages of BM25's
>> >>> >> document length normalization fall apart when 'average document
>> >>> >> length' doesn't make a whole lot of sense (cases like this).
>> >>> >>
>> >>> >> For this same reason, I've only found a few collections where
>> >>> >> BM25's doc length normalization is really significantly better
>> >>> >> than Lucene's.
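>> >>> >>
>> >>> >> (For reference, document length enters BM25's term-frequency
>> >>> >> weight roughly as tf / (tf + k1 * (1 - b + b * dl/avgdl)); when
>> >>> >> avgdl is an average over sub-collections whose lengths differ by
>> >>> >> a factor of ten, that normalization is skewed for both.)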
>> >>> >>
>> >>> >> In my opinion, the results on a particular test collection or two
>> >>> >> have perhaps been taken too far and created a myth that BM25 is
>> >>> >> always superior to Lucene's scoring... this is not true!
>> >>> >>
>> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> >>> >> <iprov...@yahoo.com> wrote:
>> >>> >>
>> >>> >>> I applied the Lucene patch mentioned in
>> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
>> >>> >>> numbers on the TREC-3 collection using topics 151-200. I am now
>> >>> >>> getting worse results compared to Lucene's DefaultSimilarity. I
>> >>> >>> suspect I am not using it correctly. I have single-field
>> >>> >>> documents. This is the process I use:
>> >>> >>>
>> >>> >>> 1. During the indexing, I am setting the similarity to BM25 as
>> >>> >>> such:
>> >>> >>>
>> >>> >>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>> >>> >>>     Version.LUCENE_CURRENT), true,
>> >>> >>>     IndexWriter.MaxFieldLength.UNLIMITED);
>> >>> >>> writer.setSimilarity(new BM25Similarity());
>> >>> >>>
>> >>> >>> 2. During the Precision/Recall measurements, I am using a
>> >>> >>> SimpleBM25QQParser extension I added to the benchmark:
>> >>> >>>
>> >>> >>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>> >>> >>>
>> >>> >>> 3. Here is the parser code (I set an avg doc length here):
>> >>> >>>
>> >>> >>> public Query parse(QualityQuery qq) throws ParseException {
>> >>> >>>   BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
>> >>> >>>   BM25Parameters.setB(0.5f); // tried default values
>> >>> >>>   BM25Parameters.setK1(2f);
>> >>> >>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >>> >>>       new StandardAnalyzer(Version.LUCENE_CURRENT));
>> >>> >>> }
>> >>> >>>
>> >>> >>> 4. The searcher is using BM25 similarity:
>> >>> >>>
>> >>> >>> Searcher searcher = new IndexSearcher(dir, true);
>> >>> >>> searcher.setSimilarity(sim);
>> >>> >>>
>> >>> >>> Am I missing some steps? Does anyone have experience with this
>> >>> >>> code?
>> >>> >>>
>> >>> >>> Thanks,
>> >>> >>>
>> >>> >>> Ivan
>> >>> >>
>> >>> >>
>> >>> > --
>> >>> > -----------------------------------------------------------
>> >>> > Joaquín Pérez Iglesias
>> >>> > Dpto. Lenguajes y Sistemas Informáticos
>> >>> > E.T.S.I. Informática (UNED)
>> >>> > Ciudad Universitaria
>> >>> > C/ Juan del Rosal nº 16
>> >>> > 28040 Madrid - Spain
>> >>> > Phone. +34 91 398 89 19
>> >>> > Fax    +34 91 398 65 35
>> >>> > Office  2.11
>> >>> > Email: joaquin.pe...@lsi.uned.es
>> >>> > web:   http://nlp.uned.es/~jperezi/
>> >>> >
>> >>> > -----------------------------------------------------------
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Robert Muir
>> >>> rcm...@gmail.com
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>


