Joaquin, my typical methodology is not to optimize any scoring parameters: neither the BM25 parameters (I stick with your defaults) nor the lnb.ltc parameters (I stick with the default slope). When doing query expansion I don't modify the defaults for MoreLikeThis either.
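[Editor's note: for readers following along, the parameters under discussion (k1 and b) enter the per-term BM25 weight roughly as below. This is a generic textbook sketch with illustrative names and numbers, not the LUCENE-2091 implementation.]

```java
// Illustrative BM25 term weight (textbook form, not the LUCENE-2091 code).
// k1 controls term-frequency saturation; b controls how strongly the
// document-length / average-length ratio normalizes the score.
public class Bm25Sketch {
    static double bm25Weight(double tf, double docLen, double avgDocLen,
                             double k1, double b, double idf) {
        double norm = k1 * ((1 - b) + b * (docLen / avgDocLen));
        return idf * (tf * (k1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        // With b = 0.75, a document twice the average length is penalized:
        System.out.println(bm25Weight(2, 100, 100, 2.0, 0.75, 1.0)); // 1.5
        System.out.println(bm25Weight(2, 200, 100, 2.0, 0.75, 1.0)); // ~1.09
        // With b = 0, document length has no effect at all:
        System.out.println(bm25Weight(2, 200, 100, 2.0, 0.0, 1.0)); // 1.5
    }
}
```

Tuning b and k1 per collection shifts these weights substantially, which is why the defaults matter so much in the experiments discussed below.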
I've found that changing these parameters can make a significant difference in retrieval performance, which is interesting, but I'm typically focused on text analysis (how is the text indexed? stemming? stopwords?). I also feel that such things are corpus-specific, which I generally try to avoid in my work. For example, in analysis work the text collection often has a majority of text in a specific tense (e.g. news), so I don't try to tune any part of analysis, as I worry this would be corpus-specific... I do the same with scoring.

As for why some models perform better than others for certain languages, I think this is a million-dollar question. My intuition (I don't have references or anything to back this up) is that probabilistic models outperform vector-space models when you are using approaches like n-grams: you don't have nice stopword lists, stemming, decompounding, etc. This is particularly interesting to me, as probabilistic model + n-grams is a very general multilingual approach that I would like to have working well in Lucene. It's also important as a "default" when we don't have a nicely tuned analyzer available that will work well with a vector-space model. In my opinion, vector-space tends to fall apart without good language support.

On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.pe...@lsi.uned.es> wrote:

> Ok,
>
> I'm not advocating the BM25 patch either; unfortunately BM25 was not my
> idea :-))), and I'm sure that the implementation can be improved.
>
> When you use the BM25 implementation, are you optimising the parameters
> specifically per collection? (It is a key factor for improving BM25
> performance.)
>
> Why do you think that BM25 works better for English than for other
> languages (apart from the experiments)? What are your intuitions?
>
> I don't have much experience with languages other than Spanish and
> English, and it sounds pretty interesting.
>
> Kind Regards.
> P.S.: Maybe this is not a topic for this list???
>
> > Joaquin, I don't see this as a flame war. First of all, I'd like to
> > personally thank you for your excellent BM25 implementation!
> >
> > I think the selection of a retrieval model depends highly on the
> > language/indexing approach; i.e. if we were talking about East Asian
> > languages, I think we want a probabilistic model: no argument there!
> >
> > All I said was that it is a myth that BM25 is "always" better than
> > Lucene's scoring model; it really depends on what you are trying to do,
> > how you are indexing your text, the properties of your corpus, and how
> > your queries are running.
> >
> > I don't even want to come across as advocating the lnb.ltc approach
> > either; sure, I wrote the patch, but this means nothing. I only like it
> > because it's currently a simple integration into Lucene, but long-term
> > it's best if we can support other models also!
> >
> > Finally, I think there is something to be said for Lucene's default
> > retrieval model, which in my (non-English) findings across the board
> > isn't terrible at all... then again, I am working with languages where
> > analysis is really the thing holding Lucene back, not scoring.
> >
> > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> > joaquin.pe...@lsi.uned.es> wrote:
> >
> >> Just some final comments (as I said, I'm not interested in flame wars).
> >>
> >> If I obtain better results, there is no problem with pooling; otherwise
> >> it is biased.
> >> The only important thing (in my opinion) is that it cannot be said that
> >> BM25 is a myth.
> >> Yes, you are right that there is no single ranking model that beats the
> >> rest, but there are models that generally show better performance in
> >> more cases.
> >>
> >> About CLEF, I have had the same experience (VSM vs BM25) on Spanish and
> >> English (WebCLEF) and Q&A (ResPubliQA).
> >>
> >> Ivan, check the parameters (b and k1); you can probably improve your
> >> results. (That's the bad part of BM25.)
> >>
> >> Finally, we are just speaking from personal experience, so obviously you
> >> should use the best model for your data and your own experience; in IR
> >> there are no myths and no best ranking models. If any of us were able to
> >> find the “best” ranking model, or to prove that any state-of-the-art
> >> model is a myth, he should send those results to the SIGIR conference.
> >>
> >> Ivan, Robert, good luck with your experiments; as I said, the good part
> >> of IR is that you can always run experiments on your own.
> >>
> >> > I don't think it's really a competition; I think preferably we should
> >> > have the flexibility to change the scoring model in Lucene, actually.
> >> >
> >> > I have found lots of cases where VSM improves on BM25, but then again
> >> > I don't work with TREC stuff, as I work with non-English collections.
> >> >
> >> > It doesn't contradict years of research to say that VSM is a
> >> > state-of-the-art model: besides the TREC-4 results, there are CLEF
> >> > results where VSM models perform competitively with or exceed
> >> > BM25/DFR/etc. (Finnish, Russian, etc.).
> >> >
> >> > It depends on the collection; there isn't a 'best retrieval formula'.
> >> >
> >> > Note: I have no bias against BM25, but it's definitely a myth to say
> >> > there is a single retrieval formula that is the 'best' across the
> >> > board.
> >> >
> >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> >> > joaquin.pe...@lsi.uned.es> wrote:
> >> >
> >> >> By the way,
> >> >>
> >> >> I don't want to start a flame war VSM vs BM25, but I really believe
> >> >> that I have to express my opinion as Robert has done.
> >> >> In my experience, I have never found a case where VSM significantly
> >> >> improves on BM25. Maybe you can find some cases under very specific
> >> >> collection characteristics (such as an average length of 300 vs 3000),
> >> >> or with a bad usage of BM25 (improper parameters), where it can
> >> >> happen.
> >> >>
> >> >> BM25 is not just a different way of doing length normalization; it is
> >> >> strongly based on the probabilistic framework, and it parametrises
> >> >> frequencies and length. It is probably the most successful ranking
> >> >> model of the last years in Information Retrieval.
> >> >>
> >> >> I have never read a paper where VSM improves on any of the
> >> >> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
> >> >> although VSM with pivoted length normalisation can obtain nice
> >> >> results. This can be checked against the last years of the TREC
> >> >> competition.
> >> >>
> >> >> Honestly, to say that it is a myth that BM25 improves on VSM breaks
> >> >> the last 10 or 15 years of research in Information Retrieval, and I
> >> >> really believe that is not accurate.
> >> >>
> >> >> The good thing about Information Retrieval is that you can always run
> >> >> your own experiments, and you can draw on the experience of many
> >> >> years of research.
> >> >>
> >> >> PS: This opinion is based on experiments with TREC and CLEF
> >> >> collections; obviously we could start a debate about the suitability
> >> >> of this type of experimentation (concept of relevance, pooling,
> >> >> relevance judgements), but that is a much more complex topic and I
> >> >> believe it is far from what we are dealing with here.
> >> >>
> >> >> PS2: In relation to TREC-4, Cornell used pivoted length normalisation
> >> >> and applied pseudo-relevance feedback, which honestly makes the
> >> >> analysis of the results much more difficult.
> >> >> Obviously their results were part of the pool.
> >> >>
> >> >> Sorry for the huge mail :-))))
> >> >>
> >> >> > Hi Ivan,
> >> >> >
> >> >> > The problem is that unfortunately BM25 cannot be implemented by
> >> >> > overriding the Similarity interface. Therefore BM25Similarity only
> >> >> > computes the classic probabilistic IDF (which is interesting only
> >> >> > at search time). If you set BM25Similarity at indexing time, some
> >> >> > basic stats (like document lengths) are not stored correctly in the
> >> >> > segments.
> >> >> >
> >> >> > When you use BM25BooleanQuery, this class will set the
> >> >> > BM25Similarity for you automatically, so you don't need to do it
> >> >> > explicitly.
> >> >> >
> >> >> > I tried to make this implementation with a focus on not interfering
> >> >> > with the typical use of Lucene (so no changes to
> >> >> > DefaultSimilarity).
> >> >> >
> >> >> >> Joaquin, Robert,
> >> >> >>
> >> >> >> I followed Joaquin's recommendation and removed the calls that set
> >> >> >> the similarity to BM25 explicitly (indexer, searcher). The results
> >> >> >> showed a 55% improvement in MAP score (0.141 -> 0.219) over the
> >> >> >> default similarity.
> >> >> >>
> >> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
> >> >> >> the score worse?
> >> >> >>
> >> >> >> Thank you,
> >> >> >>
> >> >> >> Ivan
> >> >> >>
> >> >> >> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote:
> >> >> >>
> >> >> >>> From: Robert Muir <rcm...@gmail.com>
> >> >> >>> Subject: Re: BM25 Scoring Patch
> >> >> >>> To: java-user@lucene.apache.org
> >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >> >> >>>
> >> >> >>> Yes Ivan, if possible please report back any findings you can on
> >> >> >>> the experiments you are doing!
> >> >> >>>
> >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <
> >> >> >>> joaquin.pe...@lsi.uned.es> wrote:
> >> >> >>>
> >> >> >>> > Hi Ivan,
> >> >> >>> >
> >> >> >>> > You shouldn't set the BM25Similarity for indexing or searching.
> >> >> >>> > Please try removing the lines:
> >> >> >>> >
> >> >> >>> >     writer.setSimilarity(new BM25Similarity());
> >> >> >>> >     searcher.setSimilarity(sim);
> >> >> >>> >
> >> >> >>> > Please let us/me know if you improve your results with these
> >> >> >>> > changes.
> >> >> >>> >
> >> >> >>> > Robert Muir wrote:
> >> >> >>> >
> >> >> >>> >> Hi Ivan, I've seen many cases where BM25 performs worse than
> >> >> >>> >> Lucene's default Similarity. Perhaps this is just another one?
> >> >> >>> >>
> >> >> >>> >> Again, while I have not worked with this particular
> >> >> >>> >> collection, I looked at the statistics and noted that it is
> >> >> >>> >> composed of several 'sub-collections': for example, the PAT
> >> >> >>> >> documents on disk 3 have an average doc length of 3543, but
> >> >> >>> >> the AP documents on disk 1 have an average doc length of 353.
> >> >> >>> >>
> >> >> >>> >> I have found on other collections that any advantages of
> >> >> >>> >> BM25's document length normalization fall apart when 'average
> >> >> >>> >> document length' doesn't make a whole lot of sense (cases like
> >> >> >>> >> this).
> >> >> >>> >>
> >> >> >>> >> For this same reason, I've only found a few collections where
> >> >> >>> >> BM25's doc length normalization is really significantly better
> >> >> >>> >> than Lucene's.
> >> >> >>> >> > >> >> >>> >> In my opinion, the results on a particular test > >> >> >>> collection or 2 have > >> >> >>> >> perhaps > >> >> >>> >> been taken too far and created a myth that BM25 is > >> >> >>> always superior to > >> >> >>> >> Lucene's scoring... this is not true! > >> >> >>> >> > >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov > >> >> >>> <iprov...@yahoo.com> > >> >> >>> >> wrote: > >> >> >>> >> > >> >> >>> >> I applied the Lucene patch mentioned in > >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and > >> >> >>> ran the MAP > >> >> >>> >>> numbers > >> >> >>> >>> on TREC-3 collection using topics > >> >> >>> 151-200. I am not getting worse > >> >> >>> >>> results > >> >> >>> >>> comparing to Lucene DefaultSimilarity. I > >> >> >>> suspect, I am not using it > >> >> >>> >>> correctly. I have single field > >> >> >>> documents. This is the process I use: > >> >> >>> >>> > >> >> >>> >>> 1. During the indexing, I am setting the > >> >> >>> similarity to BM25 as such: > >> >> >>> >>> > >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new > >> >> >>> StandardAnalyzer( > >> >> >>> >>> > >> >> >>> Version.LUCENE_CURRENT), true, > >> >> >>> >>> > >> >> >>> IndexWriter.MaxFieldLength.UNLIMITED); > >> >> >>> >>> writer.setSimilarity(new BM25Similarity()); > >> >> >>> >>> > >> >> >>> >>> 2. During the Precision/Recall measurements, I > >> >> >>> am using a > >> >> >>> >>> SimpleBM25QQParser extension I added to the > >> >> >>> benchmark: > >> >> >>> >>> > >> >> >>> >>> QualityQueryParser qqParser = new > >> >> >>> SimpleBM25QQParser("title", "TEXT"); > >> >> >>> >>> > >> >> >>> >>> > >> >> >>> >>> 3. 
> >> >> >>> >>> Here is the parser code (I set the avg doc length here):
> >> >> >>> >>>
> >> >> >>> >>>     public Query parse(QualityQuery qq) throws ParseException {
> >> >> >>> >>>         // average doc length for the collection
> >> >> >>> >>>         BM25Parameters.setAverageLength(indexField, 798.30f);
> >> >> >>> >>>         BM25Parameters.setB(0.5f);  // tried default values
> >> >> >>> >>>         BM25Parameters.setK1(2f);
> >> >> >>> >>>         return query = new BM25BooleanQuery(qq.getValue(qqName),
> >> >> >>> >>>             indexField,
> >> >> >>> >>>             new StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >> >>> >>>     }
> >> >> >>> >>>
> >> >> >>> >>> 4. The searcher is using BM25 similarity:
> >> >> >>> >>>
> >> >> >>> >>>     Searcher searcher = new IndexSearcher(dir, true);
> >> >> >>> >>>     searcher.setSimilarity(sim);
> >> >> >>> >>>
> >> >> >>> >>> Am I missing some steps? Does anyone have experience with
> >> >> >>> >>> this code?
> >> >> >>> >>>
> >> >> >>> >>> Thanks,
> >> >> >>> >>>
> >> >> >>> >>> Ivan
> >> >> >>> >>>
> >> >> >>> >>> ---------------------------------------------------------------------
> >> >> >>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> >> >>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> >> >>> >
> >> >> >>> > --
> >> >> >>> > -----------------------------------------------------------
> >> >> >>> > Joaquín Pérez Iglesias
> >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >> >> >>> > E.T.S.I. Informática (UNED)
> >> >> >>> > Ciudad Universitaria
> >> >> >>> > C/ Juan del Rosal nº 16
> >> >> >>> > 28040 Madrid - Spain
> >> >> >>> > Phone:
> >> >> >>> > +34 91 398 89 19
> >> >> >>> > Fax +34 91 398 65 35
> >> >> >>> > Office 2.11
> >> >> >>> > Email: joaquin.pe...@lsi.uned.es
> >> >> >>> > web: http://nlp.uned.es/~jperezi/
> >> >> >>> > -----------------------------------------------------------
> >> >> >>>
> >> >> >>> --
> >> >> >>> Robert Muir
> >> >> >>> rcm...@gmail.com
> > --
> > Robert Muir
> > rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
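[Editor's note: Robert's point above about mixed sub-collections (PAT averaging 3543 terms vs AP averaging 353) can be made concrete. The same document gets a noticeably different BM25 weight depending on which "average document length" you assume. This is a self-contained sketch using the generic BM25 formula with illustrative numbers (1948 is simply the midpoint of the two averages), not the patch's code.]

```java
// Sketch: how the assumed average document length shifts BM25 weights.
// A 353-term AP-style document scored against a pooled average (~1948,
// chosen purely for illustration) looks "short" and gets boosted relative
// to scoring it against its own sub-collection average (353).
public class AvgdlSketch {
    static double bm25Weight(double tf, double docLen, double avgDocLen,
                             double k1, double b, double idf) {
        double norm = k1 * ((1 - b) + b * (docLen / avgDocLen));
        return idf * (tf * (k1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        double tf = 3, k1 = 2.0, b = 0.75, idf = 1.0;
        double vsOwnAvg = bm25Weight(tf, 353, 353, k1, b, idf);
        double vsPooled = bm25Weight(tf, 353, 1948, k1, b, idf);
        System.out.println(vsOwnAvg); // 1.8  (length ratio is exactly 1.0)
        System.out.println(vsPooled); // ~2.39 (same doc, inflated weight)
    }
}
```

When the corpus mixes sub-collections with very different lengths, no single avgdl fits both, which is consistent with Robert's observation that BM25's length normalization can stop being an advantage in such cases.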