I don't think its really a competition, I think preferably we should have the flexibility to change the scoring model in lucene actually?
I have found lots of cases where VSM improves on BM25, but then again I don't work with TREC stuff, as I work with non-english collections. It doesn't contradict years of research to say that VSM isn't a state-of-the-art model, besides the TREC-4 results, there are CLEF results where VSM models perform competitively or exceed (Finnish, Russian, etc) BM25/DFR/etc. It depends on the collection, there isn't a 'best retrieval formula'. Note: I have no bias against BM-25, but its definitely a myth to say there is a single retrieval formula that is the 'best' across the board. On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS < joaquin.pe...@lsi.uned.es> wrote: > By the way, > > I don't want to start a flame war VSM vs BM25, but I really believe that I > have to express my opinion as Robert has done. In my experience, I have > never found a case where VSM improves significantly BM25. Maybe you can > find some cases under some very specific collection characteristics, (as > average length of 300 vs 3000) or a bad usage of BM25 (not proper > parameters) where it can happen. > > BM25 is not just only a different way of length normalization, it is based > strongly in the probabilistic framework, and parametrises frequencies and > length. This is probably the most successful ranking model of the last > years in Information Retrieval. > > I have never read a paper where VSM improves any of the state-of-the-art > ranking models (Language Models, DFR, BM25,...), although the VSM with > pivoted normalisation length can obtain nice results. This can be proved > checking the last years of the TREC competition. > > Honestly to say that is a myth that BM25 improves VSM breaks the last 10 > or 15 years of research on Information Retrieval, and I really believe > that is not accurate. > > The good thing of Information Retrieval is that you can always make your > owns experiments and you can use the experience of a lot of years of > research. > > PS: This opinion is based on experiments on TREC and CLEF collections, > obviously we can start a debate about the suitability of this type of > experimentation (concept of relevance, pooling, relevance judgements), but > this is a much more complex topic and I believe is far from what we are > dealing here. > > PS2: In relation with TREC4 Cornell used a pivoted length normalisation > and they were applying pseudo-relevance feedback, what honestly makes much > more difficult the analysis of the results. Obviously their results were > part of the pool. > > Sorry for the huge mail :-)))) > > > Hi Ivan, > > > > the problem is that unfortunately BM25 > > cannot be implemented overwriting > > the Similarity interface. Therefore BM25Similarity > > only computes the classic probabilistic IDF (what is > > interesting only at search time). > > If you set BM25Similarity at indexing time > > some basic stats are not stored > > correctly in the segments (like docs length). > > > > When you use BM25BooleanQuery this class > > will set automatically the BM25Similarity for you, > > therefore you don't need to do this explicitly. > > > > I tried to make this implementation with the focus on > > not interfering on the typical use of Lucene (so no changing > > DefaultSimilarity). > > > >> Joaquin, Robert, > >> > >> I followed Joaquin's recommendation and removed the call to set > >> similarity > >> to BM25 explicitly (indexer, searcher). The results showed 55% > >> improvement for the MAP score (0.141->0.219) over default similarity. > >> > >> Joaquin, how would setting the similarity to BM25 explicitly make the > >> score worse? > >> > >> Thank you, > >> > >> Ivan > >> > >> > >> > >> --- On Tue, 2/16/10, Robert Muir <rcm...@gmail.com> wrote: > >> > >>> From: Robert Muir <rcm...@gmail.com> > >>> Subject: Re: BM25 Scoring Patch > >>> To: java-user@lucene.apache.org > >>> Date: Tuesday, February 16, 2010, 11:36 AM > >>> yes Ivan, if possible please report > >>> back any findings you can on the > >>> experiments you are doing! > >>> > >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias > >>> < > >>> joaquin.pe...@lsi.uned.es> > >>> wrote: > >>> > >>> > Hi Ivan, > >>> > > >>> > You shouldn't set the BM25Similarity for indexing or > >>> searching. > >>> > Please try removing the lines: > >>> > writer.setSimilarity(new > >>> BM25Similarity()); > >>> > searcher.setSimilarity(sim); > >>> > > >>> > Please let us/me know if you improve your results with > >>> these changes. > >>> > > >>> > > >>> > Robert Muir escribió: > >>> > > >>> > Hi Ivan, I've seen many cases where BM25 > >>> performs worse than Lucene's > >>> >> default Similarity. Perhaps this is just another > >>> one? > >>> >> > >>> >> Again while I have not worked with this particular > >>> collection, I looked at > >>> >> the statistics and noted that its composed of > >>> several 'sub-collections': > >>> >> for > >>> >> example the PAT documents on disk 3 have an > >>> average doc length of 3543, > >>> >> but > >>> >> the AP documents on disk 1 have an avg doc length > >>> of 353. > >>> >> > >>> >> I have found on other collections that any > >>> advantages of BM25's document > >>> >> length normalization fall apart when 'average > >>> document length' doesn't > >>> >> make > >>> >> a whole lot of sense (cases like this). > >>> >> > >>> >> For this same reason, I've only found a few > >>> collections where BM25's doc > >>> >> length normalization is really significantly > >>> better than Lucene's. > >>> >> > >>> >> In my opinion, the results on a particular test > >>> collection or 2 have > >>> >> perhaps > >>> >> been taken too far and created a myth that BM25 is > >>> always superior to > >>> >> Lucene's scoring... this is not true! > >>> >> > >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov > >>> <iprov...@yahoo.com> > >>> >> wrote: > >>> >> > >>> >> I applied the Lucene patch mentioned in > >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and > >>> ran the MAP > >>> >>> numbers > >>> >>> on TREC-3 collection using topics > >>> 151-200. I am not getting worse > >>> >>> results > >>> >>> comparing to Lucene DefaultSimilarity. I > >>> suspect, I am not using it > >>> >>> correctly. I have single field > >>> documents. This is the process I use: > >>> >>> > >>> >>> 1. During the indexing, I am setting the > >>> similarity to BM25 as such: > >>> >>> > >>> >>> IndexWriter writer = new IndexWriter(dir, new > >>> StandardAnalyzer( > >>> >>> > >>> Version.LUCENE_CURRENT), true, > >>> >>> > >>> IndexWriter.MaxFieldLength.UNLIMITED); > >>> >>> writer.setSimilarity(new BM25Similarity()); > >>> >>> > >>> >>> 2. During the Precision/Recall measurements, I > >>> am using a > >>> >>> SimpleBM25QQParser extension I added to the > >>> benchmark: > >>> >>> > >>> >>> QualityQueryParser qqParser = new > >>> SimpleBM25QQParser("title", "TEXT"); > >>> >>> > >>> >>> > >>> >>> 3. Here is the parser code (I set an avg doc > >>> length here): > >>> >>> > >>> >>> public Query parse(QualityQuery qq) throws > >>> ParseException { > >>> >>> BM25Parameters.setAverageLength(indexField, > >>> 798.30f);//avg doc length > >>> >>> BM25Parameters.setB(0.5f);//tried > >>> default values > >>> >>> BM25Parameters.setK1(2f); > >>> >>> return query = new > >>> BM25BooleanQuery(qq.getValue(qqName), indexField, > >>> >>> new > >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT)); > >>> >>> } > >>> >>> > >>> >>> 4. The searcher is using BM25 similarity: > >>> >>> > >>> >>> Searcher searcher = new IndexSearcher(dir, > >>> true); > >>> >>> searcher.setSimilarity(sim); > >>> >>> > >>> >>> Am I missing some steps? Does anyone > >>> have experience with this code? > >>> >>> > >>> >>> Thanks, > >>> >>> > >>> >>> Ivan > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> --------------------------------------------------------------------- > >>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org > >>> >>> > >>> >>> > >>> >>> > >>> >> > >>> >> > >>> > -- > >>> > > >>> ----------------------------------------------------------- > >>> > Joaquín Pérez Iglesias > >>> > Dpto. Lenguajes y Sistemas Informáticos > >>> > E.T.S.I. Informática (UNED) > >>> > Ciudad Universitaria > >>> > C/ Juan del Rosal nº 16 > >>> > 28040 Madrid - Spain > >>> > Phone. +34 91 398 89 19 > >>> > Fax +34 91 398 65 35 > >>> > Office 2.11 > >>> > Email: joaquin.pe...@lsi.uned.es > >>> > web: http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>< > http://nlp.uned.es/%7Ejperezi/> > >>> > > >>> ----------------------------------------------------------- > >>> > > >>> > > >>> > > >>> --------------------------------------------------------------------- > >>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >>> > For additional commands, e-mail: java-user-h...@lucene.apache.org > >>> > > >>> > > >>> > >>> > >>> -- > >>> Robert Muir > >>> rcm...@gmail.com > >>> > >> > >> > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >> > >> > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com