Yes Ivan, if possible please report back any findings from the experiments you are doing!
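
To make the point about average document length concrete, here is a minimal sketch of the textbook Okapi BM25 term-frequency component. It is not code from the LUCENE-2091 patch. It uses the parameter values and lengths that appear in the quoted thread (k1 = 2, b = 0.5, avg doc length 798.30, sub-collection averages 353 and 3543). The class name, the method name, and the term frequency of 3 are assumptions made only for illustration, and IDF is left out.

```java
public class Bm25LengthNorm {

    // Textbook BM25 term-frequency component, IDF omitted:
    //   tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
    public static double tfComponent(double tf, double docLen, double avgDocLen,
                                     double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 2.0;        // value set in the quoted parser code
        double b = 0.5;         // value set in the quoted parser code
        double avgdl = 798.30;  // collection-wide average from the quoted parser code
        double tf = 3.0;        // hypothetical term frequency, for illustration only

        // Typical lengths of the two sub-collections mentioned in the thread.
        double[] docLens = {353.0, 3543.0};
        for (double dl : docLens) {
            System.out.printf("docLen=%.0f tfComponent=%.4f%n",
                    dl, tfComponent(tf, dl, avgdl, k1, b));
        }
    }
}
```

With a single collection-wide average, a typical AP-length document is treated as "short" and a typical PAT-length document as "long" for the same term frequency, even though each is ordinary within its own sub-collection.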
On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <joaquin.pe...@lsi.uned.es> wrote:

> Hi Ivan,
>
> You shouldn't set the BM25Similarity for indexing or searching.
> Please try removing the lines:
>
>     writer.setSimilarity(new BM25Similarity());
>     searcher.setSimilarity(sim);
>
> Please let us/me know if you improve your results with these changes.
>
> Robert Muir wrote:
>
>> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
>> default Similarity. Perhaps this is just another one?
>>
>> Again, while I have not worked with this particular collection, I looked
>> at the statistics and noted that it's composed of several
>> 'sub-collections': for example, the PAT documents on disk 3 have an
>> average doc length of 3543, but the AP documents on disk 1 have an avg
>> doc length of 353.
>>
>> I have found on other collections that any advantages of BM25's document
>> length normalization fall apart when 'average document length' doesn't
>> make a whole lot of sense (cases like this).
>>
>> For this same reason, I've only found a few collections where BM25's doc
>> length normalization is really significantly better than Lucene's.
>>
>> In my opinion, the results on a particular test collection or two have
>> perhaps been taken too far and created a myth that BM25 is always
>> superior to Lucene's scoring... this is not true!
>>
>> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <iprov...@yahoo.com> wrote:
>>
>>> I applied the Lucene patch mentioned in
>>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
>>> numbers on the TREC-3 collection using topics 151-200. I am now getting
>>> worse results compared to Lucene's DefaultSimilarity. I suspect I am
>>> not using it correctly. I have single-field documents. This is the
>>> process I use:
>>>
>>> 1. During indexing, I am setting the similarity to BM25 as such:
>>>
>>>     IndexWriter writer = new IndexWriter(dir,
>>>         new StandardAnalyzer(Version.LUCENE_CURRENT), true,
>>>         IndexWriter.MaxFieldLength.UNLIMITED);
>>>     writer.setSimilarity(new BM25Similarity());
>>>
>>> 2. During the Precision/Recall measurements, I am using a
>>> SimpleBM25QQParser extension I added to the benchmark:
>>>
>>>     QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>>>
>>> 3. Here is the parser code (I set an avg doc length here):
>>>
>>>     public Query parse(QualityQuery qq) throws ParseException {
>>>         BM25Parameters.setAverageLength(indexField, 798.30f); //avg doc length
>>>         BM25Parameters.setB(0.5f); //tried default values
>>>         BM25Parameters.setK1(2f);
>>>         return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>>>             new StandardAnalyzer(Version.LUCENE_CURRENT));
>>>     }
>>>
>>> 4. The searcher is using BM25 similarity:
>>>
>>>     Searcher searcher = new IndexSearcher(dir, true);
>>>     searcher.setSimilarity(sim);
>>>
>>> Am I missing some steps? Does anyone have experience with this code?
>>>
>>> Thanks,
>>>
>>> Ivan
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> -----------------------------------------------------------
> Joaquín Pérez Iglesias
> Dpto. Lenguajes y Sistemas Informáticos
> E.T.S.I. Informática (UNED)
> Ciudad Universitaria
> C/ Juan del Rosal nº 16
> 28040 Madrid - Spain
> Phone. +34 91 398 89 19
> Fax +34 91 398 65 35
> Office 2.11
> Email: joaquin.pe...@lsi.uned.es
> web: http://nlp.uned.es/~jperezi/
> -----------------------------------------------------------

--
Robert Muir
rcm...@gmail.com