We are looking into improving the relevance ranking of our Lucene-based search platform. We started by running the TREC Ad Hoc task on the TREC-3 data using "out-of-the-box" Lucene. We chose this older TREC-3 data (TIPSTER Disk 1 and Disk 2; topics 151-200) because its content closely matches the content of our production system.
We are currently getting an average precision of 0.14. We found some formatting issues with the TREC-3 topics that were producing an even lower score: the initial average precision was 0.09. We discovered that the topics include the word "Topic:" in the <title> tag, for example "<title> Topic: Coping with overcrowded prisons". By removing this term from the queries, we bumped the average precision to 0.14.

Our queries are built from the <title> tag of each topic, and the searched index field is built from the <TEXT> tag of each document:

    QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");

Is there an average precision number that "out-of-the-box" Lucene should be close to? For example, IBM's TREC 2007 paper mentions 0.154: http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

Thank you,
Ivan
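
P.S. In case the full setup helps, below is roughly the driver we run, built on the contrib/benchmark quality package (Lucene 2.9/3.x style API). The file paths, the "docname" field, the index location, and the run name are placeholders for our local configuration, and the "Topic:" stripping is our own preprocessing step rather than anything the stock parser does:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.benchmark.quality.Judge;
    import org.apache.lucene.benchmark.quality.QualityBenchmark;
    import org.apache.lucene.benchmark.quality.QualityQuery;
    import org.apache.lucene.benchmark.quality.QualityQueryParser;
    import org.apache.lucene.benchmark.quality.QualityStats;
    import org.apache.lucene.benchmark.quality.trec.TrecJudge;
    import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
    import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
    import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class TrecBaseline {
      public static void main(String[] args) throws Exception {
        PrintWriter logger = new PrintWriter(System.out, true);

        // Read TREC-3 topics 151-200.
        TrecTopicsReader topicsReader = new TrecTopicsReader();
        QualityQuery[] qqs = topicsReader.readQueries(
            new BufferedReader(new FileReader("topics.151-200")));

        // Our preprocessing: drop the leading "Topic:" token from each title.
        // This is the step that moved average precision from 0.09 to 0.14.
        for (int i = 0; i < qqs.length; i++) {
          String title = qqs[i].getValue("title").replaceFirst("^\\s*Topic:\\s*", "");
          Map<String, String> vals = new HashMap<String, String>();
          vals.put("title", title);
          qqs[i] = new QualityQuery(qqs[i].getQueryID(), vals);
        }

        // TREC relevance judgments for the same topics.
        Judge judge = new TrecJudge(new BufferedReader(new FileReader("qrels.151-200")));
        judge.validateData(qqs, logger);

        // Build queries from the topic "title" and search the indexed "TEXT" field.
        QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");

        // "docname" stands for whatever stored field holds the TREC DOCNO in our index.
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("trec-index")));
        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docname");

        // Write a TREC-format submission file and run the benchmark.
        PrintWriter submitWriter = new PrintWriter(new FileWriter("submission.txt"));
        SubmissionReport submitLog = new SubmissionReport(submitWriter, "lucene-oob");
        QualityStats[] stats = qrun.execute(judge, submitLog, logger);
        submitWriter.flush();
        submitWriter.close();

        // Average over all topics; the MAP value here is the number we quoted above.
        QualityStats avg = QualityStats.average(stats);
        avg.log("SUMMARY", 2, logger, "  ");
      }
    }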