Relevancy judgement lists ARE very context sensitive. For example, in a medical search application you'll have very different relevancy requirements between a point-of-care application and an application being used to perform general "sit at your desk" research, *even if the content being served is identical*.
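To make that concrete, here's a toy sketch of the same documents being scored two different ways (the field names and boost values are invented for illustration, not taken from any real system):

    import math
    from datetime import datetime, timezone

    def point_of_care_score(doc, text_score):
        # Point-of-care: heavily reward recency and specific treatment guidance.
        age_days = (datetime.now(timezone.utc) - doc["published"]).days
        recency_boost = math.exp(-age_days / 365.0)   # decays over roughly a year
        specificity_boost = 2.0 if doc["type"] == "treatment_guideline" else 1.0
        return text_score * specificity_boost * (1.0 + recency_boost)

    def desk_research_score(doc, text_score):
        # Sit-at-your-desk research: favor broad, well-cited background
        # material; recency barely matters.
        citation_boost = math.log1p(doc["citation_count"])
        return text_score * (1.0 + 0.1 * citation_boost)

    # Same document, same text-match score, very different final rankings:
    doc = {"published": datetime(2014, 5, 1, tzinfo=timezone.utc),
           "type": "treatment_guideline",
           "citation_count": 4}
    print(point_of_care_score(doc, text_score=1.2))
    print(desk_research_score(doc, text_score=1.2))

Same corpus, same query, different relevancy rules -- which is exactly why the judgement lists for the two applications end up different.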
Point-of-care is about getting to a solution fast. It's targeted. Recency may be more of a factor. Specific solutions to medical problems may be more important. Sit-at-your-desk research may be more about futzing around with general knowledge and more about the "discovery" aspect of search. Even IF the data sets for the two applications were 100% identical, you would almost certainly provide different relevancy rules based on the different use cases.

We do a lot of testing with judgement lists (mostly through our product Quepid <http://quepid.com>, but there are other home-grown scripted tools people use too). Judgement lists are great for collaborating closely with your client on what you expect search to do -- i.e. capturing informal use cases. They let the client make assertions about what the correct order of search results should be, which allows you to optimize for a reasonable set of use cases (there's a rough sketch of what this looks like in code at the bottom of this message). We've had it work well as long as the cases are representative in nature. For example, in a name search application you don't need both a "D. Turnbull" and a "Y. Seeley" to test the "first initial/last name" case. You often just need one exemplar to test and work against to prove you've solved (and continue to solve) that problem.

Judgement lists based on "experts" tend to break down occasionally when the person you're collaborating with doesn't actually reflect the behavior of real users. So we'll also work on relevancy in the context of judgement lists generated programmatically from user behavior (i.e. query logs), not just what the expert says. That's more integration work and requires more data, but it's potentially more beneficial for relevancy tuning.

We blog a fair amount about relevancy pre-production and regression testing. You can read more here <http://www.opensourceconnections.com/2013/10/21/search-quality-is-about-effective-collaboration/>, here <http://www.opensourceconnections.com/blog/2014/06/10/what-is-search-relevancy/>, and here <http://www.opensourceconnections.com/2013/10/14/what-is-test-driven-search-relevancy/>.

Hope it's helpful to you. Good luck,

-Doug
Search Relevancy Consultant
OpenSource Connections

On Thu, Jun 12, 2014 at 1:47 PM, Ivan Brusic <i...@brusic.com> wrote:
> Perhaps more of an NLP question, but are there any tests regarding
> relevance for Lucene? Given an example corpus of documents, what are the
> golden sets for specific queries? The Wikipedia dump is used as a
> benchmarking tool for both indexing and querying in Lucene, but there are
> no metrics in terms of precision.
>
> The Open Relevance project was closed yesterday (
> http://lucene.apache.org/openrelevance/), which is what prompted me to ask
> this question. Was the sub-project closed because others have found
> alternate solutions?
>
> Relevancy is of course extremely context-dependent and subjective, but my
> hope is that there is an example catalog somewhere with defined golden
> sets.
>
> Cheers,
>
> Ivan

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections <http://o19s.com>
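P.S. Here's the rough judgement-list sketch mentioned above. It's a toy example -- not Quepid's actual format or API, and the queries and doc ids are made up -- just to show how graded judgements turn into a regression test you can re-run as you tune:

    JUDGEMENTS = {
        # query -> {doc_id: grade}; 3 = exactly right, 0 = wrong.
        # Grades can come from an expert, or be derived programmatically
        # from click counts in your query logs.
        "d. turnbull": {"author:42": 3, "author:7": 0},
        "heart attack treatment": {"doc:981": 3, "doc:12": 2, "doc:55": 0},
    }

    def precision_at_k(query, returned_doc_ids, k=5):
        # Fraction of the top k results judged relevant (grade >= 2).
        judged = JUDGEMENTS.get(query, {})
        top_k = returned_doc_ids[:k]
        if not top_k:
            return 0.0
        relevant = sum(1 for doc_id in top_k if judged.get(doc_id, 0) >= 2)
        return relevant / float(len(top_k))

    def run_regression(search_fn, threshold=0.6):
        # search_fn(query) -> ordered doc ids; a hypothetical hook into
        # whatever engine you're tuning (Solr, Elasticsearch, raw Lucene, ...).
        failures = []
        for query in JUDGEMENTS:
            score = precision_at_k(query, search_fn(query))
            if score < threshold:
                failures.append((query, score))
        return failures

Run run_regression after every relevancy change; any query that drops below the threshold is a use case you've regressed on.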