Hi Erick,

Thanks for the pointer. Sorry if the question was a bit unclear, but basically I'm looking for pointers on the actual mathematical functions or models to use (rather than the implementation). I'd be really interested to hear what others have used to solve this, since ideally I'd like a cutoff point optimised to the resultant score values.
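To make the question concrete, here is a rough sketch of two simple candidate functions, in plain Java with no Lucene dependency. The class name ScoreCutoff, both method names, and the parameters fraction and minRatio are made up for illustration and are not Lucene API. It assumes the scores are positive and already sorted best-first, as they would be for a top-N result list. The first method is the basic "fraction of the best score" filter; the second cuts at the largest drop between neighbouring scores on a log scale, which is only one of many possible ways to look for clustering.

```java
// Illustrative only: ScoreCutoff and its methods are hypothetical names, not Lucene API.
// Both methods assume positive scores sorted in descending order (best hit first).
final class ScoreCutoff {

    /** Number of leading hits whose score is at least fraction * best score. */
    static int relativeCutoff(float[] scores, float fraction) {
        if (scores.length == 0) {
            return 0;
        }
        float threshold = scores[0] * fraction;
        int keep = 0;
        while (keep < scores.length && scores[keep] >= threshold) {
            keep++;
        }
        return keep;
    }

    /**
     * Number of hits to keep when cutting at the largest drop between
     * neighbouring scores. The drop is measured on a log scale, so it is
     * judged by the ratio of the two scores rather than their difference.
     * A drop only counts if the ratio exceeds minRatio; if none does,
     * every hit is kept.
     */
    static int largestLogGapCutoff(float[] scores, double minRatio) {
        int keep = scores.length;              // default: no cutoff
        double bestGap = Math.log(minRatio);   // a gap must beat this to count
        for (int i = 1; i < scores.length; i++) {
            double gap = scores[i] > 0f
                    ? Math.log(scores[i - 1]) - Math.log(scores[i])
                    : Double.POSITIVE_INFINITY; // a drop to zero is unbounded
            if (gap > bestGap) {
                bestGap = gap;
                keep = i;
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Five hits between 0.9 and 0.7, the rest below 0.01.
        float[] scores = {0.9f, 0.85f, 0.8f, 0.75f, 0.7f, 0.009f, 0.008f, 0.005f};
        System.out.println(relativeCutoff(scores, 0.1f));
        System.out.println(largestLogGapCutoff(scores, 10.0));
    }
}
```

On that sample both methods keep the top 5 hits. The difference shows up when there is no clear break: the relative cutoff still depends on the hand-picked fraction, while the gap method keeps everything unless some neighbouring pair differs by more than minRatio.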
J

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: relevance function for scores
Date: Mon, 18 May 2009 09:13:27 -0400

Have you looked at TopDocCollector? Basically, you can tell it to return only the top N docs by score (N is arbitrary). What you then have is an array of raw score and doc ID pairs AND a max score. NOTE: "raw score" is not normalized, i.e. it is not guaranteed to be between 0 and 1.

So now you can examine the scores and put them in buckets any way you want; all you're doing is spinning through a small data structure performing some calculations.

HTH
Erick

On Mon, May 18, 2009 at 8:52 AM, Joel Halbert <j...@su3analytics.com> wrote:
> Hi,
>
> I'd like to apply a score filter. I realise that filtering by absolute
> (i.e. anything less than x) scores is pretty meaningless.
>
> In my case I want to filter based on relative score - or on some
> function of score which looks for clustering of documents around certain
> score values.
>
> Context: I have set up field boosts such that a query hit on one indexed
> field will, in theory, result in a score one or more orders of magnitude
> greater than a hit on some other field. So if I have 2 fields A and B
> and I'm really really interested in hits on A, and only interested in
> hits on B if there were none on A, I boost A by 1000, relative to B.
> The resultant score should reflect this.
>
> The ability to do this becomes important when we want to re-order the
> search results around some other field (not score) and are not
> interested in displaying the least relevant documents.
>
> It is an easy thing to write a basic 'document collector/result filter'
> that uses relative score information to filter out documents where any
> score is less than some magnitude of the best score, but I'm sure this
> could be more elegantly generalised into some mathematical
> "relevance/significance" model/function which could determine some
> optimal cutoff for documents based on the clustering of results around
> scores.
> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
> remaining 10 are less than 0.01 then we could sensibly take the top 5
> docs as most relevant.
>
> Has anyone experience of doing such a thing?
>
> Regards,
> Joel
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org