Woops. Got that backwards. It should read

> if (score[n] / score[n-1]) < c / (boost_factor)

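For what it's worth, here is a rough sketch of that corrected check in plain Java. It works on a bare array of scores and does not touch any Lucene API. The class name `RelativeScoreCutoff`, the method `cutoff`, and its parameters are made up for illustration, and it assumes the scores are non-negative and already sorted in descending order (it throws if they are not).

```java
final class RelativeScoreCutoff {

    /**
     * Returns how many leading hits to keep. The scan stops at the first n
     * where score[n] / score[n-1] drops below c / boostFactor, so hits
     * 0..n-1 are kept. If no such drop is found, every hit is kept.
     * All names here are hypothetical; this is not part of Lucene.
     */
    static int cutoff(float[] scores, double c, double boostFactor) {
        double threshold = c / boostFactor;
        for (int n = 1; n < scores.length; n++) {
            if (scores[n] > scores[n - 1]) {
                throw new IllegalArgumentException(
                        "scores must be sorted in descending order");
            }
            // Skip the ratio test when the previous score is zero to avoid 0/0.
            if (scores[n - 1] > 0f) {
                double ratio = (double) scores[n] / scores[n - 1];
                if (ratio < threshold) {
                    return n;
                }
            }
        }
        return scores.length;
    }
}
```

With scores shaped like the ones in the example (a handful between 0.9 and 0.7, the rest below 0.01) and boostFactor = 1000, whether the drop is detected depends on c, so it is worth experimenting with a few values.
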
On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farh...@gmail.com> wrote:
> How about determining the cutoff by measuring the percentage
> difference between successive scores: if the score drops by a
> threshold amount then you've hit the cutoff. In the example you
> mention, you might want to try something like c/1000, where 1 < c < 25
> is a constant (experiment to find a sweet spot for c).
>
> I.e. something like
>
> if (score[n-1] / score[n]) < c / (boost_factor),
>
> then you've reached your cutoff at the (n-1)th hit
> (where boost_factor=1000 in your example).
>
> One thing to check is that the scores are indeed sorted in descending
> order to begin with. For example, I don't think the hits in
> TopDocCollector and its brethren are strictly ordered this way (no?).
>
> -Babak
>
> On Mon, May 18, 2009 at 6:52 AM, Joel Halbert <j...@su3analytics.com> wrote:
>> Hi,
>>
>> I'd like to apply a score filter. I realise that filtering by absolute
>> (i.e. anything less than x) scores is pretty meaningless.
>>
>> In my case I want to filter based on relative score - or on some
>> function of score which looks for clustering of documents around certain
>> score values.
>>
>> Context: I have set up field boosts such that a query hit on one indexed
>> field will, in theory, result in a score one or more orders of magnitude
>> greater than a hit on some other field. So if I have 2 fields A and B
>> and I'm really really interested in hits on A, and only interested in
>> hits on B if there were none on A, I boost A by 1000, relative to B.
>> The resultant score should reflect this.
>>
>> The ability to do this becomes important when we want to re-order the
>> search results around some other field (not score) and are not
>> interested in displaying the least relevant documents.
>>
>> It is an easy thing to write a basic 'document collector/result filter'
>> that uses relative score information to filter out documents where any
>> score is less than some magnitude of the best score, but I'm sure this
>> could be more elegantly generalised into some mathematical
>> "relevance/significance" model/function which could determine some
>> optimal cutoff for documents based on the clustering of results around
>> scores.
>> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
>> remaining 10 are less than 0.01 then we could sensibly take the top 5
>> docs as most relevant.
>>
>> Has anyone experience of doing such a thing?
>>
>> Regards,
>> Joel
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
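
For comparison, the basic filter described in the original question (drop any hit whose score is below some fraction of the best score) might look roughly like this in plain Java. Again, this is only a sketch over a bare score array, not a Lucene collector; `BestScoreFilter`, `keepIndices`, and `minFraction` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class BestScoreFilter {

    /**
     * Returns the positions of all hits whose score is at least
     * minFraction times the best score. The input does not need to be
     * sorted. Names are illustrative only; this is not a Lucene API.
     */
    static List<Integer> keepIndices(float[] scores, double minFraction) {
        // First pass: find the best (highest) score.
        float best = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        // Second pass: keep every hit within the allowed fraction of the best.
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= best * minFraction) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Unlike the successive-ratio check, this one compares every hit against the single best score, so it does not depend on the hits being sorted, but it also cannot tell a gradual decline from a sharp drop between two clusters.
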