Re: relevance function for scores

kenny kim Wed, 27 May 2009 03:19:17 -0700

I seems to be a good solution.

However, I think it may takes some processing time to get thedistribution of all matching documents before scoring each docs.

Would you have a good idea to get the distributions less than somereasonable time?



On 2009. 05. 26, at 오후 8:15, Joel Halbert wrote:

Yes, something like this might work, although rather than having a
cutoff determined by the difference between two successive document
scores (Doc(n) and Doc(n-1)) I was thinking of using a function which
looked at the distribution of the scores of all matching documents.
Since I just want to exclude outliers it might be a simple case of
dropping those which have a score of less than -x standarddeviations or
more. The more the graph was positively skewed the less confidence we
would have in those documents in the tail.
Since such a function would be a function of all documents theordering of docs in a collector would not be relevant.The main purpose for applying such a filter was the need to allowusers to pivot the search results by some field other than thenatural ordering by score.
In this case we only want to show the most relevant results.


-----Original Message-----
From: Babak Farhang <farh...@gmail.com>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: relevance function for scores
Date: Mon, 25 May 2009 16:11:32 -0600

Woops. Got that backwards.. should read
if (score[n]  / score[n-1])  < c / (boost_factor)
On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farh...@gmail.com>wrote:
How about determining the cutoff by measuring the percentage
difference between successive scores: if the score drops by a
threshold amount then you've hit the cutoff.  In the example you
mention, you might want to try something like c/1000, where 1 < c <25
is a constant (experiment to find a sweet spot for c).

I.e. something like

if (score[n-1]  / score[n)  < c / (boost_factor) ,

then you've reached your cutoff at the n-1th hit
(where boost_factor=1000 in your example).

One thing to check is that the scores are indeed sorted in descending
order to begin with.  For example, I don't think the hits in
TopDocCollector and its brethren are strictly ordered this way (no?).

-Babak
On Mon, May 18, 2009 at 6:52 AM, Joel Halbert<j...@su3analytics.com> wrote:
Hi,
I'd like to apply a score filter. I realise that filtering byabsolute
(i.e. anything less than x) scores is pretty meaningless.

In my case I want to filter based on relative score - or on some
function of score which looks for clustering of documents aroundcertain
score values.
Context: I have set up field boosts such that a query hit on oneindexedfield will, in theory, result in a score one or more order ofmagnitudesgreater than a hit on some other field. So if I have 2 fields Aand Band I'm really really interested in hits on A, and only interestedinhits on B if there were none on A, I boost A by 1000, relative toB.
The resultant score should reflect this.
The ability to do this becomes important when we want to re-orderthe
search results around some other field (not score) and are not
interested in displaying the least relevant documents.
It is an easy thing to write a basic 'document collector/resultfilter'that uses relative score information to filter out documents whereanyscore is less than some magnitude of the best score, but I'm surethis
could be more elegantly generalised into some mathematical
"relevance/significance" model/function  which could determine some
optimal cutoff for documents based on the clustering of resultsaround
scores.
e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
remaining 10 are less than 0.01 then we could sensibly take thetop 5
docs as most relevant.

Has anyone experience of doing such a thing?


Regards,
Joel



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: relevance function for scores

Reply via email to