Re: relevance function for scores

Joel Halbert Wed, 27 May 2009 06:28:42 -0700

I'm not certain, without testing it.

I think you and I may have slightly orthogonal needs. From what I gather
you are looking to speed up your search time (by filtering out
irrelevant results), whereas I am simply looking to increase the
relevancy of the results presented to the users when they group (and
re-order) the results by some field other than the score (by removing
the least relevant results from view). In doing so I accept that the
time taken to fetch results may increase as compared to a vanilla
search.
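
For reference, the standard-deviation cutoff described in my earlier message (quoted below) could be sketched roughly as follows. This is only an illustration: the class and method names are made up, it is not a Lucene API, and it assumes the scores of all matching documents have already been gathered into an array.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: drop hits whose score lies more than x standard deviations below the mean. */
public class StdDevScoreFilter {

    /** Returns the indices of the scores that are >= mean - x * stdDev. */
    public static List<Integer> keep(float[] scores, double x) {
        List<Integer> kept = new ArrayList<>();
        if (scores.length == 0) {
            return kept;
        }
        // First pass: mean of all scores.
        double sum = 0;
        for (float s : scores) {
            sum += s;
        }
        double mean = sum / scores.length;
        // Second pass: population standard deviation.
        double squares = 0;
        for (float s : scores) {
            squares += (s - mean) * (s - mean);
        }
        double stdDev = Math.sqrt(squares / scores.length);
        // Keep everything at or above the cutoff; order of the input does not matter.
        double cutoff = mean - x * stdDev;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= cutoff) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

The value of x would need experimenting with. In a strongly bimodal result set the low-scoring cluster can sit within one standard deviation of the mean, so a small x may be needed before anything is dropped.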



-----Original Message-----
From: kenny kim <goalw...@snu.ac.kr>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: relevance function for scores
Date: Wed, 27 May 2009 19:18:39 +0900

It seems to be a good solution.
However, I think it may take some processing time to get the
distribution of all matching documents before scoring each doc.

Do you have a good idea for computing the distribution within a
reasonable time?
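
One possible way to keep the cost down, offered only as a sketch, is to accumulate the mean and standard deviation in a single pass while the hits are being collected (Welford's online algorithm), so that no extra pass over the index is needed. The class below is hypothetical and not part of Lucene; it assumes add() is called once per matching document with that document's score.

```java
/** Sketch: running mean and standard deviation (Welford's online algorithm). */
public class RunningScoreStats {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the running mean

    /** Call once per collected hit. */
    public void add(double score) {
        count++;
        double delta = score - mean;
        mean += delta / count;
        // Uses the updated mean, which keeps the accumulation numerically stable.
        m2 += delta * (score - mean);
    }

    public long count() {
        return count;
    }

    public double mean() {
        return mean;
    }

    /** Population standard deviation; 0 when no scores have been added. */
    public double stdDev() {
        return count == 0 ? 0.0 : Math.sqrt(m2 / count);
    }
}
```

The cutoff itself would still have to be applied after collection has finished, since the final mean and deviation are only known once every matching document has been seen.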


On 2009. 05. 26, at 8:15 PM, Joel Halbert wrote:

> Yes, something like this might work, although rather than having a
> cutoff determined by the difference between two successive document
> scores (Doc(n) and Doc(n-1)) I was thinking of using a function which
> looked at the distribution of the scores of all matching documents.
> Since I just want to exclude outliers it might be a simple case of
> dropping those whose score is x or more standard deviations below
> the mean. The more the graph was positively skewed the less
> confidence we would have in those documents in the tail.
>
> Since such a function would be a function of all documents, the
> ordering of docs in a collector would not be relevant.
> The main purpose of applying such a filter is to allow users to
> pivot the search results by some field other than the natural
> ordering by score.
> In this case we only want to show the most relevant results.
>
>
> -----Original Message-----
> From: Babak Farhang <farh...@gmail.com>
> Reply-To: java-user@lucene.apache.org
> To: java-user@lucene.apache.org
> Subject: Re: relevance function for scores
> Date: Mon, 25 May 2009 16:11:32 -0600
>
> Whoops. Got that backwards; it should read
>
>> if (score[n]  / score[n-1])  < c / (boost_factor)
>
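
A minimal sketch of the corrected ratio test above, assuming the scores are already sorted in descending order and are positive. The class and method names are invented for illustration and are not Lucene APIs; c and boostFactor are the constants from the message.

```java
/** Sketch: find the cutoff where the score drops sharply between successive hits. */
public class RatioCutoff {

    /**
     * Returns how many leading hits to keep. Hit n starts the discarded tail
     * when score[n] / score[n-1] < c / boostFactor.
     */
    public static int hitsToKeep(float[] scores, double c, double boostFactor) {
        double threshold = c / boostFactor;
        for (int n = 1; n < scores.length; n++) {
            // Same test as the ratio, written as a product to avoid dividing by zero.
            if (scores[n] < threshold * scores[n - 1]) {
                return n;
            }
        }
        return scores.length; // no sharp drop found, keep everything
    }
}
```

With c = 20 and boostFactor = 1000 the threshold is 0.02, so a drop from 0.7 to 0.009 between successive hits would trigger the cutoff, while a drop from 0.9 to 0.85 would not.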
>
> On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farh...@gmail.com> wrote:
>> How about determining the cutoff by measuring the percentage
>> difference between successive scores: if the score drops by a
>> threshold amount then you've hit the cutoff.  In the example you
>> mention, you might want to try something like c/1000, where
>> 1 < c < 25 is a constant (experiment to find a sweet spot for c).
>>
>> I.e. something like
>>
>> if (score[n-1]  / score[n])  < c / (boost_factor) ,
>>
>> then you've reached your cutoff at the n-1th hit
>> (where boost_factor=1000 in your example).
>>
>> One thing to check is that the scores are indeed sorted in descending
>> order to begin with.  For example, I don't think the hits in
>> TopDocCollector and its brethren are strictly ordered this way (no?).
>>
>> -Babak
>>
>> On Mon, May 18, 2009 at 6:52 AM, Joel Halbert <j...@su3analytics.com> wrote:
>>> Hi,
>>>
>>> I'd like to apply a score filter. I realise that filtering by
>>> absolute (i.e. anything less than x) scores is pretty meaningless.
>>>
>>> In my case I want to filter based on relative score - or on some
>>> function of score which looks for clustering of documents around
>>> certain score values.
>>>
>>> Context: I have set up field boosts such that a query hit on one
>>> indexed field will, in theory, result in a score one or more orders
>>> of magnitude greater than a hit on some other field. So if I have 2
>>> fields A and B and I'm really really interested in hits on A, and
>>> only interested in hits on B if there were none on A, I boost A by
>>> 1000, relative to B. The resultant score should reflect this.
>>>
>>> The ability to do this becomes important when we want to re-order
>>> the search results around some other field (not score) and are not
>>> interested in displaying the least relevant documents.
>>>
>>>
>>> It is an easy thing to write a basic 'document collector/result
>>> filter' that uses relative score information to filter out
>>> documents whose score is less than some fraction of the best score,
>>> but I'm sure this could be more elegantly generalised into some
>>> mathematical "relevance/significance" model/function which could
>>> determine some optimal cutoff for documents based on the clustering
>>> of results around scores.
>>> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
>>> remaining 10 are less than 0.01 then we could sensibly take the
>>> top 5 docs as most relevant.
>>>
>>> Has anyone experience of doing such a thing?
>>>
>>>
>>> Regards,
>>> Joel
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>