I seems to be a good solution.
However, I think it may takes some processing time to get the
distribution of all matching documents before scoring each docs.
Would you have a good idea to get the distributions less than some
reasonable time?
On 2009. 05. 26, at 오후 8:15, Joel Halbert wrote:
Yes, something like this might work, although rather than having a
cutoff determined by the difference between two successive document
scores (Doc(n) and Doc(n-1)) I was thinking of using a function which
looked at the distribution of the scores of all matching documents.
Since I just want to exclude outliers it might be a simple case of
dropping those which have a score of less than -x standard
deviations or
more. The more the graph was positively skewed the less confidence we
would have in those documents in the tail.
Since such a function would be a function of all documents the
ordering of docs in a collector would not be relevant.
The main purpose for applying such a filter was the need to allow
users to pivot the search results by some field other than the
natural ordering by score.
In this case we only want to show the most relevant results.
-----Original Message-----
From: Babak Farhang <farh...@gmail.com>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: relevance function for scores
Date: Mon, 25 May 2009 16:11:32 -0600
Woops. Got that backwards.. should read
if (score[n] / score[n-1]) < c / (boost_factor)
On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farh...@gmail.com>
wrote:
How about determining the cutoff by measuring the percentage
difference between successive scores: if the score drops by a
threshold amount then you've hit the cutoff. In the example you
mention, you might want to try something like c/1000, where 1 < c <
25
is a constant (experiment to find a sweet spot for c).
I.e. something like
if (score[n-1] / score[n) < c / (boost_factor) ,
then you've reached your cutoff at the n-1th hit
(where boost_factor=1000 in your example).
One thing to check is that the scores are indeed sorted in descending
order to begin with. For example, I don't think the hits in
TopDocCollector and its brethren are strictly ordered this way (no?).
-Babak
On Mon, May 18, 2009 at 6:52 AM, Joel Halbert
<j...@su3analytics.com> wrote:
Hi,
I'd like to apply a score filter. I realise that filtering by
absolute
(i.e. anything less than x) scores is pretty meaningless.
In my case I want to filter based on relative score - or on some
function of score which looks for clustering of documents around
certain
score values.
Context: I have set up field boosts such that a query hit on one
indexed
field will, in theory, result in a score one or more order of
magnitudes
greater than a hit on some other field. So if I have 2 fields A
and B
and I'm really really interested in hits on A, and only interested
in
hits on B if there were none on A, I boost A by 1000, relative to
B.
The resultant score should reflect this.
The ability to do this becomes important when we want to re-order
the
search results around some other field (not score) and are not
interested in displaying the least relevant documents.
It is an easy thing to write a basic 'document collector/result
filter'
that uses relative score information to filter out documents where
any
score is less than some magnitude of the best score, but I'm sure
this
could be more elegantly generalised into some mathematical
"relevance/significance" model/function which could determine some
optimal cutoff for documents based on the clustering of results
around
scores.
e.g. if my top 5 documents are all between score 0.9 and 0.7 and the
remaining 10 are less than 0.01 then we could sensibly take the
top 5
docs as most relevant.
Has anyone experience of doing such a thing?
Regards,
Joel
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org