I'm not certain, without testing it. I think you and I may have slightly orthogonal needs. From what I gather you are looking to speed up your search time (by filtering out irrelevant results), whereas I am simply looking to increase the relevancy of the results presented to the users when they group (and re-order) the results by some field other than the score (by removing the least relevant results from view). In doing so I accept that the time taken to fetch results may increase as compared to a vanilla search.
-----Original Message----- From: kenny kim <goalw...@snu.ac.kr> Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: relevance function for scores Date: Wed, 27 May 2009 19:18:39 +0900 I seems to be a good solution. However, I think it may takes some processing time to get the distribution of all matching documents before scoring each docs. Would you have a good idea to get the distributions less than some reasonable time? On 2009. 05. 26, at 오후 8:15, Joel Halbert wrote: > Yes, something like this might work, although rather than having a > cutoff determined by the difference between two successive document > scores (Doc(n) and Doc(n-1)) I was thinking of using a function which > looked at the distribution of the scores of all matching documents. > Since I just want to exclude outliers it might be a simple case of > dropping those which have a score of less than -x standard > deviations or > more. The more the graph was positively skewed the less confidence we > would have in those documents in the tail. > > Since such a function would be a function of all documents the > ordering of docs in a collector would not be relevant. > The main purpose for applying such a filter was the need to allow > users to pivot the search results by some field other than the > natural ordering by score. > In this case we only want to show the most relevant results. > > > -----Original Message----- > From: Babak Farhang <farh...@gmail.com> > Reply-To: java-user@lucene.apache.org > To: java-user@lucene.apache.org > Subject: Re: relevance function for scores > Date: Mon, 25 May 2009 16:11:32 -0600 > > Woops. Got that backwards.. should read > >> if (score[n] / score[n-1]) < c / (boost_factor) > > > On Mon, May 25, 2009 at 4:10 PM, Babak Farhang <farh...@gmail.com> > wrote: >> How about determining the cutoff by measuring the percentage >> difference between successive scores: if the score drops by a >> threshold amount then you've hit the cutoff. In the example you >> mention, you might want to try something like c/1000, where 1 < c < >> 25 >> is a constant (experiment to find a sweet spot for c). >> >> I.e. something like >> >> if (score[n-1] / score[n) < c / (boost_factor) , >> >> then you've reached your cutoff at the n-1th hit >> (where boost_factor=1000 in your example). >> >> One thing to check is that the scores are indeed sorted in descending >> order to begin with. For example, I don't think the hits in >> TopDocCollector and its brethren are strictly ordered this way (no?). >> >> -Babak >> >> On Mon, May 18, 2009 at 6:52 AM, Joel Halbert >> <j...@su3analytics.com> wrote: >>> Hi, >>> >>> I'd like to apply a score filter. I realise that filtering by >>> absolute >>> (i.e. anything less than x) scores is pretty meaningless. >>> >>> In my case I want to filter based on relative score - or on some >>> function of score which looks for clustering of documents around >>> certain >>> score values. >>> >>> Context: I have set up field boosts such that a query hit on one >>> indexed >>> field will, in theory, result in a score one or more order of >>> magnitudes >>> greater than a hit on some other field. So if I have 2 fields A >>> and B >>> and I'm really really interested in hits on A, and only interested >>> in >>> hits on B if there were none on A, I boost A by 1000, relative to >>> B. >>> The resultant score should reflect this. >>> >>> The ability to do this becomes important when we want to re-order >>> the >>> search results around some other field (not score) and are not >>> interested in displaying the least relevant documents. >>> >>> >>> It is an easy thing to write a basic 'document collector/result >>> filter' >>> that uses relative score information to filter out documents where >>> any >>> score is less than some magnitude of the best score, but I'm sure >>> this >>> could be more elegantly generalised into some mathematical >>> "relevance/significance" model/function which could determine some >>> optimal cutoff for documents based on the clustering of results >>> around >>> scores. >>> e.g. if my top 5 documents are all between score 0.9 and 0.7 and the >>> remaining 10 are less than 0.01 then we could sensibly take the >>> top 5 >>> docs as most relevant. >>> >>> Has anyone experience of doing such a thing? >>> >>> >>> Regards, >>> Joel >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org