Re: de-boosting fields

Erick Erickson Fri, 08 Dec 2006 18:06:48 -0800

I've certainly seen references to writing custom scorers, so it's possible.
You might find valuable hints by searching the mail archive. I'll leave it
to the more expert folks to suggest which is your best option.


Although (and I'm talking beyond my competence here), it *may* work for you
to assemble a Filter for the category part of your query and use that
instead of including the category in your query. As I understand it, filters
don't contribute (or all contribute identically) to the score, leaving the
search you're doing on body to determine your relevance, which seems like
what you're after. Filters even work with something called a
ConstantScoreQuery as I remember, which is a hint <G>.
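
To make the filter idea concrete, here is a toy sketch in plain Java. It
deliberately does not use the Lucene API: the Doc class, the bodyScore
"relevance" (a simple count of matching words) and the sample data are all
made up for illustration. The only point is that the category test acts as
a yes/no gate, while the ordering comes from the body alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterSketch {

    /** Hypothetical toy document: body words plus category keywords. */
    static final class Doc {
        final String id;
        final List<String> body;
        final Set<String> categories;

        Doc(String id, List<String> body, Set<String> categories) {
            this.id = id;
            this.body = body;
            this.categories = categories;
        }
    }

    /** Toy relevance (not Lucene's formula): count of body words matching a query word. */
    static int bodyScore(Doc doc, Set<String> queryWords) {
        int hits = 0;
        for (String word : doc.body) {
            if (queryWords.contains(word)) {
                hits++;
            }
        }
        return hits;
    }

    /** Filter step: a yes/no gate that adds nothing to the score. */
    static boolean inAnyCategory(Doc doc, Set<String> wanted) {
        for (String category : doc.categories) {
            if (wanted.contains(category)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the ids of matching documents, best body score first. */
    static List<String> search(List<Doc> docs, Set<String> queryWords, Set<String> wanted) {
        List<Doc> hits = new ArrayList<>();
        for (Doc doc : docs) {
            if (inAnyCategory(doc, wanted) && bodyScore(doc, queryWords) > 0) {
                hits.add(doc);
            }
        }
        hits.sort((a, b) -> Integer.compare(bodyScore(b, queryWords), bodyScore(a, queryWords)));
        List<String> ids = new ArrayList<>();
        for (Doc doc : hits) {
            ids.add(doc.id);
        }
        return ids;
    }

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Made-up sample data, for illustration only. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("a", Arrays.asList("lucene", "index"), setOf("news", "tech", "sports")),
            new Doc("b", Arrays.asList("lucene", "score", "lucene"), setOf("tech")),
            new Doc("c", Arrays.asList("lucene", "score", "score", "lucene"), setOf("cooking")));
    }

    public static void main(String[] args) {
        List<String> ranked = search(sampleDocs(), setOf("lucene", "score"), setOf("tech", "news"));
        System.out.println(ranked); // prints [b, a]
    }
}
```

Document "a" sits in more of the wanted categories than "b", but that has
no effect on the order; "c" has the most word hits but is dropped by the
category gate.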

But again, don't be surprised if one of the more expert folks comes up with
a *much* better idea <G>

Best
Erick



On 12/8/06, Scott Smith <[EMAIL PROTECTED]> wrote:


I have a collection of documents for which I've always returned the
results sorted on the date/time of the document (using a sort object in
the search method on my Searcher).  It works great.



Suddenly, I have a requirement to return the documents in relevancy
order.  So, that's easy (I thought); simply call search() without a sort
object.  Unfortunately, the results I got were not what I expected.  So,
I added some code to have lucene explain how it was getting the score
and then things became clearer.



Each document has all of the words in the document indexed in a field
called "Body" (vanilla unstored, indexed field).  However, there is also
some category information which is kept in a keyword field called
"Category".  A document may belong to a large number of categories
(10-70).



When I search, I generate a query which says "give me all of the
documents, in relevancy order, which contain one or more of the
following words: word1, word2, word3, and each must also be in at least
one of the following categories: category1, category2, ..., categoryN."



What I found was that lucene was using the category information as part
of what it uses to compute the relevancy score (in hindsight, not too
surprising).  The problem is that the numbers from the category hits in
"Category" overwhelm the numbers from the word hits in the "Body".  So,
my top-ranked document may only have a single word hit, while a document
way down in the list might have a number of word hits.  For example, in
one search, the top scoring document scored .2650.  Of that, the category
information contributed .2635, meaning the word hits only contributed
.0015 to the relevancy.
This is the opposite of what I want.



I'd be happy to simply eliminate the category information from the score
computation altogether (base relevancy scores only on the words which
hit in the "Body" field).  Another solution would be to change the boost
on the category information to some small number (zero?) or raise the
Body field boost to a much larger number or both.
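
As a rough illustration of the boost idea, assume (simplistically) that the
final score is a boost-weighted sum of the per-field contributions. Lucene's
real formula also applies normalization factors, so this is only a toy model
and not the library's scoring code. The combined() helper and document Y are
hypothetical; document X uses the figures from the example above.

```java
public class BoostSketch {

    /** Toy linear model: each field's score is scaled by its boost, then summed. */
    static double combined(double bodyScore, double categoryScore,
                           double bodyBoost, double categoryBoost) {
        return bodyBoost * bodyScore + categoryBoost * categoryScore;
    }

    public static void main(String[] args) {
        // Document X: .0015 from Body and .2635 from Category, as in the example above.
        double xBody = 0.0015, xCategory = 0.2635;
        // Document Y is hypothetical: more word hits, fewer category hits.
        double yBody = 0.0100, yCategory = 0.0500;

        // Equal boosts: the category part dominates, so X outranks Y.
        System.out.println(
            combined(xBody, xCategory, 1.0, 1.0) > combined(yBody, yCategory, 1.0, 1.0)); // prints true

        // Category boost of zero: only the Body part remains, so Y outranks X.
        System.out.println(
            combined(yBody, yCategory, 1.0, 0.0) > combined(xBody, xCategory, 1.0, 0.0)); // prints true
    }
}
```

Under this assumption, a zero boost on the category part removes it from the
ranking entirely, while raising only the Body boost would need a factor large
enough to outweigh the category contribution.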



What is the best way to do this?  Is changing the boost the right
answer?  Can a field's boost be zero?  Is there a way to write a custom
scorer that gets inserted somewhere?  Any suggestions would be
appreciated.



Scott
