I've certainly seen references to writing custom scorers, so it's possible. you might find valuable hints by searching the mail archive. I'll leave it to the more expert folks to suggest which is your best option.
Although (and I'm talking beyond my competence here), it *may* work for you to assemble a Filter for the category part of your query and use that instead of including the category in your query. As I understand it, filters don't contribute (or all contribute identically) to the score, leaving the search you're doing on body to determine your relevance, which seems like what you're after. Filters even work with something called a ConstantScoreQuery as I remember, which is a hint <G>. But again, don't be surprised if one of the more expert folks comes up with a *much* better idea <G> Best Erick On 12/8/06, Scott Smith <[EMAIL PROTECTED]> wrote:
I have a collection of documents for which I've always returned the results sorted on the date/time of the document (using a sort object in the search method on my Searcher). It works great. Suddenly, I have a requirement to return the documents in relevancy order. So, that's easy (I thought); simply call search() without a sort object. Unfortunately, the results I got were not what I expected. So, I added some code to have lucene explain how it was getting the score and then things became clearer. Each document has all of the words in the document indexed in a field called "Body" (vanilla unstored, indexed field). However, there is also some category information which is kept in a keyword field called "Category". A document may belong to a large number of categories (10-70). When I search, I generate a query which says "give me all of the documents, in relevancy order, which contain one or more of the following words: word1, word2, word 3-and it also must be in at least one of the following categories: category1, category2, ..., categoryN. What I found was that lucene was using the category information as part of what it uses to compute the relevancy score (in hindsight, not too surprising). The problem is that the numbers from the category hits in "Category" overwhelm the numbers from the word hits in the "Body". So, my most relevant document may only have a single word hit and a document way down in the list (in terms of relevancy) might have a number of word hits. For example, in one search, the top scoring document scored .2650. Of that, the category information contributed .2635 to that score-meaning the word hits only contributed .0015 to the relevancy. This is the opposite of what I want. I'd be happy to simply eliminate the category information from the score computation all together (base relevancy scores only on the words which hit in the "Body" field). Another solution would be to change the boost on the category information to some small number (zero?) or raise the Body field boost to a much larger number or both. What is the best way to do this? Is changing the boost the right answer? Can a field's boost be zero? Is there a way to write a custom scorer that gets inserted somewhere? Any suggestions would be appreciated. Scott