Robichaud, Jean-Philippe wrote:

Probably the simplest/ideal schema of the ScoreObject would be something
like a hashtable with Term being the keys and a TermScoreObject the value.
The TermScoreObject would be filled at search time (if asked) and would
contain all values used in the calculation of the "similarity score". That
way we could easily know what is the contribution of a specific term to the
overall score.


Jean-Philippe,

Some of us have talked about a score object in the past and agree that this would be a very good thing. In addition to providing a sounder foundation for explanation, such a mechanism could help to provide better scoring. For example, one limitation in Lucene now is that score normalization is ad hoc -- all scores are divided by the highest score IF the highest score is greater than 1, and whether or not the highest unnormalized score is greater than 1 is pretty much random. This yields a situation where scores across multiple searches are not comparable (notwithstanding the many applications that do compare them and get essentially random results). With a score object, one would like to keep additional information, e.g., a count of boost-weighted query terms and the boost-weighted percentage of such terms that were matched by each result. This could provide a more intrinsic normalization scheme, e.g., defining the highest score as the boost-weighted percentage of query terms matched by the top result and dividing all scores by the same constant to achieve this. (Some additional refinements are necessary to handle things like MultiTermQuery instances, which rewrite to BooleanQuery instances with coord disabled -- such lists of alternate query terms should count as one term.)
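
To make the contrast concrete, here is a minimal sketch of the two normalization schemes described above. It is not Lucene code: the class name, the method names, and the parallel-array inputs (rawScores, matchedBoost, totalBoost) are all assumptions made up for illustration, and the refinement for rewritten multi-term queries is left out.

```java
// Hypothetical sketch; none of these names exist in Lucene.
final class ScoreNormalizationSketch {

    /** Ad hoc scheme: divide every score by the top score, but only if the top score exceeds 1. */
    static float[] normalizeAdHoc(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float[] out = rawScores.clone();
        if (max > 1f) {
            for (int i = 0; i < out.length; i++) {
                out[i] /= max;
            }
        }
        return out;
    }

    /**
     * Proposed scheme: choose one divisor so that the top hit's score equals the
     * boost-weighted fraction of query terms it matched, then apply that same
     * divisor to every hit.
     *
     * matchedBoost[i] is the summed boost of the query terms matched by hit i;
     * totalBoost is the summed boost of all query terms (both are assumed inputs).
     */
    static float[] normalizeByMatchedFraction(float[] rawScores, float[] matchedBoost, float totalBoost) {
        float[] out = rawScores.clone();
        if (out.length == 0 || totalBoost <= 0f) {
            return out; // nothing to normalize against
        }
        int top = 0;
        for (int i = 1; i < out.length; i++) {
            if (out[i] > out[top]) {
                top = i;
            }
        }
        float target = matchedBoost[top] / totalBoost;
        if (out[top] <= 0f || target <= 0f) {
            return out; // degenerate case: leave the scores untouched
        }
        float divisor = out[top] / target;
        for (int i = 0; i < out.length; i++) {
            out[i] /= divisor;
        }
        return out;
    }
}
```

With the second method, a top hit that matched half of the boost-weighted query terms ends up with a score of 0.5 no matter how large its raw score was, which is what would make scores from different searches comparable.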

That is one additional example of something score objects could be used for. A general mechanism should provide for easy extension such that different scoring classes could collect, record and aggregate different information for various purposes.
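
As a rough illustration of the hashtable idea quoted at the top together with the extension point described here, something like the following might work. Everything in it is hypothetical: the class names, the use of a plain String in place of Lucene's Term, and the tf/idf/boost/norm fields, whose product is only a simplified stand-in for the real similarity calculation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-term record; the four factors are an illustrative simplification.
final class TermScoreSketch {
    final float tf;
    final float idf;
    final float boost;
    final float norm;

    TermScoreSketch(float tf, float idf, float boost, float norm) {
        this.tf = tf;
        this.idf = idf;
        this.boost = boost;
        this.norm = norm;
    }

    /** This term's contribution to the overall score under the simplified formula. */
    float contribution() {
        return tf * idf * boost * norm;
    }
}

// Hypothetical score object: one entry per query term, filled in at search time.
final class ScoreObjectSketch {
    private final Map<String, TermScoreSketch> byTerm = new LinkedHashMap<>();

    void record(String term, TermScoreSketch score) {
        byTerm.put(term, score);
    }

    TermScoreSketch get(String term) {
        return byTerm.get(term);
    }

    /** Sum of all per-term contributions. */
    float total() {
        float sum = 0f;
        for (TermScoreSketch s : byTerm.values()) {
            sum += s.contribution();
        }
        return sum;
    }

    /** Fraction of the overall score contributed by one term (0 if the term is unknown). */
    float share(String term) {
        TermScoreSketch s = byTerm.get(term);
        float total = total();
        return (s == null || total == 0f) ? 0f : s.contribution() / total;
    }

    /** Extension point: callers supply their own aggregation over a read-only view of the records. */
    <T> T aggregate(Function<Map<String, TermScoreSketch>, T> aggregator) {
        return aggregator.apply(Collections.unmodifiableMap(byTerm));
    }
}
```

An explanation facility could call share("lucene") to report one term's contribution, while a normalization scheme could pass its own function to aggregate() to compute, say, the boost-weighted count of matched terms.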

I've wanted to work on this for a while but haven't found the time. I know Doug has had a score object mechanism on his radar screen (he first suggested this approach to me as a solution to the normalization issue I'm concerned about). I expect he has a good approach in mind. It would be great if you'd tackle this -- I'd be happy to help if that makes sense.

Chuck

