Thanks for your answers. Your input is really appreciated :-)

@Paul Elschot:
Thanks for the hint. I guess I could use coord() to penalize missing
terms like this:

Query: a b c d
Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75

Doc would score higher. I guess that might be a valid solution.

There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5

So a perfect match with one insertion would score less than a 3 of 4
match with no slop.

As for spanqueries:
My implementation is based of the default PhraseQuery with slop > 0. I
don't know the inner workings of SpanQueries, but what you describe
sounds alot like what the PhraseQuery does as well (i.e. calculate max
distance between last and first term, and use that with sloppyFreq()).

I chose PhraseQuery as base of my work, because I felt that it would
offer better performance than firing off a plethora of spanqueries to
express the same query.

Long story short: My problem would generalize to spanqueries if
spanqueries would face the problem of deleted terms. But I guess they
don't?!

@Chris Hostetter: You are absolutely right. But it shows off into
which direction it could go to. Perhaps I could add +1 (or some other
amount) as additional penalty to the maximum error for missing terms
to distinguish between these cases further.

But still this could lead to a case where

Doc A: a b c x1 x2 [more x...] xn d
will be scored lower than
Doc B: a b c
(because the distance of A can exceed the penalty for the missing term
- its only a matter of choosing the right n)
which is questionable as well.

2007/3/6, Chris Hostetter <[EMAIL PROTECTED]>:
: My initial idea was to penalize a missing term position with its maximum 
error.
:
: Consider this:
: Query:  a b c d
: Document A: b c d
:
: Term a is missing, score it as if it was at the worst position possible
:
: result:       b c d a
: pos. diffs: -1 -1 -1 +3

side comment: this doesn't sound very useful, a document containing "b c
d" matches equally to a doc containing "b c d a" ? ... shouldn't a doc
containing "b c d a" be considered a much better match since it at least
contains all of the terms close together?



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to