On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
<[email protected]> wrote:
> The fastest
> highlighter we’ve got in Lucene is the PostingsHighlighter but it throws out
> any positional nature in the query and can highlight more inaccurately than
> the other two highlighters. [...] mission from the sponsor commissioning this
> effort.
>

That's because it tries to summarize the document contents with respect
to the query, so the user can decide if it's relevant (versus being a
debugger for span queries, or whatever). The algorithms used to do this
don't really benefit from positions, because they are the same ones
used for regular IR.

In short, the "inaccuracy" is important, because this highlighter is
trying to do something different than the other highlighters.

The reason it might be faster in comparison has less to do with the
fact that it reads offsets from the postings lists and more to do with
the fact that it does not have the bad O(n^2) (etc.) algorithms that the
other highlighters do. It's not faster: it just does not blow up.

I don't think you can safely make this highlighter do what you would
like without compromising these goals (relevance of passages, and not
blowing up): for a phrase or span, how can you compute the
within-document freq() without actually reading all those positions
(which means blowing up)? With terms it's simple, effective, and does
not blow up: freq() -> IDF. It's the same term dependence issue as in
regular scoring, and it is not going to be solved in an email to the
lucene jira list. The best I can do that is safe is
https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed
interested, so it sits.
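
To make the term-versus-phrase point concrete, here is a minimal sketch over a toy in-memory index. Everything in it is an assumption for illustration: the POSTINGS layout, the function names, and the BM25-style weight are hypothetical and are not Lucene's API or the PostingsHighlighter's actual code. It only shows that a term's within-document frequency is a single stored number, while a phrase's frequency cannot be known without walking the position lists.

```python
import math

# Hypothetical toy postings: term -> {doc_id: (freq, positions)}.
# Assumption: freq is stored on its own, so it can be read without positions.
POSTINGS = {
    "quick": {1: (2, [0, 7])},
    "fox": {1: (3, [1, 4, 8])},
}


def term_freq(term, doc_id):
    """Within-document frequency of a single term: one lookup, no positions read."""
    entry = POSTINGS.get(term, {}).get(doc_id)
    return entry[0] if entry else 0


def idf_like_weight(freq, num_passages):
    """Illustrative IDF-style weight (an assumed BM25-like form): rarer terms weigh more."""
    return math.log(1 + (num_passages + 0.5) / (freq + 0.5))


def phrase_freq(terms, doc_id):
    """Exact-phrase frequency: requires reading the position list of every term.

    Returns (number of phrase occurrences, number of positions read).
    """
    position_lists = []
    for term in terms:
        entry = POSTINGS.get(term, {}).get(doc_id)
        if entry is None:
            return 0, 0
        position_lists.append(entry[1])
    positions_read = sum(len(positions) for positions in position_lists)
    later_terms = [set(positions) for positions in position_lists[1:]]
    # A phrase occurs at `start` if term i+1 appears at position start + i + 1.
    count = sum(
        1
        for start in position_lists[0]
        if all(start + i + 1 in seen for i, seen in enumerate(later_terms))
    )
    return count, positions_read
```

In this toy index, term_freq("fox", 1) is answered from one stored value, while phrase_freq(["quick", "fox"], 1) has to touch every position of both terms before it can report a count. The cost of the second grows with the number of positions in the document, which is the "blowing up" concern.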

So IMO, for scoring spans or intervals or whatever, a different
highlighter is needed that makes some compromises (worse relevance,
willingness to blow up). Hopefully those would be contained so that
most users aren't heavily impacted by blowing up or by getting badly
ranked sentences. But I don't think we should make it so
PostingsHighlighter can blow up. There are already two other
highlighters for that.
