On Fri, Oct 10, 2014 at 12:38 AM, [email protected] <[email protected]> wrote:
> The fastest highlighter we’ve got in Lucene is the PostingsHighlighter but it
> throws out any positional nature in the query and can highlight more
> inaccurately than the other two highlighters.
> [...] mission from the sponsor commissioning this effort.

That's because it tries to summarize the document contents with respect to the query, so the user can decide if it's relevant (versus being a debugger for span queries, or whatever). The algorithms used to do this don't really benefit from positions, because they are the same ones used for regular IR. In short, the "inaccuracy" is important, because this highlighter is trying to do something different than the other highlighters.

The reason it might be faster in comparison has less to do with the fact that it reads offsets from the postings lists and more to do with the fact that it does not have the bad O(n^2) (etc.) algorithms that the other highlighters do. It's not faster: it just does not blow up.

I don't think you can safely make this highlighter do what you would like without compromising these goals (relevance of passages, and not blowing up): for a phrase or span, how can you compute the within-document freq() without actually reading all those positions (which means blowing up)? With terms it's simple, effective, and does not blow up: freq() -> IDF. It's the same term dependence issue from regular scoring, and it's not going to be solved in an email to the Lucene JIRA list. The best I can do that is safe is https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed interested, so it sits.

So IMO, for scoring spans or intervals or whatever, a different highlighter is needed that makes some compromises (worse relevance, willingness to blow up). Hopefully they would be contained so that most users aren't impacted heavily by blowing up or getting badly ranked sentences. But I don't think we should make it so PostingsHighlighter can blow up. There are already two other highlighters for that.
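
To make the cost asymmetry concrete, here is a rough sketch in plain Java. It does not use Lucene's API: the class `PhraseFreqSketch`, its methods, and the toy in-memory "postings" map are all hypothetical, made up for illustration. The point is only that a term's within-document frequency can be answered from a stored count, while an exact phrase frequency has to walk the positions of the terms involved.

```java
import java.util.*;

// Hypothetical toy model, NOT Lucene code: per-document "postings" that map
// each term to the sorted list of token positions where it occurs.
final class PhraseFreqSketch {

    // Build the toy postings for one document by splitting on whitespace.
    static Map<String, int[]> index(String text) {
        Map<String, List<Integer>> tmp = new HashMap<>();
        int pos = 0;
        for (String tok : text.trim().toLowerCase().split("\\s+")) {
            if (tok.isEmpty()) continue; // empty input yields one empty token
            tmp.computeIfAbsent(tok, k -> new ArrayList<>()).add(pos++);
        }
        Map<String, int[]> postings = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : tmp.entrySet()) {
            // Positions were appended in increasing order, so each array is sorted.
            postings.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).toArray());
        }
        return postings;
    }

    // Term frequency: a stored count. No positions are read.
    static int termFreq(Map<String, int[]> postings, String term) {
        int[] positions = postings.get(term);
        return positions == null ? 0 : positions.length;
    }

    // Exact phrase frequency: every position of the first term has to be
    // checked against the positions of the following terms.
    static int phraseFreq(Map<String, int[]> postings, String... terms) {
        if (terms.length == 0) return 0;
        int[] first = postings.get(terms[0]);
        if (first == null) return 0;
        int count = 0;
        for (int start : first) {
            boolean match = true;
            for (int i = 1; i < terms.length && match; i++) {
                int[] p = postings.get(terms[i]);
                match = p != null && Arrays.binarySearch(p, start + i) >= 0;
            }
            if (match) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, int[]> p = index("the quick fox saw the quick dog and the quick fox");
        System.out.println(termFreq(p, "quick"));          // prints 3
        System.out.println(phraseFreq(p, "quick", "fox")); // prints 2
    }
}
```

In this toy version the work done by `phraseFreq` grows with the number of positions of the first term times the phrase length, whereas `termFreq` is a single lookup. That is the kind of cost I would not want PostingsHighlighter to take on.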
