RE: Highlighters, accurate highlighting, and the PostingsHighlighter

Uwe Schindler Fri, 10 Oct 2014 09:38:24 -0700

Hi,

> I’m confused how inaccuracy is a feature, but nevertheless I appreciate that 
> the postings highlighter as-is is good enough for most users.  Thanks for 
> your awesome work on this highlighter, by the way!

The problem here is that there are two different opinions about what highlighting 
should look like. What most “technical” people want is *not* “highlighting” in 
the sense of “showing where the search terms match in a specific document so 
that the user can ‘relevance test’ a specific result himself”. Instead, technical 
people want “query debugging”: showing exactly why a query matches. But this 
is not what highlighting was made for (especially not the postings highlighter!).

I think Robert’s intention behind the postings highlighter (and I fully 
think he is right) is to just give the “end user”, not the “technical user”, a 
quick overview of where the terms match in a document, completely ignoring the 
type of query. You just want to get a quick context in the document where the 
terms of your query match. I always explain it to customers like “allow the end 
user to relevance rank the document themselves”.

Uwe

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [email protected]

From: [email protected] [mailto:[email protected]] 
Sent: Friday, October 10, 2014 4:46 PM
To: [email protected]
Subject: Re: Highlighters, accurate highlighting, and the PostingsHighlighter

On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:

On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
<[email protected]> wrote:
> The fastest
> highlighter we’ve got in Lucene is the PostingsHighlighter but it throws out
> any positional nature in the query and can highlight more inaccurately than
> the other two highlighters. mission from the sponsor commissioning this 
> effort.
>

That's because it tries to summarize the document contents with respect to
the query, so the user can decide if it's relevant (versus being a debugger
for span queries, or whatever). The algorithms used to do this don't
really benefit from positions, because they are the same ones
used for regular IR.

In short, the "inaccuracy" is important, because this highlighter is
trying to do something different than the other highlighters.

I’m confused how inaccuracy is a feature, but nevertheless I appreciate that 
the postings highlighter as-is is good enough for most users.  Thanks for your 
awesome work on this highlighter, by the way!

The reason it might be faster in comparison has less to do with the
fact it reads offsets from the postings lists and more to do with the
fact it does not have bad O(n^2) etc algorithms that the other
highlighters do. It's not faster: it just does not blow up.

Well, it isn’t cheap to re-analyze the document text (what the default 
highlighter does) nor to read term-vectors and sort the tokens (what the 
default highlighter does when term vectors are available).  At least not with 
big docs (lots of text to analyze or large term vectors to read and sort).  My 
first steps were to try and make the default highlighter faster but it still 
isn’t fast enough and it isn’t accurate enough either (for me).

I looked at the FVH a little but thought I’d skip the heft of term vectors and 
use PostingsHighlighter, now that I’m willing to break open these complex 
beasts and build what’s needed to meet my accuracy requirements.

Do you foresee any O(n^2) algorithms in what I’ve said?

I don't think you can safely make this highlighter do what you would
like without compromising these goals (relevance of passages, and not
blowing up): for a phrase or span, how can you compute the
within-document freq() without actually reading all those positions
(which means blowing up)? With terms it's simple, effective, and does not
blow up: freq() -> IDF. It's the same term dependence issue from
regular scoring, not going to be solved in an email to lucene jira
list. The best I can do that is safe is
https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed
interested, so it sits.
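
As a rough illustration of the term-level shortcut described above
(freq() -> IDF), here is a minimal Java sketch. It is not Lucene's code: the
class name, the method names, the saturation formula, and the approxPassages
estimate are all assumptions made for this example. The point it shows is
that each query term can get an IDF-style weight from its within-document
frequency alone, so no positions have to be read to score a passage.

```java
import java.util.Map;

/** Hypothetical sketch of term-only passage scoring; not Lucene's PassageScorer. */
final class TermPassageScoreSketch {

    /**
     * IDF-style weight: a term that occurs rarely within this document is
     * worth more than one that occurs everywhere in it.
     * approxPassages is an assumed estimate of how many passages the document holds.
     */
    static double termWeight(int withinDocFreq, int approxPassages) {
        return Math.log(1.0 + (approxPassages + 0.5) / (withinDocFreq + 0.5));
    }

    /** Saturating frequency factor (assumed form): repeats help less and less. */
    static double tfNorm(int freqInPassage) {
        return freqInPassage / (freqInPassage + 1.0);
    }

    /** Sums weight * tfNorm over the query terms that occur in the passage. */
    static double passageScore(Map<String, Integer> freqInPassage,
                               Map<String, Integer> withinDocFreq,
                               int approxPassages) {
        double score = 0.0;
        for (Map.Entry<String, Integer> e : freqInPassage.entrySet()) {
            // Fall back to the passage frequency if the document-level count is missing.
            int docFreq = withinDocFreq.getOrDefault(e.getKey(), e.getValue());
            score += termWeight(docFreq, approxPassages) * tfNorm(e.getValue());
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Integer> withinDocFreq = Map.of("lucene", 50, "offsets", 2);
        double rare = passageScore(Map.of("offsets", 1), withinDocFreq, 20);
        double common = passageScore(Map.of("lucene", 1), withinDocFreq, 20);
        System.out.println(rare > common); // the passage with the rarer term wins
    }
}
```

A phrase or span has no such precomputed count, which is the difficulty
raised above: its within-document frequency is only known after its
positions have been enumerated.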

I plan to make simple approximations to score one passage relative to another.  
The passage with the most diversity in query terms wins, or at least is the 
highest scoring factor. Then, low within-doc-freq (on a per-term basis).  Then, 
high freq in the passage.  Then, shortness of passage and closeness to the 
beginning.  In short, something fast to compute and pretty reasonable; my 
principal requirement is highlighting accuracy, and it needs to support a lot of 
query types (incl. custom span queries).
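
One way to read the ordering above is as a strict tie-breaking cascade, which
the following Java sketch illustrates. Everything in it is hypothetical: the
Candidate class and its fields are invented for the example, "low
within-doc-freq" is interpreted here as the within-document frequency of the
rarest matched term, and a weighted sum would be an equally valid reading of
"highest scoring factor".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the passage ordering described above. */
final class PassageRankSketch {

    /** Assumed per-passage summary; none of these names come from Lucene. */
    static final class Candidate {
        final int distinctQueryTerms; // how many different query terms match
        final int rarestTermDocFreq;  // within-doc freq of the rarest matched term
        final int matchCount;         // total matches inside the passage
        final int length;             // passage length in characters
        final int startOffset;        // where the passage begins in the document

        Candidate(int distinctQueryTerms, int rarestTermDocFreq,
                  int matchCount, int length, int startOffset) {
            this.distinctQueryTerms = distinctQueryTerms;
            this.rarestTermDocFreq = rarestTermDocFreq;
            this.matchCount = matchCount;
            this.length = length;
            this.startOffset = startOffset;
        }
    }

    /** Best passage first; each later key only breaks ties in the earlier ones. */
    static final Comparator<Candidate> BEST_FIRST =
        Comparator.comparingInt((Candidate c) -> -c.distinctQueryTerms) // more diversity
            .thenComparingInt(c -> c.rarestTermDocFreq)                 // rarer terms
            .thenComparingInt(c -> -c.matchCount)                       // more matches
            .thenComparingInt(c -> c.length)                            // shorter
            .thenComparingInt(c -> c.startOffset);                      // earlier

    public static void main(String[] args) {
        List<Candidate> passages = new ArrayList<>();
        passages.add(new Candidate(1, 1, 9, 50, 0));
        passages.add(new Candidate(2, 5, 2, 100, 400));
        passages.sort(BEST_FIRST);
        // The passage matching two distinct terms sorts ahead of the one matching one.
        System.out.println(passages.get(0).distinctQueryTerms);
    }
}
```

Every key is a small integer that is cheap to collect while walking the
offsets, so the comparison itself adds no extra pass over the text.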

So IMO, for scoring spans or intervals or whatever, a different
highlighter is needed that makes some compromises (worse relevance,
willingness to blow up). Hopefully they would be contained so that
most users aren't impacted heavily by blowing up or by getting badly
ranked sentences. But I don't think we should make it so
PostingsHighlighter can blow up. There are already two other
highlighters for that.

Ok; I’m not sure yet how much from the PostingsHighlighter I’ll re-use but 
there is a lot of it that is pertinent to my aims.  So much so, probably, that 
I can see it being a subclass, or at least belong in the same package.  It uses 
postings/offsets (not term vectors, and without re-analyzing text).

Thanks for your input, Rob.

~ David
