Re: Highlighters, accurate highlighting, and the PostingsHighlighter

Walter Underwood Fri, 10 Oct 2014 09:57:07 -0700

I think of snippets and highlighting as explaining to the end user why the 
engine decided this was relevant. This tends to increase the user’s trust in 
the engine even when the results are not relevant.


wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/


On Oct 10, 2014, at 9:37 AM, Uwe Schindler <[email protected]> wrote:

> Hi,
>  
> > I’m confused how inaccuracy is a feature, but nevertheless I appreciate 
> > that the postings highlighter as-is is good enough for most users.  Thanks 
> > for your awesome work on this highlighter, by the way!
>  
> The problem here is that there are two different opinions about what 
> highlighting should look like. What most “technical” people want is *not* 
> “highlighting” in the sense of “showing where the search terms match in a 
> specific document, so that the user can ‘relevance test’ a specific result 
> himself”. Instead, technical people want “query debugging”: showing exactly 
> why a query matches. But this is not what highlighting was made for 
> (especially not the postings highlighter!).
>  
> I think Robert’s intention behind the postings highlighter (and I think he 
> is entirely right) is just to give the “end user” (not the “technical user”) a 
> quick overview of where the terms match in a document, completely ignoring 
> the type of query. You just want to get a quick context in the document where 
> the terms of your query match. I always explain it to customers like “allow 
> the end user to relevance rank the document themselves”.
>  
> Uwe
>  
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>  
> From: [email protected] [mailto:[email protected]] 
> Sent: Friday, October 10, 2014 4:46 PM
> To: [email protected]
> Subject: Re: Highlighters, accurate highlighting, and the PostingsHighlighter
>  
> On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:
> On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
> <[email protected]> wrote:
> > The fastest
> > highlighter we’ve got in Lucene is the PostingsHighlighter but it throws out
> > any positional nature in the query and can highlight more inaccurately than
> > the other two highlighters. mission from the sponsor commissioning this 
> > effort.
> >
> 
> That's because it tries to summarize the document contents with respect to
> the query, so the user can decide if it's relevant (versus being a debugger
> for span queries, or whatever). The algorithms used to do this don't
> really benefit from positions, because they are the same ones
> used for regular IR.
> 
> In short, the "inaccuracy" is important, because this highlighter is
> trying to do something different than the other highlighters.
>  
> I’m confused how inaccuracy is a feature, but nevertheless I appreciate that 
> the postings highlighter as-is is good enough for most users.  Thanks for 
> your awesome work on this highlighter, by the way!
>  
> The reason it might be faster in comparison has less to do with the
> fact it reads offsets from the postings lists and more to do with the
> fact it does not have the bad O(n^2) etc. algorithms that the other
> highlighters do. It's not faster: it just does not blow up.
>  
> Well, it isn’t cheap to re-analyze the document text (what the default 
> highlighter does) nor to read term-vectors and sort the tokens (what the 
> default highlighter does when term vectors are available).  At least not with 
> big docs (lots of text to analyze or large term vectors to read and sort).  
> My first steps were to try and make the default highlighter faster but it 
> still isn’t fast enough and it isn’t accurate enough either (for me).
>  
> I looked at the FVH a little but thought I’d skip the heft of term vectors 
> and use PostingsHighlighter, now that I’m willing to break open these complex 
> beasts and build what’s needed to meet my accuracy requirements.
>  
> Do you foresee any O(n^2) algorithms in what I’ve said?
>  
> I don't think you can safely make this highlighter do what you would
> like without compromising these goals (relevance of passages, and not
> blowing up): for a phrase or span, how can you compute the
> within-document freq() without actually reading all those positions
> (which means blowing up)? With terms it's simple, effective, and does not
> blow up: freq() -> IDF. It's the same term dependence issue from
> regular scoring, not going to be solved in an email to lucene jira
> list. The best I can do that is safe is
> https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed
> interested, so it sits.
>  
> I plan to make simple approximations to score one passage relative to 
> another.  The passage with the most diversity in query terms wins, or at 
> least is the highest scoring factor. Then, low within-doc-freq (on a per-term 
> basis).  Then, high freq in the passage.  Then, shortness of passage and 
> closeness to the beginning.  In short, something fast to compute and pretty 
> reasonable — my principal requirement is highlighting accuracy, and needs to 
> support a lot of query types (incl. custom span queries).
>  
> So IMO, for scoring spans or intervals or whatever, a different
> highlighter is needed that makes some compromises (worse relevance,
> willingness to blow up). Hopefully they would be contained so that
> most users aren't impacted heavily by blowing up or getting badly
> ranked sentences. But I don't think we should make it so
> PostingsHighlighter can blow up. There are already two other
> highlighters for that.
>  
> Ok; I’m not sure yet how much from the PostingsHighlighter I’ll re-use but 
> there is a lot of it that is pertinent to my aims.  So much so, probably, 
> that I can see it being a subclass, or at least belong in the same package.  
> It uses postings/offsets (not term vectors, and without re-analyzing text).
>  
> Thanks for your input, Rob.
>  
> ~ David
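
The passage-ranking order David describes above (most distinct query terms first, then low within-doc-freq, then high frequency in the passage, then shortness, then closeness to the beginning) can be sketched as a chained comparator. This is only an illustration under stated assumptions, not code from Lucene or from David's work: the Passage class and all of its field names are hypothetical, and reducing "low within-doc-freq (on a per-term basis)" to the within-document frequency of the rarest matched term is just one possible reading.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PassageRanking {

    /** Hypothetical per-passage summary; this is NOT a Lucene class. */
    public static final class Passage {
        public final int startOffset;        // character offset of the passage in the document
        public final int length;             // passage length in characters
        public final int distinctTerms;      // number of distinct query terms matched
        public final int rarestTermDocFreq;  // assumption: within-doc freq of the rarest matched term
        public final int matchCount;         // total query-term occurrences in the passage

        public Passage(int startOffset, int length, int distinctTerms,
                       int rarestTermDocFreq, int matchCount) {
            this.startOffset = startOffset;
            this.length = length;
            this.distinctTerms = distinctTerms;
            this.rarestTermDocFreq = rarestTermDocFreq;
            this.matchCount = matchCount;
        }
    }

    /** Orders passages best-first, applying each criterion only to break ties in the previous one. */
    public static final Comparator<Passage> BEST_FIRST =
        Comparator.comparingInt((Passage p) -> p.distinctTerms).reversed()  // more term diversity wins
            .thenComparingInt(p -> p.rarestTermDocFreq)                     // then rarer terms in the doc
            .thenComparing(Comparator.comparingInt((Passage p) -> p.matchCount).reversed()) // then more hits
            .thenComparingInt(p -> p.length)                                // then shorter passages
            .thenComparingInt(p -> p.startOffset);                          // then earlier passages

    /** Returns a new list sorted best-first; the input list is left unmodified. */
    public static List<Passage> rank(List<Passage> passages) {
        List<Passage> sorted = new ArrayList<>(passages);
        sorted.sort(BEST_FIRST);
        return sorted;
    }
}
```

Every comparison here uses only small per-passage counters, so ranking n candidate passages costs one O(n log n) sort. Filling in those counters for phrase or span matches is the part Robert warns about, since it requires reading positions.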
