+1 for a "completely accurate" (each snippet shown matches the query)
and fast highlighter, but it's a real challenge because you need a
clean way to recursively iterate all positions for any query, even
non-positional ones (which is what LUCENE-2878 will give us).  To
properly handle your (+A +B) (+C +D) example, you'd need BooleanQuery
to participate in enumerating the positions...
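
To make the (+A +B) (+C +D) case concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the Node interface and the term/and/or helpers are hypothetical names invented for illustration, and it works on sets of terms rather than on real postings or positions. It only shows the idea that a conjunction should contribute its terms to the highlighter when all of its clauses matched in the document.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy query tree. Hypothetical names throughout; this is not Lucene's Query/Scorer API. */
final class MatchedTermsSketch {

    /** A query node that reports whether it matches and which terms took part in the match. */
    interface Node {
        /** Returns true if the node matches docTerms; adds terms to out only when it matches. */
        boolean collect(Set<String> docTerms, Set<String> out);
    }

    /** Leaf: matches when the document contains the term. */
    static Node term(String t) {
        return (doc, out) -> {
            if (!doc.contains(t)) {
                return false;
            }
            out.add(t);
            return true;
        };
    }

    /** Conjunction (like +A +B): every child must match, otherwise nothing is contributed. */
    static Node and(Node... children) {
        return (doc, out) -> {
            Set<String> local = new LinkedHashSet<>();
            for (Node c : children) {
                if (!c.collect(doc, local)) {
                    return false; // discard terms gathered from the clauses that did match
                }
            }
            out.addAll(local);
            return true;
        };
    }

    /** Disjunction: matches if any child matches; only matching children contribute terms. */
    static Node or(Node... children) {
        return (doc, out) -> {
            boolean any = false;
            for (Node c : children) {
                Set<String> local = new LinkedHashSet<>();
                if (c.collect(doc, local)) {
                    any = true;
                    out.addAll(local);
                }
            }
            return any;
        };
    }

    /** Terms that should be highlighted for this query in a document with the given terms. */
    static Set<String> termsToHighlight(Node query, Set<String> docTerms) {
        Set<String> out = new LinkedHashSet<>();
        query.collect(docTerms, out);
        return out;
    }

    public static void main(String[] args) {
        // (A && B) || (C && D) against a document that has A, B and C but no D.
        Node q = or(and(term("A"), term("B")), and(term("C"), term("D")));
        Set<String> doc = new LinkedHashSet<>(Arrays.asList("A", "B", "C"));
        System.out.println(termsToHighlight(q, doc)); // prints [A, B]
    }
}
```

In a real highlighter the same walk would have to yield positions and offsets per matching clause rather than bare terms, which is the part that needs BooleanQuery's cooperation.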

Yes, I think WSTE could just pull from the postings.


Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 10, 2014 at 12:38 AM, David Smiley wrote:
> I’m working on making highlighting both accurate and fast.  By “accurate”, I
> mean the highlights need to accurately reflect a match given the query and
> various possible query types (including SpanQueries and MultiTermQueries
> and obviously phrase queries and the usual suspects).  The fastest
> highlighter we’ve got in Lucene is the PostingsHighlighter but it throws out
> any positional nature in the query and can highlight more inaccurately than
> the other two highlighters. The most accurate is the default highlighter,
> although I can see some simplifications it makes that could lead to
> inaccuracies.
>
> The default highlighter’s “WeightedSpanTermExtractor” is interesting — it
> uses a MemoryIndex built from re-analyzing the text, and it executes the
> query against this mini index; kind of.  A recent experiment I did was to
> have the MemoryIndex essentially wrap the “Terms” from term vectors.  It
> works and saves memory, although, at least for large docs (which I’m
> optimizing for) the real performance hit is in un-inverting the TokenStream
> in TokenSources, including sorting the thousands of tokens -- assuming you
> index term vectors of course.  But with my attention now on the
> PostingsHighlighter (because it’s the fastest and offsets are way cheaper
> than term vectors), I believe WeightedSpanTermExtractor could simply use
> Lucene’s actual IndexReader — no?  It seems so obvious to me now I wonder
> why it wasn’t done this way in the first place — all WSTE has to do is
> advance() to the document being highlighted for applicable terms.  Am I
> overlooking something?
>
> WeightedSpanTermExtractor is somewhat accurate but my reading of its source
> shows it takes short-cuts I’d like to eliminate.  For example if the query
> is “(A && B) || (C && D)” and if the document doesn’t have ‘D’ then it
> should ideally NOT highlight ‘C’ in this document, just ‘A’ and ‘B’.  I
> think I can solve that using Scorer.getChildren() to see which scorers
> (and thus queries) actually matched.  Another example is that it views
> SpanQueries at the top level only and records the entire span for all the
> terms that make it up.  So if you had a couple Phrase SpanQueries (actually
> ordered 0-slop SpanNearQueries) joined by a SpanNearQuery to be within ~50
> positions of each other, I believe it would highlight any other occurrence
> of the words involved in-between the sub-SpanQueries. This looks hard to
> solve but I think for starters, SpanScorer needs a getter for the Spans
> instance, and furthermore Spans needs getChildSpans() just as Scorers expose
> child scorers.  I could see myself relaxing this requirement because of its
> complexity and simply highlighting the entire span, even if it could be a
> big highlight.
>
> Perhaps the “Nuke Spans” effort might make this all much easier but I
> haven’t looked yet because it isn’t finished.  It’s encouraging to
> see Alan making recent progress there.
>
> Any thoughts about any of this, guys?
>
> p.s. When I’m done, I expect to have no problem getting open-source
> permission from the sponsor commissioning this effort.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
