I’m working on making highlighting both accurate and fast.  By “accurate”,
I mean the highlights need to reflect what actually matched, given the query
and the various possible query types (including SpanQueries,
MultiTermQueries, phrase queries, and the usual suspects).  The
fastest highlighter we’ve got in Lucene is the PostingsHighlighter, but it
throws out any positional nature in the query and can highlight less
accurately than the other two highlighters.  The most accurate is the
default highlighter, although I can see some simplifications it makes that
could lead to inaccuracies.

The default highlighter’s “WeightedSpanTermExtractor” is interesting: it
uses a MemoryIndex built from re-analyzing the text, and it (more or less)
executes the query against this mini index.  A recent experiment I did was to
have the MemoryIndex essentially wrap the “Terms” from term vectors.  It
works and saves memory, although, at least for large docs (which I’m
optimizing for), the real performance hit is in un-inverting the term
vector into a TokenStream in TokenSources, including sorting the thousands
of tokens (assuming you index term vectors, of course).  But with my
attention now on the PostingsHighlighter (because it’s the fastest, and
offsets in the postings are way cheaper than term vectors), I believe
WeightedSpanTermExtractor could simply use Lucene’s actual IndexReader, no?
It seems so obvious to me now that I wonder why it wasn’t done this way in
the first place: all WSTE has to do is advance() to the document being
highlighted for the applicable terms.  Am I overlooking something?
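
To make the cost difference concrete, here is a rough sketch in plain Java.
It is a toy model and not Lucene code: UninvertSketch, Token, uninvert() and
positionsFor() are hypothetical names, and a Map of term to positions stands
in for both a term vector and the postings of one document.  The first
method rebuilds the whole token order (all tokens get sorted), the second
only touches the terms the query cares about.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy model; none of these types are Lucene classes. */
final class UninvertSketch {

    /** A term occurrence at a token position (offsets omitted for brevity). */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    private static final Comparator<Token> BY_POSITION = new Comparator<Token>() {
        public int compare(Token a, Token b) {
            return Integer.compare(a.position, b.position);
        }
    };

    /** Un-invert: flatten every term's positions, then sort all tokens. */
    static List<Token> uninvert(Map<String, int[]> termToPositions) {
        List<Token> tokens = new ArrayList<Token>();
        for (Map.Entry<String, int[]> e : termToPositions.entrySet()) {
            for (int pos : e.getValue()) {
                tokens.add(new Token(e.getKey(), pos));
            }
        }
        Collections.sort(tokens, BY_POSITION); // cost grows with the whole document
        return tokens;
    }

    /** Query-driven: read positions only for the terms in the query. */
    static List<Token> positionsFor(Map<String, int[]> termToPositions,
                                    List<String> queryTerms) {
        List<Token> tokens = new ArrayList<Token>();
        for (String term : queryTerms) {
            int[] positions = termToPositions.get(term);
            if (positions == null) {
                continue; // term does not occur in this document
            }
            for (int pos : positions) {
                tokens.add(new Token(term, pos));
            }
        }
        Collections.sort(tokens, BY_POSITION); // cost grows with the matches only
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, int[]> doc = new TreeMap<String, int[]>();
        doc.put("the", new int[] {0});
        doc.put("quick", new int[] {1, 4});
        doc.put("brown", new int[] {2});
        doc.put("fox", new int[] {3});
        System.out.println(uninvert(doc).size());
        System.out.println(positionsFor(doc, Collections.singletonList("quick")).size());
    }
}
```

For a large document with only a handful of query terms, the second method
does far less work, which is the intuition behind going to the real index
for just the applicable terms.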

WeightedSpanTermExtractor is somewhat accurate but my reading of its source
shows it takes shortcuts I’d like to eliminate.  For example, if the query
is “(A && B) || (C && D)” and the document doesn’t have ‘D’, then it
should ideally NOT highlight ‘C’ in this document, just ‘A’ and ‘B’.  I
think I can solve that using Scorer.getChildren() to see which scorers
(and thus queries) actually matched.  Another example is that it views
SpanQueries at the top level only and records the entire span for all the
terms it contains.  So if you had a couple of phrase SpanQueries (actually
ordered 0-slop SpanNearQueries) joined by a SpanNearQuery to be within ~50
positions of each other, I believe it would also highlight any other
occurrence of the words involved that falls between the sub-SpanQueries.
This looks hard to solve, but I think for starters SpanScorer needs a getter
for its Spans instance, and furthermore Spans needs a getChildSpans(), just
as Scorers expose child scorers.  I could see myself relaxing this
requirement because of its complexity and simply highlighting the entire
span, even if it could be a big highlight.
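
To illustrate the “(A && B) || (C && D)” case, here is a minimal sketch in
plain Java.  It is a toy model and not Lucene code: MatchedClauseSketch,
Node, term(), and(), or() and termsToHighlight() are hypothetical names made
up for illustration, and a Set of strings stands in for the document.  The
idea it shows is the one described above: only collect terms from
sub-clauses that actually matched, and throw away partial matches.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical toy model; none of these types are Lucene classes. */
final class MatchedClauseSketch {

    interface Node {
        /** Adds the terms to highlight to 'out'; returns true if this node matches. */
        boolean collect(Set<String> docTerms, Set<String> out);
    }

    static Node term(final String t) {
        return new Node() {
            public boolean collect(Set<String> docTerms, Set<String> out) {
                if (!docTerms.contains(t)) {
                    return false;
                }
                out.add(t);
                return true;
            }
        };
    }

    static Node and(final Node... children) {
        return new Node() {
            public boolean collect(Set<String> docTerms, Set<String> out) {
                Set<String> local = new LinkedHashSet<String>();
                for (Node child : children) {
                    if (!child.collect(docTerms, local)) {
                        return false; // one clause missing: discard partial matches
                    }
                }
                out.addAll(local);
                return true;
            }
        };
    }

    static Node or(final Node... children) {
        return new Node() {
            public boolean collect(Set<String> docTerms, Set<String> out) {
                boolean matched = false;
                for (Node child : children) {
                    Set<String> local = new LinkedHashSet<String>();
                    if (child.collect(docTerms, local)) {
                        out.addAll(local); // keep terms of matching clauses only
                        matched = true;
                    }
                }
                return matched;
            }
        };
    }

    static Set<String> termsToHighlight(Node query, Set<String> docTerms) {
        Set<String> out = new LinkedHashSet<String>();
        query.collect(docTerms, out);
        return out;
    }

    public static void main(String[] args) {
        Node query = or(and(term("A"), term("B")), and(term("C"), term("D")));
        Set<String> doc = new LinkedHashSet<String>(Arrays.asList("A", "B", "C"));
        System.out.println(termsToHighlight(query, doc)); // prints [A, B]
    }
}
```

The document contains ‘C’ but not ‘D’, so the second conjunction fails and
‘C’ is never reported.  In a real implementation the “did this clause
match” answer would presumably come from the child scorers rather than from
a set lookup.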

Perhaps the “Nuke Spans” effort might make this all much easier, but I
haven’t looked because it isn’t done yet.  It’s encouraging to
see Alan making recent progress there.

Any thoughts about any of this, guys?

p.s. When I’m done, I expect to have no problem getting open-source
permission from the sponsor commissioning this effort.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
