[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-7438:
---------------------------------
Attachment: LUCENE_7438_UH_benchmark.patch
This is an update to the benchmark patch. I removed the existing benchmark
highlighting abstraction (which I felt was a bit obsolete), and with it the
two existing highlighting benchmark classes: SearchTravRetHighlightTask and
SearchTravRetVectorHighlightTask. The patch actually replaces
SearchTravRetHighlightTask with the one from the previous patch, so the
class name still exists, but internally it is very different: it tests all
highlighters in all offset modes. It has the 2 highlighters-*.alg files added
in the last patch, and I kept the 3 query-*.txt files too. I removed the
existing highlight .alg files except for one, which I updated:
standard-highlights-notv.alg -> highlights.alg. I also added a "UH" highlight
mode to the benchmark, which is the UH's default mode of operation, in which
it detects the offset source based on FieldInfo.
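
To illustrate what "detects the offset source based on FieldInfo" might look like, here is a minimal, self-contained sketch. It does not use Lucene at all: the class OffsetSourceSketch, the detect method, and its two boolean parameters are hypothetical stand-ins for what a FieldInfo reports about a field, and the precedence order (postings offsets first, then term vectors, then analysis) is an assumption drawn from the offset sources named in the issue description, not a copy of the patch.

```java
// Hypothetical sketch only: none of these names are Lucene API.
final class OffsetSourceSketch {

    /** Possible places to read highlight offsets from, as listed in the issue description. */
    enum OffsetSource { POSTINGS, POSTINGS_WITH_TERM_VECTORS, TERM_VECTORS, ANALYSIS }

    /**
     * Picks an offset source from two facts about a field.
     *
     * @param offsetsInPostings assumed flag: the field was indexed with offsets in postings
     * @param hasTermVectors    assumed flag: the field stores term vectors
     */
    static OffsetSource detect(boolean offsetsInPostings, boolean hasTermVectors) {
        if (offsetsInPostings) {
            // Hybrid case: postings for offsets plus "light" term vectors for the terms.
            return hasTermVectors
                    ? OffsetSource.POSTINGS_WITH_TERM_VECTORS
                    : OffsetSource.POSTINGS;
        }
        if (hasTermVectors) {
            return OffsetSource.TERM_VECTORS;
        }
        // Nothing usable in the index: fall back to re-analyzing the text.
        return OffsetSource.ANALYSIS;
    }

    public static void main(String[] args) {
        // Print the chosen source for each combination of the two flags.
        for (boolean postings : new boolean[] {true, false}) {
            for (boolean vectors : new boolean[] {true, false}) {
                System.out.println("offsetsInPostings=" + postings
                        + ", hasTermVectors=" + vectors
                        + " -> " + detect(postings, vectors));
            }
        }
    }
}
```

In this sketch, a field with neither postings offsets nor term vectors falls through to ANALYSIS, which matches the description's point that analysis is the most expensive source and is used only when nothing else is available.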
I tweaked the build.xml & .gitignore to avoid work/ and temp/ and to allow them
to be symbolic links.
The only thing I feel bad about is outright removing some tests related to the
old highlight abstraction... meanwhile there are no new tests for this new one.
I rationalize this as follows: it's better to finally have a more up-to-date way
to benchmark all highlighters in all modes (and in a consistent way) than it is
to have something incomplete that is nevertheless tested.
I'll commit this in a couple days.
> UnifiedHighlighter
> ------------------
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 6.2
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch,
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is
> able to highlight using offsets in either postings, term vectors, or from
> analysis (a TokenStream). Lucene’s existing highlighters are mostly
> demarcated along offset source lines, whereas here it is unified -- hence
> this proposed name. In this highlighter, the offset source strategy is
> separated from the core highlighting functionality. The UnifiedHighlighter
> further improves on the PostingsHighlighter’s design by supporting accurate
> phrase highlighting using an approach similar to the standard highlighter’s
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset
> source strategy that utilizes postings and “light” term vectors (i.e. just the
> terms) for highlighting multi-term queries (wildcards) without resorting to
> analysis. Phrase highlighting and wildcard highlighting can both be disabled
> if you’d rather highlight a little faster albeit not as accurately reflecting
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the
> other highlighters and the results were exciting! It’s tempting to share
> those results but it’s definitely due for another benchmark, so we’ll work on
> that. Performance was the main motivator for creating the UnifiedHighlighter,
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy
> requirements) wasn’t fast enough, even with term vectors along with several
> improvements we contributed back, and even after we forked it to highlight in
> multiple threads.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)