[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7438:
---------------------------------
    Attachment: LUCENE_7438_UH_benchmark.patch

This is an update to the benchmark patch.  I removed the existing benchmark 
highlighting abstraction (that I felt was a bit obsolete), and with it the 
existing two highlighting benchmark classes: SearchTravRetHighlightTask, 
SearchTravRetVectorHighlightTask.  The patch actually replaces 
SearchTravRetHighlightTask with the one from the previous patch, and so by 
class name it still exists, but is internally very different as it tests all 
highlighters in all offset modes.  It has the 2 highlighters-*.alg files added 
in the last patch, and I kept the 3 query-*.txt files too.  I removed the 
existing highlight .alg files except for one which I updated -- 
standard-highlights-notv.alg -> highlights.alg.  I also added a "UH" highlight 
mode to the benchmark, which is the UH's default mode of operation, in which it 
detects the offset source based on FieldInfo.
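
To illustrate what "detects the offset source based on FieldInfo" could look 
like, here is a minimal, self-contained sketch.  It does not use Lucene 
classes: FieldFacts and the OffsetSource enum are hypothetical stand-ins, and 
the precedence shown (postings first, then term vectors, then analysis) is an 
assumption for illustration, not a copy of the patch.

```java
// Sketch only: models offset-source detection with hypothetical types.
class OffsetSourceSketch {

    // The offset sources named in the issue description, plus the hybrid one.
    enum OffsetSource { POSTINGS, POSTINGS_WITH_TERM_VECTORS, TERM_VECTORS, ANALYSIS }

    // Hypothetical stand-in for the facts one would read from a field's FieldInfo.
    static final class FieldFacts {
        final boolean offsetsInPostings; // field indexed with offsets in postings
        final boolean hasTermVectors;    // field stores term vectors

        FieldFacts(boolean offsetsInPostings, boolean hasTermVectors) {
            this.offsetsInPostings = offsetsInPostings;
            this.hasTermVectors = hasTermVectors;
        }
    }

    // Assumed precedence: prefer postings offsets, then term vectors,
    // and fall back to re-analysis when neither is available.
    static OffsetSource detect(FieldFacts field) {
        if (field == null) {
            return OffsetSource.ANALYSIS; // unknown field: nothing indexed to rely on
        }
        if (field.offsetsInPostings) {
            return field.hasTermVectors
                    ? OffsetSource.POSTINGS_WITH_TERM_VECTORS
                    : OffsetSource.POSTINGS;
        }
        if (field.hasTermVectors) {
            return OffsetSource.TERM_VECTORS;
        }
        return OffsetSource.ANALYSIS;
    }

    public static void main(String[] args) {
        System.out.println(detect(new FieldFacts(true, false)));  // POSTINGS
        System.out.println(detect(new FieldFacts(false, true)));  // TERM_VECTORS
        System.out.println(detect(new FieldFacts(false, false))); // ANALYSIS
    }
}
```

The other benchmark modes would pin one offset source explicitly; the "UH" 
mode is the one that leaves the choice to this kind of detection.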

I tweaked the build.xml & .gitignore to avoid work/ and temp/ and to allow them 
to be symbolic links.

The only thing I feel bad about is outright removing some tests related to the 
old highlight abstraction, while there are no new tests for the new one.  I 
rationalize this as follows: it's better to finally have an up-to-date, 
consistent way to benchmark all highlighters in all modes than it is to have 
something incomplete that is nevertheless tested.

I'll commit this in a couple days.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.


