[
https://issues.apache.org/jira/browse/LUCENE-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519770#comment-16519770
]
David Smiley commented on LUCENE-8349:
--------------------------------------
This highlighter is impressive for not a lot of code! Great work [~romseygeek]!
Some observations:
* Offsets can be in postings, term vectors, or via analysis. I didn't think
you'd get to all 3 in your first patch/PR but you did! You could probably
"borrow" the "TermVectorFilteredLeafReader" mechanism from the UH to address
the hybrid case.
* I don't see any special multi-term query (e.g. wildcards) processing but I
suspect MTQs will _just work_. However, I'm really concerned about the
performance for when offsets are in postings, since the term explosion could be
huge and slow to advance. FWIW the UH solves this in two ways. First it visits
the Query tree looking for MTQs, and if found then switches away from offsets
in postings to analysis as a better trade-off. Granted it won't _always_ be the
best trade-off but it addresses the worst case. Also, if a simple term vector
(no offsets) are present then it can overlays it with the real index using
TermVectorFilteredLeafReader, thus avoiding analysis.
* The so-called "requireFieldMatch" option seen elsewhere is mandatory here –
the query must match with whatever fields it specifies. Perhaps this
OffsetsReader thing could map all terms() asked for into the field to be
highlighted?
* Obviously requires Weight.matches on the underlying query tree to work.
SpanQueries don't *yet* but I assume that'll be added soon in another issue.
* I didn't notice any passage scoring/ranking.
* I didn't notice any javadocs yet but you'll get there I assume.
* I really like your reuse of a Lucene Document with TextField entries to
represent a set of passages for the document's highlighted fields!
* I like that it highlights per-field value as opposed to the PH/UH which
highlights a joined string across all values, which is awkward for the UH. This
will allow for better short-circuiting options in the future. *However*, I'm
concerned about the performance implications from what I see here as I believe
the highlighting process starts over again on the query for each value,
effectively creating an O(N^2) problem as the underlying offsets needs to be
skipped over to get to the right window over and over again.
BTW some complexity in the UH that I don't see here is related to query tree
visiting, such as for MultiTermQueries and also for getting all the terms
(granted the latter is easy and not much code). This information is put to good
use by building a MemoryIndex collecting only the pertinent terms and not
bothering with the rest.
If this highlighter moves forward, I figure at some point you're going to have
to address visiting/walking queries (e.g. to look for MTQs) and/or perhaps
rewriting them. Consider these related issues: LUCENE-8184 LUCENE-8160
LUCENE-3041
> Highlighter based on Matches API
> --------------------------------
>
> Key: LUCENE-8349
> URL: https://issues.apache.org/jira/browse/LUCENE-8349
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I started trying to integrate the Matches API into the UnifiedHighlighter,
> but there's a fairly heavy impedance mismatch between the way the two of them
> work (eg Matches doesn't give you freqs, it's entirely lazy, the UH tries to
> do things by field rather than by doc). So instead, I thought I'd try and
> write a new highlighter based around Matches, and see what it looks like.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]