[ 
https://issues.apache.org/jira/browse/SOLR-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-6692:
-------------------------------
    Attachment: 
SOLR-6692_hl_maxAnalyzedChars_cumulative_multiValued,_and_more.patch

This patch includes several things. I'm sorry if it's doing too many things; I 
will break it down into separate patches and/or JIRA issue if advised to. I 
will certainly point out the separate parts in CHANGES.txt so users know what's 
going on.  Nevertheless the patch is small IMO and the net LOC is practically 0.

* hl.maxAnalyzedChars is budgeted across all values for the field being 
highlighted.  The budget is decremented by the field length at each iteration, 
and so progressively smaller limits are passed on through until it's <= 0 at 
which we we exit the loop. I added a test for this which randomly chooses 
between a field with term vectors or not.
* Refactor/extensibility:
** All methods that were private are now protected.  This widens the scope of 
possibilities for subclassing without having to fork this code.
** The no-arg constructor isn't used; I removed it and made the SolrCore field 
final as a clarification.  If anyone ever tried to use the no-arg constructor 
(I have), they would have soon realized that was not an option since an NPE 
would be thrown from init().
** I extracted a method getFieldValues whose sole job is to get the field 
values (Strings) to be highlighted given the provided field name & some other 
parameters. This is a useful extension point so that a subclass can get the 
field values from another field (i.e. the source of a copyField).  Till now, 
people had to use hl.requireFieldMatch=false which had its drawbacks in terms 
of highlight precision.  A side-benefit is that this method is aware of 
hl.maxMultiValuedToMatch and hl.maxAnalyzedChars, which will limit the values 
it returns. This aids the term-vector code path which can now in more 
circumstances see when there is only one value to highlight, and thus forgo 
wrapping the term vector stream with a OffsetWindow filter, which is a big 
penalty to avoid.
* hl.usePhraseHighlighter can now be toggled per-field.
* It includes a nocommit to confirm from SOLR-4656 (Erick) that the intention 
of hl.maxMultiValuedToMatch is to limit _fragments_, not matching values, 
despite the parameter name hinting otherwise.
* I fixed a bug with hl.maxMultiValuedToMatch in which it would decrement its 
counter when in fact the fragment didn't match. This bug would only occur when 
hl.preserveMulti=true.
* I fixed a small bug in ordering the fragments by score.  It used Math.round() 
which will coalesce values close to 0 to appear as the same weighting. Now it 
simply uses Float.compare(a,b).
* note: the code changes below the field value loop, except for the small score 
order bug I just mentioned, are purely code clean-up and don't change behavior. 
The code was more complex due to it thinking a fragment could be null when in 
fact by that point it's impossible.

> hl.maxAnalyzedChars should apply cumulatively on a multi-valued field
> ---------------------------------------------------------------------
>
>                 Key: SOLR-6692
>                 URL: https://issues.apache.org/jira/browse/SOLR-6692
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>            Reporter: David Smiley
>             Fix For: 5.0
>
>         Attachments: 
> SOLR-6692_hl_maxAnalyzedChars_cumulative_multiValued,_and_more.patch
>
>
> in DefaultSolrHighlighter, the hl.maxAnalyzedChars figure is used to 
> constrain how much text is analyzed before the highlighter stops, in the 
> interests of performance.  For a multi-valued field, it effectively treats 
> each value anew, no matter how much text it was previously analyzed for other 
> values for the same field for the current document. The PostingsHighlighter 
> doesn't work this way -- hl.maxAnalyzedChars is effectively the total budget 
> for a field for a document, no matter how many values there might be.  It's 
> not reset for each value.  I think this makes more sense.  When we loop over 
> the values, we should subtract from hl.maxAnalyzedChars the length of the 
> value just checked.  The motivation here is consistency with 
> PostingsHighlighter, and to allow for hl.maxAnalyzedChars to be pushed down 
> to term vector uninversion, which wouldn't be possible for multi-valued 
> fields based on the current way this parameter is used.
> Interestingly, I noticed Solr's use of FastVectorHighlighter doesn't honor 
> hl.maxAnalyzedChars as the FVH doesn't have a knob for that.  It does have 
> hl.phraseLimit which is a limit that could be used for a similar purpose, 
> albeit applied differently.
> Furthermore, DefaultSolrHighligher.doHighlightingByHighlighter should exit 
> early from it's field value loop if it reaches hl.snippets, and if 
> hl.preserveMulti=true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to