[
https://issues.apache.org/jira/browse/LUCENE-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-6034:
---------------------------------
Attachment: LUCENE-6034_Simplify_MemoryIndex.patch
The attached patch only simplifies MemoryIndex (including inlining the separate
MemoryIndexNormDocValues.java file). The net change is about -90 LOC.
> MemoryIndex should be able to wrap TermVector Terms
> ---------------------------------------------------
>
> Key: LUCENE-6034
> URL: https://issues.apache.org/jira/browse/LUCENE-6034
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 5.0
>
> Attachments: LUCENE-6034.patch, LUCENE-6034.patch, LUCENE-6034.patch,
> LUCENE-6034_Simplify_MemoryIndex.patch
>
>
> The default highlighter has a "WeightedSpanTermExtractor" that uses
> MemoryIndex for certain queries -- basically phrases, SpanQueries, and the
> like. For lots of text, this aspect of highlighting is time-consuming and
> consumes a fair amount of memory. What also consumes memory is that it wraps
> the tokenStream in CachingTokenFilter in this case. But if the underlying
> TokenStream is actually from TokenSources (wrapping TermVector Terms), this
> is all needless! Furthermore, MemoryIndex doesn't support payloads.
> The patch here has 3 aspects to it:
> * Internal refactoring of MemoryIndex to simplify it by keeping the fields
> sorted in a TreeMap. This reduced the LOC for this file, even with the
> other features I added. It also puts the FieldInfo on the Info, so there is
> one less data structure to keep around. I suppose that if there is a huge
> variety of fields in MemoryIndex, the aggregated N*log(N) cost of field
> lookups could add up, but that seems very unlikely. I also brought in
> MemoryIndexNormDocValues as a simple anonymous inner class; it's
> super-simple after all and not worth having in a separate file.
> * New MemoryIndex.addField(String fieldName, Terms) method. In this case,
> MemoryIndex is providing the supporting wrappers around the underlying Terms
> so that it appears as an Index. In so doing, MemoryIndex supports payloads
> for such fields.
> * WeightedSpanTermExtractor now detects when TokenSources is wrapping Terms
> and passes those Terms directly to MemoryIndex.
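
To illustrate the first aspect only, here is a minimal toy sketch of keeping
fields sorted in a TreeMap and carrying per-field metadata on the Info object
itself. It is not the patch and does not use Lucene: the class
SortedFieldsSketch, its methods, and the fieldNumber/tokenCount members are all
hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: these classes are hypothetical and are not Lucene's MemoryIndex.
class SortedFieldsSketch {

    /** Per-field state; holding the field number here avoids a second name-to-metadata map. */
    static final class Info {
        final int fieldNumber; // stand-in for what a FieldInfo would carry
        int tokenCount;

        Info(int fieldNumber) {
            this.fieldNumber = fieldNumber;
        }
    }

    // TreeMap keeps field names sorted at all times, so iteration needs no extra sort step.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    /** Records one token for the field, creating the field's Info on first use. */
    void addToken(String fieldName) {
        Info info = fields.get(fieldName); // O(log N) lookup per call
        if (info == null) {
            info = new Info(fields.size()); // numbers assigned in insertion order
            fields.put(fieldName, info);
        }
        info.tokenCount++;
    }

    /** Field names in sorted order, as a consumer iterating the fields would see them. */
    List<String> fieldNamesInOrder() {
        return new ArrayList<>(fields.keySet());
    }

    Info info(String fieldName) {
        return fields.get(fieldName);
    }

    public static void main(String[] args) {
        SortedFieldsSketch index = new SortedFieldsSketch();
        index.addToken("title");
        index.addToken("body");
        index.addToken("body");
        System.out.println(index.fieldNamesInOrder());
    }
}
```

The trade-off is the one noted above: each lookup costs O(log N) in the number
of fields rather than the O(1) of a hash map, in exchange for never having to
sort the fields separately.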