[
https://issues.apache.org/jira/browse/LUCENE-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yonik Seeley reopened LUCENE-6031:
----------------------------------
Reopening - Solr highlighting tests do not pass after this commit.
> TokenSources optimization, avoid sort
> -------------------------------------
>
> Key: LUCENE-6031
> URL: https://issues.apache.org/jira/browse/LUCENE-6031
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-6031.patch, LUCENE-6031.patch
>
>
> TokenSources.java, in the highlight module, is a facade that returns a
> TokenStream for a field by either un-inverting & converting the TermVector
> Terms, or by text re-analysis if TermVectors are unavailable or don't have
> the right options. TokenSources is used by the default highlighter, which is
> the most accurate highlighter we've got. When documents are large (say
> hundreds of kilobytes on up), I found that most of the highlighter's time
> was spent up-front un-inverting & converting the term vector to a
> TokenStream, not on the actual highlighting that follows. Much of that
> time was on a huge sort of hundreds of thousands of Tokens. Time was also
> spent doing lots of String conversion and char copying, and it used a lot of
> memory, too.
> In this patch, I overhauled TokenStreamFromTermPositionVector.java, and I
> removed similar logic from TokenSources that was used when positions
> weren't available but offsets were. This class can un-invert term vectors
> that have positions *and/or* offsets (at least one of the two). It doesn't
> sort. It places Tokens _directly_ into an array indexed by position; when
> positions aren't available, startOffset/8 serves as a substitute position.
> I've got a more light-weight Token inner class, used in place of the old
> (now deprecated) Token, whose instances ultimately form a linked list once
> the process is done. There is no String conversion; character copying is
> minimized. The Token array can be GC'ed after initialization; it's only
> needed during construction. (A standalone sketch of this approach appears
> below the quoted description.)
> Misc:
> * It implements reset() efficiently so it need not be wrapped in
> CachingTokenFilter (I'll supply a patch for this later).
> * It only fetches payloads if you ask for them by adding the attribute (the
> default highlighter won't add the attribute).
> * It exposes the underlying TermVector terms via a getter too, which is
> needed by another patch to follow later.
> A key assumption is that the position increment gap or first position isn't
> gigantic, as that would create wasted space, and the linked-list formation
> ultimately has to visit all the slots. We also assume that there aren't a
> ton of tokens at the same position, since inserting each new token into its
> slot in sorted order is O(N^2), where 'N' is the number of tokens
> co-occurring at that position.
> My performance testing using Lucene's benchmark module on a megabyte document
> showed >5x speedup, in conjunction with some other patches to be posted
> separately. This patch made the most difference.
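
For illustration only: below is a minimal, standalone Java sketch of the
"bucket by position, then link" technique described in the quoted paragraph
above. It is not code from the patch, and the names (UninvertSketch,
LightToken, linkByPosition) are hypothetical. It assumes tokens arrive
grouped by term (as they do when walking a term vector), buckets them into
an array indexed by position (with startOffset/8 as the substitute position
when positions are absent), keeps co-occurring tokens within a slot ordered
by start offset, and finally splices the slots into one linked list in a
single pass; no global sort is performed.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustration only: a hypothetical, standalone sketch of the "bucket by
 * position, then link" technique. This is not the actual patch code.
 */
public class UninvertSketch {

  /** Lightweight token: term text, position, offsets, and a 'next' link. */
  static final class LightToken {
    final char[] term;
    final int position;
    final int startOffset, endOffset;
    LightToken next; // filled in by the bucketing/linking passes

    LightToken(char[] term, int position, int startOffset, int endOffset) {
      this.term = term;
      this.position = position;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /**
   * Buckets tokens into an array indexed by position, then links all slots
   * into one list in position order. No global sort is performed.
   *
   * @param tokens      un-inverted tokens in arbitrary (term-major) order
   * @param maxPosition the largest position (or startOffset/8 substitute)
   * @return head of the linked list of tokens in position order
   */
  static LightToken linkByPosition(List<LightToken> tokens, int maxPosition) {
    // One slot per position; the array is only needed during construction
    // and becomes garbage once the linked list is built.
    LightToken[] slots = new LightToken[maxPosition + 1];

    for (LightToken t : tokens) {
      // If positions were unavailable, 'position' would have been derived
      // from startOffset / 8 when the LightToken was created.
      int pos = t.position;
      // Insert into the slot keeping co-occurring tokens ordered by
      // startOffset; assumes few tokens share a position (O(N^2) otherwise).
      if (slots[pos] == null || t.startOffset < slots[pos].startOffset) {
        t.next = slots[pos];
        slots[pos] = t;
      } else {
        LightToken prev = slots[pos];
        while (prev.next != null && prev.next.startOffset <= t.startOffset) {
          prev = prev.next;
        }
        t.next = prev.next;
        prev.next = t;
      }
    }

    // Single pass over the slots: splice the per-position chains together.
    LightToken head = null, tail = null;
    for (LightToken slot : slots) {
      for (LightToken t = slot; t != null; t = t.next) {
        if (head == null) {
          head = t;
        } else {
          tail.next = t;
        }
        tail = t;
      }
    }
    return head;
  }

  public static void main(String[] args) {
    // Tiny usage example: tokens arrive grouped by term, not by position.
    List<LightToken> tokens = new ArrayList<>();
    tokens.add(new LightToken("quick".toCharArray(), 1, 4, 9));
    tokens.add(new LightToken("brown".toCharArray(), 2, 10, 15));
    tokens.add(new LightToken("the".toCharArray(), 0, 0, 3));

    for (LightToken t = linkByPosition(tokens, 2); t != null; t = t.next) {
      System.out.println(new String(t.term) + " @ pos=" + t.position);
    }
  }
}

The per-slot insertion is where the "not a ton of tokens at the same
position" assumption matters: each insert is linear in the number of tokens
already sitting in that slot, while the final splicing pass visits every
slot exactly once.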
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]