[
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley reassigned SOLR-5855:
----------------------------------
Assignee: David Smiley
> Increasing solr highlight performance with caching
> --------------------------------------------------
>
> Key: SOLR-5855
> URL: https://issues.apache.org/jira/browse/SOLR-5855
> Project: Solr
> Issue Type: Improvement
> Components: highlighter
> Affects Versions: Trunk
> Reporter: Daniel Debray
> Assignee: David Smiley
> Fix For: Trunk
>
> Attachments: SOLR-5855-without-cache.patch, highlight.patch
>
>
> Hi folks,
> while investigating possible performance bottlenecks in the highlight
> component I discovered two places where we can save some CPU cycles.
> Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
> The first is in the method doHighlighting (lines 411-417):
> In the loop we try to highlight, on each document, every field that has
> been resolved from the params. OK, but why not skip those fields that are
> not present on the current document?
> So I changed the code from:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if( useFastVectorHighlighter( params, schema, fieldName ) )
>     doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>   else
>     doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
> }
> to:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if (doc.get(fieldName) != null) {
>     if( useFastVectorHighlighter( params, schema, fieldName ) )
>       doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>     else
>       doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>   }
> }
> The second place is where we try to retrieve the TokenStream from the
> document for a specific field.
> Line 472:
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
> where getTokenStreamWithOffsets is:
> public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
>   Fields vectors = reader.getTermVectors(docId);
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> Keep in mind that we currently hit the IndexReader n times, where n =
> requested rows (documents) * requested number of highlight fields.
> In my use case reader.getTermVectors(docId) takes around 150,000 to
> 250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.
> If we store the returned Fields vectors in a cache, these lookups take only
> 25,000 ns.
> I would suggest something like the following code in the
> doHighlightingByHighlighter method of the DefaultSolrHighlighter class (line
> 472):
> Fields vectors = null;
> SolrCache termVectorCache = searcher.getCache("termVectorCache");
> if (termVectorCache != null) {
>   vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
>   if (vectors == null) {
>     vectors = searcher.getIndexReader().getTermVectors(docId);
>     if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
>   }
> } else {
>   vectors = searcher.getIndexReader().getTermVectors(docId);
> }
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);
> and TokenSources class:
> public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> Timings on an index with 190,000 docs, with a numFound of 32,000 and 80
> different highlight fields:
> 4000 ms on 1000 docs without cache
> 639 ms on 1000 docs with cache
> 102 ms on 30 docs without cache
> 22 ms on 30 docs with cache
> I think queries that highlight only one field per document do not benefit
> much from a cache like this, which is why I think an optional cache would be
> the best solution here.
> As far as I can see, the FastVectorHighlighter uses more or less the same
> approach and could also benefit from this cache.
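
The get-or-load pattern proposed in the description can be tried out in isolation. The following is only a minimal sketch, not Solr code: the class PerDocCacheSketch, its loader function and the LinkedHashMap-based LRU are all hypothetical stand-ins (for SolrCache and for reader.getTermVectors(docId)), and the class is not thread-safe. It just counts how often the loader is actually invoked when every field of every document asks for the same per-document value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical per-document get-or-load cache; a stand-in, not Solr's SolrCache. */
public class PerDocCacheSketch<V> {
    private final int maxEntries;
    private final IntFunction<V> loader; // stands in for reader.getTermVectors(docId)
    private final Map<Integer, V> cache;
    private int loads = 0;

    public PerDocCacheSketch(int maxEntries, IntFunction<V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // An access-ordered LinkedHashMap drops the least recently used entry
        // once the map grows beyond maxEntries.
        this.cache = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
                return size() > PerDocCacheSketch.this.maxEntries;
            }
        };
    }

    /** Returns the cached value for docId, loading it on a miss. */
    public V get(int docId) {
        V value = cache.get(docId);
        if (value == null) {
            value = loader.apply(docId);
            loads++;
            // Mirror the proposal: only non-null results are cached.
            if (value != null) {
                cache.put(docId, value);
            }
        }
        return value;
    }

    /** Number of times the loader was actually invoked. */
    public int loads() {
        return loads;
    }

    public static void main(String[] args) {
        int rows = 30;   // requested documents
        int fields = 80; // requested highlight fields
        PerDocCacheSketch<String> cache =
                new PerDocCacheSketch<>(rows, docId -> "vectors-" + docId);
        for (int docId = 0; docId < rows; docId++) {
            for (int f = 0; f < fields; f++) {
                cache.get(docId); // one lookup per (document, field) pair
            }
        }
        // prints "lookups=2400 loads=30"
        System.out.println("lookups=" + (rows * fields) + " loads=" + cache.loads());
    }
}
```

With 30 rows and 80 fields the loop performs rows * fields = 2400 lookups, but the loader runs only once per document. How much time that saves in a real index depends on the cost of the underlying term vector read, which this sketch does not model.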