[
https://issues.apache.org/jira/browse/LUCENE-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nikolay Khitrin updated LUCENE-8178:
------------------------------------
Attachment: graph.png
> Bulk operations for LongValues and Sorted[Set]DocValues
> -------------------------------------------------------
>
> Key: LUCENE-8178
> URL: https://issues.apache.org/jira/browse/LUCENE-8178
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 7.2.1
> Reporter: Nikolay Khitrin
> Priority: Major
> Attachments: LUCENE-8178-for-solr.patch, LUCENE-8178.patch, graph.png
>
>
> One-by-one DocValues iteration via {{advanceExact}} and
> {{nextOrd}}/{{ordValue}} is very slow for bulk operations such as faceting.
> Reading and unpacking integers in blocks is substantially faster, but
> DocValues can currently be queried only one document at a time.
> To apply document-based bulk processing, {{DocIdSetIterator}} matches have to
> be split into sequential docID runs and remapped to the underlying
> {{LongValues}} positions.
> After this transformation, relatively large linear scans can be performed
> over the packed integers.
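
The run-splitting step described above can be sketched in plain Java. This is only an illustration, not code from the patch: the class `DocIdRuns` and its `split` method are hypothetical names, the matches are assumed to arrive as an ascending `int[]` rather than a `DocIdSetIterator`, and the remapping of docIDs to `LongValues` positions (needed for sparse fields) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the patch): groups ascending docIDs into
// half-open runs [begin, end) of consecutive IDs, so that each run could be
// handed to a range-based bulk reader.
final class DocIdRuns {
    static List<int[]> split(int[] sortedDocIds) {
        List<int[]> runs = new ArrayList<>();
        if (sortedDocIds.length == 0) {
            return runs;
        }
        int begin = sortedDocIds[0];
        int prev = begin;
        for (int i = 1; i < sortedDocIds.length; i++) {
            int doc = sortedDocIds[i];
            if (doc != prev + 1) {
                // Gap found: close the current run and start a new one.
                runs.add(new int[] {begin, prev + 1});
                begin = doc;
            }
            prev = doc;
        }
        runs.add(new int[] {begin, prev + 1}); // close the last run
        return runs;
    }

    public static void main(String[] args) {
        for (int[] run : split(new int[] {3, 4, 5, 9, 12, 13})) {
            System.out.println("[" + run[0] + ", " + run[1] + ")");
        }
    }
}
```

The longer the runs, the larger the linear scans over the packed values; a match set with many isolated docIDs degenerates into many one-element runs.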
>
> To do this, two new interfaces
> 1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
> 2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int
> count)}}).
> and three new functions are introduced
> 1. {{LongValues.forRange(long begin, long end, LongValuesCollector
> collector)}}
> 2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer
> collector)}}
> 3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer
> collector)}}
> with reference implementations.
> Optimized versions of these functions are provided for:
> 1. {{DirectReader}} for non-32/64 bits per value cases (using
> {{PackedInts.Decoder}}).
> 2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both
> sparse and dense).
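
As a rough picture of what a non-optimized reference implementation of `forRange` might look like, here is a self-contained sketch. The method names `collectValue` and `forRange` and their parameter lists come from the issue text; everything else is an assumption: `SketchLongValues` is a stand-in for Lucene's `LongValues` so the example compiles without Lucene, `end` is assumed to be exclusive, and `ForRangeDemo` is a hypothetical caller.

```java
// Hypothetical caller: sums the values in [begin, end) through the bulk callback.
final class ForRangeDemo {
    static long sum(SketchLongValues values, long begin, long end) {
        long[] total = {0};
        values.forRange(begin, end, (index, value) -> total[0] += value);
        return total[0];
    }

    public static void main(String[] args) {
        long[] data = {5, 7, 11, 13};
        SketchLongValues values = new SketchLongValues() {
            @Override
            long get(long index) {
                return data[(int) index];
            }
        };
        System.out.println(sum(values, 1, 3)); // adds data[1] and data[2]
    }
}

// Callback shape as named in the issue: collectValue(long index, long value).
interface LongValuesCollector {
    void collectValue(long index, long value);
}

// Stand-in for org.apache.lucene.util.LongValues (assumption, not the real class).
abstract class SketchLongValues {
    abstract long get(long index);

    // Naive fallback: one get() per position. An optimized subclass would
    // instead decode whole blocks of packed integers and feed them to the collector.
    void forRange(long begin, long end, LongValuesCollector collector) {
        for (long i = begin; i < end; i++) {
            collector.collectValue(i, get(i));
        }
    }
}
```

The point of routing everything through one `forRange` call is that a subclass can override it with block decoding while callers stay unchanged.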
>
> The measured Solr faceting speedup is up to 2-2.5x on a real index.
> A patch for Solr {{DocValuesFacets}} is also provided as a separate file.
>
> Implementation notes:
> * {{OrdStatsCollector}} does not accept a document ID because that would ruin
> performance for {{SortedSetDocValues}} due to excessive position lookups.
> * This patch is fully compatible with the Lucene 7.0 DocValues format.
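
To illustrate why a collector without a document ID is enough for faceting, here is a minimal counting sketch. The method names `collectOrd(long ord)` and `collectMissing(int count)` are taken from the issue text, which calls the type `OrdStatsCollector` in the interface list and `OrdStatsConsumer` in the `forEach` signatures; the sketch uses the former. `FacetCounts` and `OrdCountsDemo` are hypothetical, and the callbacks are invoked by hand here rather than by a real `forEach` implementation.

```java
import java.util.Arrays;

// Hypothetical driver: simulates the callbacks a bulk forEach might issue.
final class OrdCountsDemo {
    public static void main(String[] args) {
        FacetCounts counts = new FacetCounts(3);
        counts.collectOrd(0);
        counts.collectOrd(2);
        counts.collectOrd(2);
        counts.collectMissing(2); // two matching documents had no value
        System.out.println(Arrays.toString(counts.counts) + " missing=" + counts.missing);
    }
}

// Interface shape as described in the issue text.
interface OrdStatsCollector {
    void collectOrd(long ord);

    void collectMissing(int count);
}

// Hypothetical facet counter: one counter per ordinal plus a missing count.
// It never needs to know which document an ordinal came from.
final class FacetCounts implements OrdStatsCollector {
    final int[] counts;
    int missing;

    FacetCounts(int valueCount) {
        counts = new int[valueCount];
    }

    @Override
    public void collectOrd(long ord) {
        counts[(int) ord]++; // assumes ordinals fit in an int-sized array
    }

    @Override
    public void collectMissing(int count) {
        missing += count;
    }
}
```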