Nikolay Khitrin created LUCENE-8178:
---------------------------------------
Summary: Bulk operations for LongValues and Sorted[Set]DocValues
Key: LUCENE-8178
URL: https://issues.apache.org/jira/browse/LUCENE-8178
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 7.2.1
Reporter: Nikolay Khitrin
One-by-one DocValues iteration with {{advanceExact}} and {{nextOrd}}/{{ordValue}}
is really slow for bulk operations such as faceting. Reading and unpacking
integers in blocks is substantially faster, but DocValues can currently be
queried only one document at a time.
To apply document-based bulk processing, {{DocIdSetIterator}} matches have to be
split into runs of sequential docIDs and remapped to the underlying
{{LongValues}} positions.
After this transformation, relatively large linear scans can be performed over
packed integers.
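The run-splitting step can be sketched roughly as follows. This is only an illustration and not code from the patch: {{DocIdRuns}} and {{Run}} are hypothetical names, the matches are given as a sorted {{int[]}} instead of a {{DocIdSetIterator}}, and the field is assumed to be dense so that a docID equals its {{LongValues}} position (a sparse field would need an extra remapping step).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the patch: groups sorted docIDs into consecutive runs. */
final class DocIdRuns {

    /** Half-open run [begin, end) of consecutive docIDs. */
    static final class Run {
        final int begin;
        final int end;

        Run(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + ", " + end + ")";
        }
    }

    /** Splits strictly increasing docIDs into maximal runs of consecutive IDs. */
    static List<Run> split(int[] sortedDocIds) {
        List<Run> runs = new ArrayList<>();
        if (sortedDocIds.length == 0) {
            return runs;
        }
        int begin = sortedDocIds[0];
        int prev = begin;
        for (int i = 1; i < sortedDocIds.length; i++) {
            int doc = sortedDocIds[i];
            if (doc != prev + 1) {
                // Gap found: close the current run and start a new one.
                runs.add(new Run(begin, prev + 1));
                begin = doc;
            }
            prev = doc;
        }
        runs.add(new Run(begin, prev + 1));
        return runs;
    }

    public static void main(String[] args) {
        // prints [[3, 6), [9, 11), [42, 43)]
        System.out.println(split(new int[] {3, 4, 5, 9, 10, 42}));
    }
}
```

Each run can then be handed to a range-based scan in a single call instead of one lookup per document.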
To support this, two new interfaces are introduced:
1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int
count)}}).
along with three new functions, each with a reference implementation:
1. {{LongValues.forRange(long begin, long end, LongValuesCollector collector)}}
2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsCollector
collector)}}
3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsCollector
collector)}}
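As a rough illustration of what a reference {{forRange}} could look like, here is a minimal sketch. It is not taken from the patch: {{SimpleLongValues}} is a simplified stand-in for Lucene's {{LongValues}}, {{ForRangeDemo}} is a hypothetical caller, and treating {{end}} as exclusive is an assumption, since the description does not say.

```java
/** Callback with the signature given in the issue text. */
interface LongValuesCollector {
    void collectValue(long index, long value);
}

/** Simplified stand-in for Lucene's LongValues (random access by index). */
abstract class SimpleLongValues {

    public abstract long get(long index);

    /**
     * Reference implementation: one get() call per index in [begin, end).
     * The exclusive upper bound is an assumption. Optimized subclasses
     * could override this to decode whole blocks at once.
     */
    public void forRange(long begin, long end, LongValuesCollector collector) {
        for (long i = begin; i < end; i++) {
            collector.collectValue(i, get(i));
        }
    }
}

/** Hypothetical caller: counts how often each ordinal occurs in a range. */
final class ForRangeDemo {

    static long[] countOrds(SimpleLongValues ords, long begin, long end, int numOrds) {
        long[] counts = new long[numOrds];
        // The index argument is ignored here; only the value is counted.
        ords.forRange(begin, end, (index, ord) -> counts[(int) ord]++);
        return counts;
    }

    public static void main(String[] args) {
        final long[] data = {2, 0, 2, 1, 2};
        SimpleLongValues values = new SimpleLongValues() {
            @Override
            public long get(long index) {
                return data[(int) index];
            }
        };
        long[] counts = countOrds(values, 1, 5, 3);
        // prints 1 1 2
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

The point of routing everything through {{forRange}} is that the per-value virtual call to {{get}} disappears as soon as an implementation overrides the loop with a block decoder.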
Optimized versions of these functions are provided for:
1. {{DirectReader}}, for all bits-per-value cases other than 32 and 64 (using
{{PackedInts.Decoder}}).
2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both
sparse and dense).
The measured Solr faceting speedup is up to 2-2.5x on a real index.
A patch for Solr {{DocValuesFacets}} is also provided as a separate file.
Implementation notes:
* {{OrdStatsCollector}} does not accept a document id, because passing one would
ruin performance for {{SortedSetDocValues}} due to excessive position lookups.
* This patch is fully compatible with the Lucene 7.0 DocValues format.
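To make the docID-free callback concrete, here is a small sketch of a facet counter built on it. The two method signatures come from the description above; everything else is an assumption for illustration: {{FacetCounts}} and {{OrdForEachDemo}} are hypothetical names, per-document ordinals are held in a plain {{int[]}} with -1 meaning "no value", and reporting missing documents once as a batch is a guess at how {{collectMissing(int count)}} is meant to be used.

```java
/** Callback with the method signatures given in the issue text. */
interface OrdStatsCollector {
    void collectOrd(long ord);

    void collectMissing(int count);
}

/** Hypothetical collector: accumulates per-ordinal counts for faceting. */
final class FacetCounts implements OrdStatsCollector {
    final long[] counts;
    long missing;

    FacetCounts(int numOrds) {
        this.counts = new long[numOrds];
    }

    @Override
    public void collectOrd(long ord) {
        counts[(int) ord]++;
    }

    @Override
    public void collectMissing(int count) {
        missing += count;
    }
}

/** Hypothetical driver standing in for a reference forEach implementation. */
final class OrdForEachDemo {

    /**
     * ordsByDoc[doc] holds the ordinal of the document, or -1 if it has no
     * value (an assumed layout, not the Lucene on-disk format).
     */
    static void forEach(int[] ordsByDoc, int[] matchingDocs, OrdStatsCollector collector) {
        int missing = 0;
        for (int doc : matchingDocs) {
            int ord = ordsByDoc[doc];
            if (ord < 0) {
                missing++;
            } else {
                // No docID is passed: the collector only needs the ordinal.
                collector.collectOrd(ord);
            }
        }
        if (missing > 0) {
            collector.collectMissing(missing);
        }
    }

    public static void main(String[] args) {
        FacetCounts facets = new FacetCounts(3);
        forEach(new int[] {0, 2, -1, 2, 1}, new int[] {0, 1, 2, 3}, facets);
        // prints 1 0 2 missing=1
        System.out.println(facets.counts[0] + " " + facets.counts[1] + " "
                + facets.counts[2] + " missing=" + facets.missing);
    }
}
```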