[jira] [Updated] (LUCENE-8178) Bulk operations for LongValues and Sorted[Set]DocValues

Nikolay Khitrin (JIRA) Tue, 20 Feb 2018 08:05:42 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nikolay Khitrin updated LUCENE-8178:
------------------------------------
    Attachment: graph.png

> Bulk operations for LongValues and Sorted[Set]DocValues
> -------------------------------------------------------
>
>                 Key: LUCENE-8178
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8178
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.2.1
>            Reporter: Nikolay Khitrin
>            Priority: Major
>         Attachments: LUCENE-8178-for-solr.patch, LUCENE-8178.patch, graph.png
>
>
> Iterating DocValues one document at a time with {{advanceExact}} and 
> {{nextOrd}}/{{ordValue}} is slow for bulk operations such as faceting. 
> Reading and unpacking integers in blocks is substantially faster, but 
> DocValues can currently be queried only for a single document at a time.
> To apply document-based bulk processing, {{DocIdSetIterator}} matches have to 
> be split into sequential docID runs and remapped to the underlying 
> {{LongValues}} positions.
>  After this transformation, relatively large linear scans can be performed 
> over the packed integers.
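
The run-splitting step described above can be illustrated with a minimal sketch. It is not taken from the patch: {{DocIdRuns}}, {{Run}} and {{split}} are hypothetical names, the input is a plain ascending int array rather than a {{DocIdSetIterator}}, and the remapping of docIDs to {{LongValues}} positions (needed for sparse fields) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the patch: groups ascending docIDs into consecutive runs. */
final class DocIdRuns {

    /** Half-open run [begin, end) of consecutive docIDs. */
    static final class Run {
        final int begin;
        final int end;

        Run(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + ", " + end + ")";
        }
    }

    /** Splits strictly ascending docIDs into maximal runs of consecutive values. */
    static List<Run> split(int[] sortedDocIds) {
        List<Run> runs = new ArrayList<>();
        if (sortedDocIds.length == 0) {
            return runs;
        }
        int begin = sortedDocIds[0];
        int prev = begin;
        for (int i = 1; i < sortedDocIds.length; i++) {
            int doc = sortedDocIds[i];
            if (doc != prev + 1) {
                // Gap found: close the current run and start a new one.
                runs.add(new Run(begin, prev + 1));
                begin = doc;
            }
            prev = doc;
        }
        // Close the final run.
        runs.add(new Run(begin, prev + 1));
        return runs;
    }
}
```

Each resulting run can then be handed to a range-based reader in a single call, instead of issuing one lookup per document.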
>  
> To do this, two new interfaces are introduced:
>  1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
>  2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int 
> count)}}).
> along with three new functions:
>  1. {{LongValues.forRange(long begin, long end, LongValuesCollector 
> collector)}}
>  2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer 
> collector)}}
>  3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer 
> collector)}}
> Each function has a reference implementation.
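
For readers without the patch at hand, a reference-style {{forRange}} might look roughly like the sketch below. The callback signature follows the description above; everything else is an assumption: {{SimpleLongValues}} and {{ArrayLongValues}} are hypothetical stand-ins for Lucene's {{LongValues}} (which exposes {{get(long index)}}), and treating {{end}} as exclusive is a guess, since the issue text does not say.

```java
/** Callback named in the issue; the exact declaration in the patch may differ. */
interface LongValuesCollector {
    void collectValue(long index, long value);
}

/** Hypothetical, simplified stand-in for org.apache.lucene.util.LongValues. */
abstract class SimpleLongValues {

    abstract long get(long index);

    /**
     * Reference (non-optimized) bulk read: visits every index in [begin, end).
     * The exclusive upper bound is an assumption, not confirmed by the issue.
     */
    void forRange(long begin, long end, LongValuesCollector collector) {
        for (long i = begin; i < end; i++) {
            collector.collectValue(i, get(i));
        }
    }
}

/** Hypothetical array-backed implementation, used only to exercise the sketch. */
final class ArrayLongValues extends SimpleLongValues {
    private final long[] data;

    ArrayLongValues(long[] data) {
        this.data = data;
    }

    @Override
    long get(long index) {
        return data[(int) index];
    }
}
```

An optimized subclass would override {{forRange}} to decode a whole block of packed values per call, while callers keep using the same collector.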
> Optimized versions of these functions are provided for:
>  1. {{DirectReader}} for non-32/64 bits per value cases (using 
> {{PackedInts.Decoder}}).
>  2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both 
> sparse and dense).
>  
> The measured Solr faceting speedup is up to 2-2.5x on a real index.
>  A patch for Solr {{DocValuesFacets}} is also provided as a separate file.
>  
> Implementation notes:
>  * {{OrdStatsCollector}} does not accept a document ID because passing one 
> would ruin performance for {{SortedSetDocValues}} due to excessive position 
> lookups.
>  * This patch is fully compatible with the Lucene 7.0 DocValues format.
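
To illustrate why a collector without a document ID is still enough for faceting, here is a minimal sketch of a counting collector. The two method signatures come from the issue description; the {{FacetCounts}} class and its fields are hypothetical and not part of the patch.

```java
/** Callback shape as described in the issue; the patch's declaration may differ. */
interface OrdStatsCollector {
    void collectOrd(long ord);

    void collectMissing(int count);
}

/** Hypothetical facet counter: tallies how often each ordinal occurs. */
final class FacetCounts implements OrdStatsCollector {
    final int[] counts;
    int missing;

    /** @param valueCount number of distinct ordinals in the field */
    FacetCounts(int valueCount) {
        this.counts = new int[valueCount];
    }

    @Override
    public void collectOrd(long ord) {
        // Assumes ordinals fit in an int, as they do for an int-indexed array.
        counts[(int) ord]++;
    }

    @Override
    public void collectMissing(int count) {
        // Documents without a value are reported in bulk, not one by one.
        missing += count;
    }
}
```

Facet counts depend only on how often each ordinal appears, so the producer can stream ordinals from a linear scan without reporting which document each one belongs to.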



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

