[
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated LUCENE-7253:
-------------------------------------
Fix Version/s: master (7.0)
> Make sparse doc values and segments merging more efficient
> -----------------------------------------------------------
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 5.5, 6.0
> Reporter: Pawel Rog
> Labels: performance
> Fix For: master (7.0)
>
>
> Doc Values were optimized recently to efficiently store sparse data.
> Unfortunately, there is still a big problem with Doc Values merges for sparse
> fields. Imagine an index of 1 billion documents: it makes no difference whether
> all documents have a value for a given field or only a single document does;
> segment merge time is the same in both cases. In most cases this is not a
> problem, but there are several scenarios in which one can expect many fields
> with sparse doc values.
> Let me describe an example. During performance tests of a system with a large
> number of sparse fields, I realized that Doc Values merges were the bottleneck.
> I had hundreds of different numeric fields, and each document contained only a
> small subset of them; an average document carried 5-7 different numeric values.
> As you can see, the data in these fields was very sparse. It turned out that
> the ingestion process was CPU-bound. Most of the CPU time was spent in Doc
> Values related methods (SingletonSortedNumericDocValues#setDocument,
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued,
> DocValuesConsumer$4$1#setNext, ...), mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing them
> with a smaller number of denser fields. This helped a lot but complicated field
> naming.
> I am not very familiar with the Doc Values source code, but I have a small
> suggestion for improving Doc Values merges of sparse fields. I noticed that Doc
> Values producers and consumers use Iterators. Take numeric Doc Values as an
> example: would it be possible to replace the Iterator that "travels" through
> all documents with an Iterator over the collection of non-empty values? Of
> course, this would require storing an object (instead of a plain numeric value)
> that holds both the value and the document ID. Such an iterator could
> significantly improve the merge time of sparse Doc Values fields (see the
> sketch below). IMHO this won't cause much overhead for dense structures, but it
> could be a game changer for sparse ones.
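> To make the idea concrete, here is a minimal sketch of such an iterator. This
> is hypothetical code, not anything that exists in Lucene; the class names
> SparseNumericIterator and DocIdValue are made up for illustration:
> {code}
> import java.util.Iterator;
> import java.util.NoSuchElementException;
>
> // Minimal sketch: walk only the documents that actually carry a value,
> // instead of all maxDoc documents of the segment.
> class SparseNumericIterator implements Iterator<DocIdValue> {
>   private final int[] docIDs;   // docs that have a value, in increasing order
>   private final long[] values;  // values[i] belongs to docIDs[i]
>   private int upto;
>
>   SparseNumericIterator(int[] docIDs, long[] values) {
>     this.docIDs = docIDs;
>     this.values = values;
>   }
>
>   @Override
>   public boolean hasNext() {
>     return upto < docIDs.length;
>   }
>
>   @Override
>   public DocIdValue next() {
>     if (!hasNext()) {
>       throw new NoSuchElementException();
>     }
>     return new DocIdValue(docIDs[upto], values[upto++]);
>   }
> }
>
> // Simple holder pairing a document ID with its value.
> class DocIdValue {
>   final int docID;
>   final long value;
>   DocIdValue(int docID, long value) {
>     this.docID = docID;
>     this.value = value;
>   }
> }
> {code}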
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>                            new Iterable<Number>() {
>                              @Override
>                              public Iterator<Number> iterator() {
>                                return new NumericIterator(maxDoc, values, docsWithField);
>                              }
>                            });
> {code}
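> For comparison, the proposal would conceptually look like the call below.
> addSparseNumericField does not exist in Lucene, and docsWithValue is a made-up
> structure holding only the non-empty entries; this only shows the shape of the
> API, reusing the sketch types from above:
> {code}
> // Hypothetical consumer call: flush hands the consumer only the non-empty
> // (docID, value) pairs instead of an Iterable over all maxDoc documents.
> dvConsumer.addSparseNumericField(fieldInfo,
>                                  new Iterable<DocIdValue>() {
>                                    @Override
>                                    public Iterator<DocIdValue> iterator() {
>                                      return new SparseNumericIterator(docsWithValue, values);
>                                    }
>                                  });
> {code}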
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int) pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
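> With a sparse representation, addValue would not need to fill holes at all; it
> could simply record the doc ID next to the value. This is only a sketch, and
> pendingDocs is a made-up second builder for the doc IDs:
> {code}
> // Sketch only: record the doc ID alongside the value, so "pending" is never
> // padded with MISSING for documents that have no value.
> public void addValue(int docID, long value) {
>   pendingDocs.add(docID);  // hypothetical builder holding doc IDs
>   pending.add(value);
> }
> {code}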
> It turns out that the variable called pending is used only internally in
> NumericDocValuesWriter. I know pending is a PackedLongValues, and it wouldn't
> be a good idea to replace it with a different class (some kind of list),
> because that might break DV performance for dense fields. I hope someone can
> suggest interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance
> could start here.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]