[
https://issues.apache.org/jira/browse/LUCENE-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980687#comment-14980687
]
Adrien Grand commented on LUCENE-6863:
--------------------------------------
I ran some benchmarks with the geoname dataset which has a few sparse fields:
- cc2: 3.2% of documents have this field, which has 573 unique values
- admin4: 4.3% of documents have this field, which has 102950 unique values
- admin3: 10.2% of documents have this field, which has 73120 unique values
- admin2: 45.3% of documents have this field, which has 30603 unique values
First I enabled sparse compression on all fields, regardless of density to see
how this compares to the delta compression that we use by default, and then ran
two kinds of queries:
- queries on a random partition of the index, which I guess would be the case
when you have true sparse fields
- a query only on documents that have a value, which I guess would be more
realistic if you store several types of data in the same index that don't have
the same fields
||Field||disk usage for ordinals||memory usage with sparse compression enabled||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2 | -88%|1680 bytes|-27%|+25%|+58%|+208%|
|admin4|-86%|568 bytes|-20%|+7%|-20%|+214%|
|admin3|-67%|1312 bytes|+11%|+57%|+42%|+236%|
|admin2 |+17%|2904 bytes|+132%|+275%|+331%|+221%|
The reduction in disk usage is significant, but so is the slowdown, especially
when running a query that only matches docs that have a value. However, memory
usage looks acceptable to me for 10M docs.
I couldn't test with a 3% threshold as even the rarest field is present in 3.2%
of documents, but I updated the heuristic to require at least 1024 docs in the
segment (as Robert suggested) and fewer than 5% of docs to have a value:
||Field||memory usage due to sparse compression||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2 | 1680 bytes|-10%|+34%|+62%|+214%|
|admin4|568 bytes|-7%|+20%|-14%|+241%|
|admin3|576 bytes|+9%|+7%|+11%|+10%|
|admin2 |1008 bytes|+1%|+8%|+9%|+11%|
To my surprise, admin2 and admin3 were still using sparse compression on some
segments. The reason is that documents with sparse values are not uniformly
distributed in the dataset but rather clustered. I suspect this explains part
of the slowdown for admin2/admin3; HotSpot may also not like having more
implementations to deal with.
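
For illustration, the per-segment rule described above (at least 1024 docs in
the segment and fewer than 5% of docs with a value) could be sketched roughly
as follows. This is not code from the patch: the class name, method name and
constants are hypothetical, and the real check may be structured differently.
It also shows why a field that is dense overall can still pick sparse
compression on a segment where few docs have a value.

```java
/** Hypothetical sketch of the sparsity heuristic; not taken from the patch. */
final class SparseHeuristic {
    // Thresholds as stated in the comment above (assumed to apply per segment).
    static final long MIN_DOCS_IN_SEGMENT = 1024;
    static final long MAX_PERCENT_WITH_VALUE = 5;

    private SparseHeuristic() {}

    /**
     * Returns true if the segment has at least 1024 docs and
     * strictly fewer than 5% of them have a value for the field.
     */
    static boolean useSparse(long docsWithValue, long maxDoc) {
        if (maxDoc < MIN_DOCS_IN_SEGMENT) {
            return false;
        }
        // Integer arithmetic avoids floating-point rounding at the boundary.
        return docsWithValue * 100 < maxDoc * MAX_PERCENT_WITH_VALUE;
    }

    public static void main(String[] args) {
        // A segment where 45% of docs have a value: stays dense.
        System.out.println(useSparse(450_000, 1_000_000)); // false
        // Another segment of the same field where values are rare: goes sparse.
        System.out.println(useSparse(30, 2_000)); // true
        // Too few docs in the segment: never sparse.
        System.out.println(useSparse(10, 500)); // false
    }
}
```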
> Store sparse doc values more efficiently
> ----------------------------------------
>
> Key: LUCENE-6863
> URL: https://issues.apache.org/jira/browse/LUCENE-6863
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Attachments: LUCENE-6863.patch
>
>
> For both NUMERIC fields and ordinals of SORTED fields, we store data in a
> dense way. As a consequence, if you have only 1000 documents out of 1B that
> have a value, and 8 bits are required to store those 1000 numbers, we will
> not require 1KB of storage, but 1GB.
> I suspect this mostly happens in abuse cases, but still it's a pity that we
> explode storage requirements. We could try to detect sparsity and compress
> accordingly.