[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-7589:
---------------------------------
Attachment: LUCENE-7589.patch
Here is a patch. The doc values consumer computes space usage for two cases:
all values use the same number of bits per value, or values are split into
blocks of 16384 values. If using blocks saves 10% or more of disk usage, it
encodes each block with its own required number of bits per value.
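
To make the comparison concrete, here is a minimal sketch of the decision described above. It is not the code from the patch: the class name, the method names and the 72-bit per-block header cost ({{HEADER_BITS}}) are assumptions made for illustration, and only the 16384 block size and the 10% threshold come from the description.

```java
public class BlockBitsEstimate {
    static final int BLOCK_SIZE = 16384; // block size mentioned in the patch description
    static final int HEADER_BITS = 72;   // assumed per-block overhead (8-bit width + 64-bit min), hypothetical

    /** Bits needed to store (max - min) as an unsigned delta; 0 when all values are equal. */
    static int bitsRequired(long min, long max) {
        long delta = max - min; // an overflowing delta is treated as unsigned and needs 64 bits
        return 64 - Long.numberOfLeadingZeros(delta);
    }

    /** Estimated size in bits when every value in values[start, end) shares one bit width. */
    static long rangeBits(long[] values, int start, int end) {
        if (start >= end) {
            return 0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = start; i < end; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return (long) (end - start) * bitsRequired(min, max);
    }

    /** Estimated size in bits when each block of BLOCK_SIZE values gets its own bit width. */
    static long blockedBits(long[] values) {
        long total = 0;
        for (int start = 0; start < values.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, values.length);
            total += rangeBits(values, start, end) + HEADER_BITS;
        }
        return total;
    }

    /** True when the blocked layout is at least 10% smaller than the single-width layout. */
    static boolean useBlocks(long[] values) {
        long single = rangeBits(values, 0, values.length);
        long blocked = blockedBits(values);
        return single > 0 && blocked * 10 <= single * 9; // integer form of blocked <= 0.9 * single
    }
}
```

With four blocks of small values and one large outlier in the first block, only that block pays for the outlier's width, so the blocked estimate should win comfortably; without the outlier the per-block headers make the blocked layout slightly larger and the single-width encoding is kept.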
I kept the block size rather high, since this impl can only jump forward
{{blockSize}} documents at a time, so a high value like 16384 hopefully keeps
performance good. In the future we might want to look into leveraging the
sequential access pattern even more (to do run-length encoding, for instance)
and maybe have e.g. a skip list to handle the big jumps, like postings do. I
think this patch is a good first (baby) step in that direction.
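
The forward-only jumping can be pictured with the sketch below. It is an illustration under stated assumptions, not the reader from the patch: {{ForwardBlockCursor}} and its members are hypothetical names, block headers are ignored, and the per-block bit widths are passed in as an array where a real reader would read them from the file. Because each block has its own width, the start of a later block is only known after walking over every block before it, which is why a large block size keeps the number of hops low.

```java
public class ForwardBlockCursor {
    private final int blockSize;
    private final int[] bitsPerValueByBlock; // hypothetical stand-in for widths read from block headers
    private int block = 0;                   // index of the block the cursor is currently on
    private long blockStartBits = 0;         // bit offset where the current block's values begin
    private int hops = 0;                    // number of block-sized jumps performed so far

    public ForwardBlockCursor(int blockSize, int[] bitsPerValueByBlock) {
        this.blockSize = blockSize;
        this.bitsPerValueByBlock = bitsPerValueByBlock;
    }

    /** Moves forward, one block at a time, to the block containing doc. */
    private void advance(int doc) {
        int targetBlock = doc / blockSize;
        if (targetBlock < block) {
            throw new IllegalArgumentException("cursor is forward-only");
        }
        while (block < targetBlock) {
            // The next block starts after blockSize values of the current block's width.
            blockStartBits += (long) blockSize * bitsPerValueByBlock[block];
            block++;
            hops++;
        }
    }

    /** Bit offset of the value for doc, ignoring any header bytes. */
    public long bitOffsetOf(int doc) {
        advance(doc);
        return blockStartBits + (long) (doc % blockSize) * bitsPerValueByBlock[block];
    }

    public int hops() {
        return hops;
    }
}
```

A skip list, as used for postings, would let the cursor cover many blocks per hop instead of one.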
> Prevent outliers from raising the number of bits of everyone with numeric doc
> values
> ------------------------------------------------------------------------------------
>
> Key: LUCENE-7589
> URL: https://issues.apache.org/jira/browse/LUCENE-7589
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It
> was done this way because it was faster, but it also means a single outlier
> can significantly increase the space requirements. I think we should have
> protection against that.