[
https://issues.apache.org/jira/browse/LUCENE-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725582#comment-14725582
]
Paul Elschot commented on LUCENE-6764:
--------------------------------------
bq. store these payloads on a couple of bits
An EliasFanoSequence can do just that and is indexable by position.
The sequence is normally non-decreasing, so for random (small) numbers one
should encode their cumulative sums.
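To make the cumulative-sum idea concrete, here is a minimal sketch. The class and method names are hypothetical and not part of the patch. It assumes the payloads are small non-negative integers, turns them into a non-decreasing sequence, and recovers a single payload by subtracting adjacent sums:

```java
final class PayloadPrefixSums {
    /** Turns small non-negative payload values into a non-decreasing sequence of running sums. */
    static long[] toCumulativeSums(int[] payloads) {
        long[] sums = new long[payloads.length];
        long running = 0;
        for (int i = 0; i < payloads.length; i++) {
            if (payloads[i] < 0) {
                throw new IllegalArgumentException("payload must be non-negative: " + payloads[i]);
            }
            running += payloads[i];
            sums[i] = running;
        }
        return sums;
    }

    /** Recovers the payload at a position as the difference of two adjacent sums. */
    static int payloadAt(long[] sums, int position) {
        long previous = position == 0 ? 0 : sums[position - 1];
        return (int) (sums[position] - previous);
    }
}
```

The last cumulative sum is the sum of all payloads, which is the value that would be passed as the upperBound.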
From the javadocs as patched, here numValues is the number of positions with
payloads, and the upperBound is the sum of the payloads:
{noformat}
+ * The Elias-Fano encoding uses at most
+ * <p>
+ * <code>2 + ceil(log(upperBound/numValues))</code>
+ * <p>
+ * bits per encoded number. With <code>upperBound</code> in these bounds
+ * (<code>p</code> is an integer):
+ * <p>
+ * {@code 2**p < x[numValues-1] <= upperBound <= 2**(p+1)}
+ * <p>
+ * the number of bits per encoded number is minimized.
{noformat}
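As a rough illustration of the bound quoted above, the following sketch evaluates 2 + ceil(log(upperBound/numValues)) with integer arithmetic. It assumes the logarithm is base 2 and clamps it at zero when upperBound is smaller than numValues. The class and method names are hypothetical, and this only evaluates the documented bound, not the actual bit layout of the encoder:

```java
final class EliasFanoBound {
    /**
     * Evaluates 2 + ceil(log2(upperBound / numValues)), with the log term clamped at 0.
     * ceil(log2(u / n)) is the smallest l such that ceil(u / 2^l) <= n.
     */
    static int maxBitsPerValue(long numValues, long upperBound) {
        if (numValues <= 0 || upperBound < 0) {
            throw new IllegalArgumentException("need numValues > 0 and upperBound >= 0");
        }
        int l = 0;
        while (ceilDivByPowerOfTwo(upperBound, l) > numValues) {
            l++;
        }
        return 2 + l;
    }

    /** Computes ceil(value / 2^shift) without overflow for non-negative value. */
    private static long ceilDivByPowerOfTwo(long value, int shift) {
        long quotient = value >>> shift;
        long remainderMask = (1L << shift) - 1;
        return (value & remainderMask) != 0 ? quotient + 1 : quotient;
    }
}
```

For example, 1000 positions with payloads that sum to 3000 give a ratio of 3, so the bound works out to 2 + 2 = 4 bits per encoded number.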
The EliasFanoBytes can be used as a single payload per document (as currently
at LUCENE-5627), or maybe better as a docvalue.
For now this is only efficiently indexable by value (to implement advancing in
a DocIdSet).
Efficient indexing by position (index in the sequence) can be easily added.
These indexes have one entry per 256 values, so their size overhead is quite
small.
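To show why one index entry per 256 values is cheap, here is a simplified sketch of a sampled position index. It is not the Elias-Fano index from the patch: it works over a plain array of deltas, and all names are hypothetical. It stores one running sum per block of 256 values, so a lookup by position needs at most 256 additions:

```java
final class SampledPositionIndex {
    static final int SAMPLE_INTERVAL = 256;

    private final int[] deltas;
    // samples[k] holds the sum of deltas[0 .. k * SAMPLE_INTERVAL - 1].
    private final long[] samples;

    SampledPositionIndex(int[] deltas) {
        this.deltas = deltas.clone();
        int numSamples = (deltas.length + SAMPLE_INTERVAL - 1) / SAMPLE_INTERVAL;
        this.samples = new long[numSamples];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            if (i % SAMPLE_INTERVAL == 0) {
                samples[i / SAMPLE_INTERVAL] = running;
            }
            running += deltas[i];
        }
    }

    /** Returns the cumulative value at a position, i.e. the sum of deltas[0 .. position]. */
    long valueAt(int position) {
        int block = position / SAMPLE_INTERVAL;
        long value = samples[block];
        for (int i = block * SAMPLE_INTERVAL; i <= position; i++) {
            value += deltas[i];
        }
        return value;
    }

    /** Number of index entries, one per started block of SAMPLE_INTERVAL values. */
    int numIndexEntries() {
        return samples.length;
    }
}
```

With 1000 values this sketch keeps only 4 index entries.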
> Payloads should be compressed
> -----------------------------
>
> Key: LUCENE-6764
> URL: https://issues.apache.org/jira/browse/LUCENE-6764
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> I think we should at least try to do something simple, eg. deduplicate or
> apply simple LZ77 compression. For instance if you use enclosing html tags to
> give different weights to individual terms, there might be lots of
> repetitions as there are not that many unique html tags.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)