[
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119918#comment-14119918
]
Adrien Grand commented on LUCENE-5914:
--------------------------------------
bq. Ideally we should enable using a biggish chunk_size during compression to
improve compression, and decompress only a single document (not depending on
chunk_size), just like you proposed here (if I figured it out correctly?)
Exactly, this is one of the two proposed options. The only overhead is that
you would need to read the shared dictionary and keep it in memory (but
that is a single call to readBytes and its size can be controlled, so it
should not be an issue).
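To make this concrete, here is a minimal sketch of what the read path could look like under those assumptions. This is only an illustration, not the patch's code: the JDK ships no dictionary-aware LZ4, so it uses java.util.zip's Deflate preset-dictionary support instead, and names such as decompressDoc, dict and compressedDoc are invented for the example.
{code:java}
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class SharedDictRead {
  // Decompress a single document against a shared dictionary that was read
  // once (a single readBytes-like call) and kept in memory for the whole block.
  static byte[] decompressDoc(byte[] dict, byte[] compressedDoc, int originalLength)
      throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressedDoc);
    byte[] restored = new byte[originalLength];
    int n = inflater.inflate(restored);
    if (n == 0 && inflater.needsDictionary()) {
      // The stream was written against a preset dictionary: provide it and retry.
      inflater.setDictionary(dict);
      inflater.inflate(restored);
    }
    inflater.end();
    return restored;
  }
}
{code}
Whatever the actual codec, the point is the same: once the dictionary is in memory, only the bytes of the requested document need to be decompressed, independently of the chunk size.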
bq. Usually, such data is highly compressible (imagine all these low-cardinality
fields like the color of something...) and even some basic compression
does the trick.
Agreed, this is why I'd prefer the "low-overhead" option to be
something cheap rather than no compression at all: data usually has lots of
patterns, and even something as simple as LZ4 manages to reach interesting
compression ratios.
{quote}
Conclusion: compression is great, and anything that helps tweak this balance
(CPU effort / IO effort) smoothly across the indexing and retrieval phases makes
Lucene's use-case coverage broader. (e.g. "I want to afford more CPU during
indexing, and less CPU during retrieval", a static coder being the extreme case
of this...)
I am not sure I understood exactly if and how this patch is going to help in
such cases (how do we achieve reasonable compression if we do per-document
compression for small documents? Reusing dictionaries from previous chunks?
Static dictionaries...).
{quote}
The trade-off that this patch makes is:
- keep indexing fast enough in all cases
- allow trading random-access speed to documents for index compression
The current patch provides two options:
* either we compress documents in blocks like today but with Deflate instead
of LZ4: this provides good compression ratios but makes random access quite
slow, since you need to decompress a whole block of documents every time you
want to access a single document,
* or we still group documents into blocks but compress them individually,
using the compressed representation of the previous documents as a dictionary.
I'll try to explain the 2nd option better: it works well because LZ4 mostly
deduplicates sequences of bytes in a stream. So imagine that you have the
following 3 documents in a block:
1. abcdefghabcdwxyz
2. abcdefghijkl
3. abcdijklmnop
We will first compress document 1. Given that it is the first document in the
block, there is no shared dictionary, so the compressed representation looks
like this (`literals` means that bytes are copied as-is, and `ref` means a
reference to a previous sequence of bytes; this is how LZ4 works, it just
replaces existing sequences of bytes with references to previous occurrences of
the same bytes, and the more references you have and the longer they are, the
better the compression ratio):
<literals:abcdefgh><ref:abcd><literals:wxyz>
Now we are going to compress document 2. It doesn't contain any repetition of
bytes, so if we wanted to compress it individually, we would just have
<literals:abcdefghijkl> which doesn't compress at all (and is even slightly
larger due to the overhead of the format). However, we are using the compressed
representation of document 1 as a dictionary, and "abcdefgh" exists in its
literals, so we can make a reference to it!
<ref:abcdefgh><literals:ijkl>
And again for document 3, using the literals of document 1 for "abcd" and the
literals of document 2 for "ijkl":
<ref:abcd><ref:ijkl><literals:mnop>
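If it helps, the same effect can be reproduced with a tiny self-contained demo. Again, this is not the patch's code: the JDK has no dictionary-aware LZ4, so the sketch falls back on java.util.zip's Deflate with a preset dictionary (raw deflate, so header bytes do not hide the effect), and the class and method names are made up for the example.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DictCompressionDemo {
  // Compress `input`, optionally priming the compressor with `dict`.
  static int compressedSize(byte[] input, byte[] dict) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate
    if (dict != null) {
      deflater.setDictionary(dict); // the previous document acts as the shared dictionary
    }
    deflater.setInput(input);
    deflater.finish();
    byte[] out = new byte[input.length + 64];
    int len = deflater.deflate(out);
    deflater.end();
    return len;
  }

  public static void main(String[] args) {
    byte[] doc1 = "abcdefghabcdwxyz".getBytes(StandardCharsets.UTF_8);
    byte[] doc2 = "abcdefghijkl".getBytes(StandardCharsets.UTF_8);
    // Alone, doc2 has no internal repetition to exploit; with doc1 as a preset
    // dictionary, the leading "abcdefgh" can be replaced by a back-reference.
    System.out.println("doc2 alone:     " + compressedSize(doc2, null) + " bytes");
    System.out.println("doc2 with dict: " + compressedSize(doc2, doc1) + " bytes");
  }
}
{code}
One difference worth keeping in mind: Deflate's preset dictionary has to be the uncompressed bytes of the previous documents, whereas the patch can use their compressed representation directly because LZ4 stores literals verbatim.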
> More options for stored fields compression
> ------------------------------------------
>
> Key: LUCENE-5914
> URL: https://issues.apache.org/jira/browse/LUCENE-5914
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 4.11
>
> Attachments: LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have gotten about as
> many users complaining that compression was too aggressive as complaining that it
> was too light.
> I think it is due to the fact that we have users who are doing very
> different things with Lucene. For example, if you have a small index that fits
> in the filesystem cache (or close to it), then you might never pay for actual
> disk seeks, and in such a case the fact that the current stored fields format
> needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like
> log analytics, and in that case you have huge amounts of data for which you
> don't care much about stored fields performance. However, it is very
> frustrating to notice that the data you store takes several times less
> space when you gzip it than it does in your index, although Lucene claims to
> compress stored fields.
> For that reason, I think it would be nice to have some kind of option that
> would allow trading speed for compression in the default codec.