[
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119918#comment-14119918
]
Adrien Grand commented on LUCENE-5914:
--------------------------------------
bq. Ideally we should enable using a biggish chunk_size during compression to
improve compression, and decompress only a single document (not depending on
chunk_size), just like you proposed here (if I figured it out correctly?)
Exactly, this is one of the two proposed options. The only overhead is that
you would need to read the shared dictionary and keep it in memory (but
that is a single call to readBytes and its size can be controlled, so it
should not be an issue).
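To make this concrete, here is a minimal sketch of what the read path could look like under those assumptions. This is only an illustration, not the patch's code: the JDK ships no dictionary-aware LZ4, so it uses java.util.zip's Deflate preset-dictionary support instead, and names such as decompressDoc, dict and compressedDoc are invented for the example.
{code:java}
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class SharedDictRead {
  // Decompress a single document against a shared dictionary that was read
  // once (a single readBytes-like call) and kept in memory for the whole block.
  static byte[] decompressDoc(byte[] dict, byte[] compressedDoc, int originalLength)
      throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressedDoc);
    byte[] restored = new byte[originalLength];
    int n = inflater.inflate(restored);
    if (n == 0 && inflater.needsDictionary()) {
      // The stream was written against a preset dictionary: provide it and retry.
      inflater.setDictionary(dict);
      inflater.inflate(restored);
    }
    inflater.end();
    return restored;
  }
}
{code}
Whatever the actual codec, the point is the same: once the dictionary is in memory, only the bytes of the requested document need to be decompressed, independently of the chunk size.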
bq. Usually, such data is highly compressible (imagine all these low-cardinality
fields like the color of something...) and even some basic compression
does the trick.
Agreed, this is why I'd prefer the "low-overhead" option to be
something cheap rather than no compression at all: data usually has lots of
patterns, and even something as simple as LZ4 manages to reach interesting
compression ratios.
{quote}
Conclusion: compression is great, and anything that helps tweak this balance
(CPU effort / IO effort) smoothly across the indexing and retrieval phases makes
Lucene's use-case coverage broader. (e.g. "I want to afford more CPU during
indexing, and less CPU during retrieval", a static coder being the extreme case
of this...)
I am not sure I understood exactly if and how this patch is going to help in
such cases (how do we achieve reasonable compression if we do per-document
compression for small documents? Reusing dictionaries from previous chunks?
Static dictionaries...).
{quote}
The trade-off that this patch makes is:
- keep indexing fast enough in all cases
- allow trading random-access speed to documents for index compression
The current patch provides two options:
* either we compress documents in blocks like today but with Deflate instead
of LZ4: this provides good compression ratios but makes random access quite
slow, since you need to decompress a whole block of documents every time you
want to access a single document,
* or we still group documents into blocks but compress them individually,
using the compressed representation of the previous documents as a dictionary.
I'll try to explain the 2nd option better: it works well because LZ4 mostly
deduplicates sequences of bytes in a stream. So imagine that you have the
following 3 documents in a block:
1. abcdefghabcdwxyz
2. abcdefghijkl
3. abcdijklmnop
We will first compress document 1. Given that it is the first document in the
block, there is no shared dictionary, so the compressed representation looks
like this (`literals` means that bytes are copied as-is, and `ref` means a
reference to a previous sequence of bytes; this is how LZ4 works, it just
replaces existing sequences of bytes with references to previous occurrences of
the same bytes, and the more references you have and the longer they are, the
better the compression ratio):
<literals:abcdefgh><ref:abcd><literals:wxyz>
Now we are going to compress document 2. It doesn't contain any repetition of
bytes, so if we wanted to compress it individually, we would just have
<literals:abcdefghijkl> which doesn't compress at all (and is even slightly
larger due to the overhead of the format). However, we are using the compressed
representation of document 1 as a dictionary, and "abcdefgh" exists in its
literals, so we can make a reference to it!
<ref:abcdefgh><literals:ijkl>
And again for document 3, using the literals of document 1 for "abcd" and the
literals of document 2 for "ijkl":
<ref:abcd><ref:ijkl><literals:mnop>
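If it helps, the same effect can be reproduced with a tiny self-contained demo. Again, this is not the patch's code: the JDK has no dictionary-aware LZ4, so the sketch falls back on java.util.zip's Deflate with a preset dictionary (raw deflate, so header bytes do not hide the effect), and the class and method names are made up for the example.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DictCompressionDemo {
  // Compress `input`, optionally priming the compressor with `dict`.
  static int compressedSize(byte[] input, byte[] dict) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate
    if (dict != null) {
      deflater.setDictionary(dict); // the previous document acts as the shared dictionary
    }
    deflater.setInput(input);
    deflater.finish();
    byte[] out = new byte[input.length + 64];
    int len = deflater.deflate(out);
    deflater.end();
    return len;
  }

  public static void main(String[] args) {
    byte[] doc1 = "abcdefghabcdwxyz".getBytes(StandardCharsets.UTF_8);
    byte[] doc2 = "abcdefghijkl".getBytes(StandardCharsets.UTF_8);
    // Alone, doc2 has no internal repetition to exploit; with doc1 as a preset
    // dictionary, the leading "abcdefgh" can be replaced by a back-reference.
    System.out.println("doc2 alone:     " + compressedSize(doc2, null) + " bytes");
    System.out.println("doc2 with dict: " + compressedSize(doc2, doc1) + " bytes");
  }
}
{code}
One difference worth keeping in mind: Deflate's preset dictionary has to be the uncompressed bytes of the previous documents, whereas the patch can use their compressed representation directly because LZ4 stores literals verbatim.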
> More options for stored fields compression
> ------------------------------------------
>
> Key: LUCENE-5914
> URL: https://issues.apache.org/jira/browse/LUCENE-5914
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 4.11
>
> Attachments: LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have gotten about as
> many users complaining that compression was too aggressive as complaining that it
> was too light.
> I think it is due to the fact that we have users who are doing very
> different things with Lucene. For example, if you have a small index that fits
> in the filesystem cache (or close to it), then you might never pay for actual
> disk seeks, and in such a case the fact that the current stored fields format
> needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like
> log analytics, and in that case you have huge amounts of data for which you
> don't care much about stored fields performance. However, it is very
> frustrating to notice that the data you store takes several times less
> space when you gzip it than it does in your index, although Lucene claims to
> compress stored fields.
> For that reason, I think it would be nice to have some kind of option that
> would allow trading speed for compression in the default codec.