[jira] [Created] (LUCENE-6100) Further tuning of Lucene50Codec(BEST_COMPRESSION)

Robert Muir (JIRA) Sun, 07 Dec 2014 09:37:31 -0800

Robert Muir created LUCENE-6100:
-----------------------------------

             Summary: Further tuning of Lucene50Codec(BEST_COMPRESSION)
                 Key: LUCENE-6100
                 URL: https://issues.apache.org/jira/browse/LUCENE-6100
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Robert Muir



Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But for 
highly compressible data, the ratio for BEST_COMPRESSION is not much better 
than for BEST_SPEED, because both share the same underlying format, which is 
not optimized for this case.

The block size is currently 24576 bytes (the 32KB sliding window size minus 
8KB of "grace" to avoid going over it). We compress in a stateless manner: each 
block is its own stream, and blocks don't share a preset dictionary or anything 
else. So we have a lot of waste in many cases, since zlib has to start from 
scratch for every block, and we generally throw away 1/4 of the window and 
start over.
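
A minimal sketch of the cost of stateless blocks, using Python's zlib module rather than Lucene's own code. The log data is synthetic and the level of 6 is an arbitrary choice; both are assumptions made for illustration, so only the direction of the effect is meaningful, not the numbers.

```python
import zlib


def make_log_data(n_lines: int = 20000) -> bytes:
    """Build synthetic, highly repetitive log lines (illustrative only)."""
    lines = [
        f"2014-12-07 09:{i % 60:02d}:{(i * 7) % 60:02d} INFO request id={i} "
        f"path=/api/v1/items/{i % 97} status=200 took={i % 13}ms\n"
        for i in range(n_lines)
    ]
    return "".join(lines).encode("ascii")


def compress_stateless(data: bytes, block_size: int, level: int = 6) -> int:
    """Compress every block as its own zlib stream; return total output bytes."""
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += len(zlib.compress(block, level))
    return total


data = make_log_data()
small_blocks = compress_stateless(data, 24576)      # 32KB window minus 8KB grace
large_blocks = compress_stateless(data, 60 * 1024)  # larger block, fewer restarts
single_stream = len(zlib.compress(data, 6))         # no restarts at all

print("24576-byte blocks:", small_blocks)
print("60KB blocks:      ", large_blocks)
print("single stream:    ", single_stream)
```

Every restart discards the match history and the adapted Huffman statistics, so fewer and larger blocks should come out smaller on data like this.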

I ran some experiments with highly compressible logs data:
||method||time indexing(ms)||time merging(ms)||fdt||fdx||
|BEST_SPEED|101,729|15,638|372,845,282|406,964|
|BEST_COMPRESSION|114,364|23,474|269,387,347|275,909|
|patch (60KB)|105,533|18,914|237,284,342|117,639|

The other experiments I ran were:
||method||time indexing(ms)||time merging(ms)||fdt||fdx||
|crappy preset|130,854|38,095|234,603,971|274,500|
|64KB|107,256|21,570|236,004,297|111,135|
|crappy preset+64KB|121,503|30,030|222,422,924|110,751|
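
To compare the variants more easily, the fdt sizes from the two tables can be expressed relative to BEST_SPEED. The short sketch below only rearranges the numbers reported above; the variable names are illustrative.

```python
# fdt sizes copied from the two tables above.
fdt_bytes = {
    "BEST_SPEED": 372_845_282,
    "BEST_COMPRESSION": 269_387_347,
    "patch (60KB)": 237_284_342,
    "crappy preset": 234_603_971,
    "64KB": 236_004_297,
    "crappy preset+64KB": 222_422_924,
}

baseline = fdt_bytes["BEST_SPEED"]
relative = {name: size / baseline for name, size in fdt_bytes.items()}

for name, ratio in relative.items():
    print(f"{name:<20} {ratio:6.1%} of BEST_SPEED")
```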

For 'crappy preset' I just use the first 32KB of the original data, chosen 
arbitrarily, as a preset dictionary for every block. This is effective, but 
slow because of some unnecessary overhead (like recomputing adler32 of the 
preset dict for each block). However, this overhead shrinks with larger block 
sizes, and the dictionary still offers benefits, so maybe in the future we can 
do it (especially if, e.g., it's per-chunk and we can bulk merge chunks without 
recompressing, etc).
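
The preset-dictionary idea can be sketched with Python's zlib module, which exposes the same zlib feature through the `zdict` argument. This is not the patch itself: the synthetic log data, the level of 6, and the helper names are assumptions made for illustration.

```python
import zlib


def make_log_data(n_lines: int = 20000) -> bytes:
    """Build synthetic, highly repetitive log lines (illustrative only)."""
    lines = [
        f"2014-12-07 09:{i % 60:02d}:{(i * 7) % 60:02d} INFO request id={i} "
        f"path=/api/v1/items/{i % 97} status=200 took={i % 13}ms\n"
        for i in range(n_lines)
    ]
    return "".join(lines).encode("ascii")


def compress_block(block: bytes, zdict: bytes = b"", level: int = 6) -> bytes:
    """Compress one block as its own stream, optionally primed with a dictionary."""
    if zdict:
        comp = zlib.compressobj(level=level, zdict=zdict)
    else:
        comp = zlib.compressobj(level=level)
    return comp.compress(block) + comp.flush()


def decompress_block(payload: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress_block; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(payload) + decomp.flush()


data = make_log_data()
block_size = 24576
preset = data[:32 * 1024]  # arbitrary first 32KB, reused for every block

blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
plain_total = sum(len(compress_block(b)) for b in blocks)
preset_total = sum(len(compress_block(b, preset)) for b in blocks)

# Round-trip one block to confirm the dictionary is applied consistently.
restored = decompress_block(compress_block(blocks[1], preset), preset)

print("no dictionary:    ", plain_total)
print("preset dictionary:", preset_total)
```

The dictionary is handed to zlib again for each block, which is where the repeated per-block setup cost mentioned above comes from.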

For 64KB, we measure removing the "grace" completely, so it spills over into 
another block each time. The proposed smaller "grace" amount still offers CPU 
savings, so I think we should keep it. But it's not terrible if you go over.
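
The role of the "grace" margin can be illustrated with a small simulation. It assumes a writer that buffers whole documents and flushes a chunk once the buffered bytes reach a threshold; the document sizes are made up, and the 32KB window with 8KB grace are the current numbers quoted earlier, not those of the patch.

```python
import random

WINDOW = 32 * 1024  # sliding window size in bytes
GRACE = 8 * 1024    # margin subtracted from the flush threshold


def chunk_sizes(doc_sizes, threshold):
    """Buffer whole documents; flush once the buffered bytes reach threshold."""
    chunks, buffered = [], 0
    for size in doc_sizes:
        buffered += size
        if buffered >= threshold:
            chunks.append(buffered)
            buffered = 0
    if buffered:
        chunks.append(buffered)
    return chunks


def count_over_window(chunks):
    """Number of chunks that ended up larger than the sliding window."""
    return sum(1 for c in chunks if c > WINDOW)


random.seed(0)
docs = [random.randint(200, 4000) for _ in range(10000)]  # hypothetical sizes

with_grace = chunk_sizes(docs, WINDOW - GRACE)
without_grace = chunk_sizes(docs, WINDOW)

print("chunks over window, with grace:   ", count_over_window(with_grace))
print("chunks over window, without grace:", count_over_window(without_grace))
```

With documents of at most 4000 bytes, a threshold of 24576 can never produce a chunk above 28575 bytes, so the margin keeps every chunk inside the window. Without it, almost every flushed chunk overshoots.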




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

