Robert Muir created LUCENE-6100:
-----------------------------------
Summary: Further tuning of Lucene50Codec(BEST_COMPRESSION)
Key: LUCENE-6100
URL: https://issues.apache.org/jira/browse/LUCENE-6100
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir
Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But in
the case of highly compressible data, the ratio for BEST_COMPRESSION is not
much better than BEST_SPEED, because they share the same underlying format,
which is not optimized for this case.
Block size is currently 24576 bytes (32KB sliding window size minus 8KB "grace"
to avoid going over it). And we compress this in a stateless manner: each block
is its own stream, and they don't share a preset dictionary or anything. So we
have a lot of waste in many cases, since zlib has to reboot itself, then we
generally throw away 1/4 of the window and start over.
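To make the "stateless" point concrete, here is a minimal sketch using plain
java.util.zip rather than Lucene's own compressor classes. The class name
StatelessBlocks, its methods, and the synthetic log data are all hypothetical
and only for illustration. Each block gets a fresh Deflater, so nothing
learned from one block carries over to the next.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class, not part of Lucene.
class StatelessBlocks {

    /** Compresses every blockSize-byte slice as its own independent zlib stream. */
    static List<byte[]> compressBlocks(byte[] data, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        byte[] buf = new byte[4096];
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // Fresh Deflater per block: no history is shared between blocks.
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(data, off, len);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            deflater.end();
            blocks.add(out.toByteArray());
        }
        return blocks;
    }

    /** Decompresses each block independently and concatenates the results. */
    static byte[] decompressBlocks(List<byte[]> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (byte[] block : blocks) {
            Inflater inflater = new Inflater();
            inflater.setInput(block);
            try {
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        throw new IllegalStateException("truncated or dictionary-dependent block");
                    }
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException e) {
                throw new IllegalStateException(e);
            } finally {
                inflater.end();
            }
        }
        return out.toByteArray();
    }

    /** Sum of the compressed sizes of all blocks. */
    static long totalSize(List<byte[]> blocks) {
        long total = 0;
        for (byte[] b : blocks) {
            total += b.length;
        }
        return total;
    }

    /** Synthetic, highly repetitive log-like data (an assumption, not the data from this issue). */
    static byte[] sampleLogs(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2014-12-08 12:00:").append(i % 60)
              .append(" INFO request id=").append(i).append(" status=200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLogs(20000);
        for (int blockSize : new int[] {24576, 60 * 1024, 64 * 1024}) {
            long size = totalSize(compressBlocks(data, blockSize));
            System.out.println(blockSize + " byte blocks -> " + size + " compressed bytes");
        }
    }
}
```

Running main prints the total compressed size for each block size on the
synthetic data, which is one way to see how much is lost by restarting the
stream for every block.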
I ran some experiments with highly compressible logs data:
||method||time indexing(ms)||time merging(ms)||fdt||fdx||
|BEST_SPEED|101,729|15,638|372,845,282|406,964|
|BEST_COMPRESSION|114,364|23,474|269,387,347|275,909|
|patch (60KB)|105,533|18,914|237,284,342|117,639|
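As a quick reading aid for the fdt column, the sketch below computes each
method's stored-fields size relative to BEST_SPEED. The numbers are copied from
the table above; the class and method names are hypothetical, and the values
are assumed to be byte counts.

```java
// Hypothetical helper for reading the table, not part of Lucene.
class FdtRatios {

    /** Size of one method's .fdt relative to a baseline, as a fraction. */
    static double relativeSize(long size, long baseline) {
        return (double) size / baseline;
    }

    public static void main(String[] args) {
        long bestSpeed = 372_845_282L;        // fdt, BEST_SPEED
        long bestCompression = 269_387_347L;  // fdt, BEST_COMPRESSION
        long patch60k = 237_284_342L;         // fdt, patch (60KB)

        System.out.printf("BEST_COMPRESSION: %.1f%% of BEST_SPEED%n",
                100 * relativeSize(bestCompression, bestSpeed));
        System.out.printf("patch (60KB):     %.1f%% of BEST_SPEED%n",
                100 * relativeSize(patch60k, bestSpeed));
    }
}
```

This works out to roughly 72% of the BEST_SPEED size for BEST_COMPRESSION and
roughly 64% for the patch.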
The other experiments I ran were:
||method||time indexing(ms)||time merging(ms)||fdt||fdx||
|crappy preset|130,854|38,095|234,603,971|274,500|
|64KB|107,256|21,570|236,004,297|111,135|
|crappy preset+64KB|121,503|30,030|222,422,924|110,751|
For 'crappy preset' I just use an arbitrary first 32KB of the original data as
a preset dictionary for every block. This is effective, but slow because of
some unnecessary overhead involved (like recomputing adler32 of the preset dict
for each block). However, this overhead is reduced with larger block sizes, and
it still offers benefits, so maybe in the future we can do it (especially e.g.
if it's per-chunk and we can bulk merge chunks without recompressing, etc).
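The preset-dictionary idea can be sketched with java.util.zip as follows. This
is not the code from the experiment: the class PresetDictBlocks and its methods
are hypothetical, and it assumes the default zlib wrapper, where the inflater
signals needsDictionary() before it can decode a block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class, not part of Lucene.
class PresetDictBlocks {

    /** Uses the first 32KB (or less) of the data as the shared preset dictionary. */
    static byte[] presetFrom(byte[] data) {
        return Arrays.copyOf(data, Math.min(data.length, 32 * 1024));
    }

    /** Compresses one block as its own zlib stream; dict may be null for no preset. */
    static byte[] compress(byte[] data, int off, int len, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dict != null) {
            deflater.setDictionary(dict); // must be set before any input is deflated
        }
        deflater.setInput(data, off, len);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses one block, supplying the same dictionary when the stream asks for it. */
    static byte[] decompress(byte[] block, byte[] dict) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n > 0) {
                    out.write(buf, 0, n);
                } else if (inflater.needsDictionary()) {
                    if (dict == null) {
                        throw new IllegalStateException("block requires a preset dictionary");
                    }
                    inflater.setDictionary(dict);
                } else if (inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Synthetic log-like data (an assumption, not the data from this issue).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("GET /item/").append(i).append(" 200 12ms\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] dict = presetFrom(data);
        int off = 100_000;
        int len = 24576;
        System.out.println("no preset:   " + compress(data, off, len, null).length);
        System.out.println("with preset: " + compress(data, off, len, dict).length);
    }
}
```

Every block stays independently decodable as long as the reader has the same
dictionary bytes, which is what would let chunks be copied around without
recompression.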
For 64KB, we measure removing the "grace" completely so it spills to another
block each time. The proposed smaller "grace" amount still offers cpu savings,
so I think we should keep it. But it's not terrible if you go over.