[ https://issues.apache.org/jira/browse/IMPALA-12108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013539#comment-18013539 ]

ASF subversion and git services commented on IMPALA-12108:
----------------------------------------------------------

Commit 7477107ca370ef4b0709855522f8d955e47edbec in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7477107ca ]

IMPALA-12108: Add support for LZ4 high compression levels

LZ4 has a high compression mode that gets higher compression ratios
(at the cost of higher compression time) while maintaining the fast
decompression speed. This type of compression would be useful for
workloads that write data once and read it many times.

This adds support for specifying a compression level for the
LZ4 codec. Compression level 1 is the current fast API. Compression
levels between LZ4HC_CLEVEL_MIN (3) and LZ4HC_CLEVEL_MAX (12) use
the high compression API. This matches the behavior of the lz4
command-line tool.
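
The level-to-API mapping described above can be sketched as follows. This
is only an illustration, not the code from the commit: the helper
select_lz4_api and the Lz4Api enum are hypothetical names, and rejecting
levels outside 1 and 3..12 is an assumption, since the commit message
does not say how such levels (for example 2) are handled.

```python
from enum import Enum

# Values quoted in the commit message.
LZ4_FAST_LEVEL = 1      # level 1 is "the current fast API"
LZ4HC_CLEVEL_MIN = 3    # lowest high compression level
LZ4HC_CLEVEL_MAX = 12   # highest high compression level


class Lz4Api(Enum):
    """Hypothetical label for which LZ4 API a level selects."""
    FAST = "fast"
    HIGH_COMPRESSION = "high_compression"


def select_lz4_api(level: int) -> Lz4Api:
    """Hypothetical helper: map a compression level to an LZ4 API."""
    if level == LZ4_FAST_LEVEL:
        return Lz4Api.FAST
    if LZ4HC_CLEVEL_MIN <= level <= LZ4HC_CLEVEL_MAX:
        return Lz4Api.HIGH_COMPRESSION
    # Assumption: anything else is treated as invalid in this sketch.
    raise ValueError(f"unsupported LZ4 compression level: {level}")
```

For example, select_lz4_api(1) selects the fast API, while
select_lz4_api(3) and select_lz4_api(12) select the high compression API.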

TPC-H 42 scale comparison
Compression codec | Avg Time (s) | Geomean Time (s) | Lineitem Size (GB) | Compression time for lineitem (s)
------------------+--------------+------------------+--------------------+----------------------------------
Snappy            | 2.75         | 2.08             | 8.76               |   7.436
LZ4 level 1       | 2.58         | 1.91             | 9.1                |   6.864
LZ4 level 3       | 2.58         | 1.93             | 7.9                |  43.918
LZ4 level 9       | 2.68         | 1.98             | 7.6                | 125.0
Zstd level 3      | 3.03         | 2.31             | 6.36               |  17.274
Zstd level 6      | 3.10         | 2.38             | 6.33               |  44.955

LZ4 level 3 is about 10% smaller in data size than Snappy (about 13%
smaller than LZ4 level 1) while queries run about as fast as with
regular LZ4. It compresses at about the same speed as Zstd level 6.
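
As a quick check of these comparisons, the ratios can be recomputed from
the lineitem columns of the table above. This is a small sketch added for
illustration: the numbers are copied from the table, and the names
(results, percent_smaller) are hypothetical.

```python
# (lineitem size in GB, lineitem compression time in s), copied from the table.
results = {
    "Snappy": (8.76, 7.436),
    "LZ4 level 1": (9.1, 6.864),
    "LZ4 level 3": (7.9, 43.918),
    "Zstd level 6": (6.33, 44.955),
}


def percent_smaller(codec: str, baseline: str) -> float:
    """Size reduction of `codec` relative to `baseline`, in percent."""
    size = results[codec][0]
    base = results[baseline][0]
    return 100.0 * (base - size) / base


vs_snappy = percent_smaller("LZ4 level 3", "Snappy")            # about 9.8
vs_lz4_level_1 = percent_smaller("LZ4 level 3", "LZ4 level 1")  # about 13.2

# Compression time of LZ4 level 3 relative to Zstd level 6 (about 0.98).
time_ratio = results["LZ4 level 3"][1] / results["Zstd level 6"][1]

print(f"{vs_snappy:.1f}% smaller than Snappy")
print(f"{vs_lz4_level_1:.1f}% smaller than LZ4 level 1")
print(f"{time_ratio:.2f}x the compression time of Zstd level 6")
```

So the size reduction is roughly 10% measured against Snappy and roughly
13% measured against LZ4 level 1.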

Testing:
 - Ran perf-AB-test with lz4 high compression levels
 - Added test cases to decompress-test

Change-Id: Ie7470ce38b8710c870cacebc80bc02cf5d022791
Reviewed-on: http://gerrit.cloudera.org:8080/23254
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Add support for writing data with LZ4's high compression mode
> -------------------------------------------------------------
>
>                 Key: IMPALA-12108
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12108
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>              Labels: ramp-up
>
> LZ4 has a high compression mode that gets higher compression ratios than 
> Snappy while maintaining high decompression speeds. The tradeoff is that 
> compression is very slow. We should add support for writing data with LZ4 
> high compression mode. This would let us get a sense of the performance for 
> writing and reading.
> See this benchmark on the LZ4 page:
> https://github.com/lz4/lz4#benchmarks
> In my hand tests, Parquet/LZ4 is about 13% smaller than Parquet/Snappy, but 
> it retains the fast decompression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
