German Florez-Larrahondo created HADOOP-9785:
------------------------------------------------

             Summary: LZ4 code may need upgrade (lz4.c embedded in libHadoop is r43 18 months ago, while latest version is r98)
                 Key: HADOOP-9785
                 URL: https://issues.apache.org/jira/browse/HADOOP-9785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io, native
    Affects Versions: 2.0.4-alpha, 3.0.0
         Environment: [german@localhost lz4-read-only]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              10
CPU MHz:               2667.000
BogoMIPS:              5319.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              2048K
NUMA node0 CPU(s):     0-3
[german@localhost lz4-read-only]$ uname -r
2.6.32-358.14.1.el6.x86_64
            Reporter: German Florez-Larrahondo
            Priority: Minor
             Fix For: 3.0.0, 2.0.4-alpha


While analyzing the compression performance of different Hadoop codecs, I noticed that the LZ4 code was taken from revision 43 of https://code.google.com/p/lz4/. The latest version is r98, and there may be extra performance benefits we can gain from using it.

We may want to involve the original LZ4 author, Yann Collet, in these discussions, as the current LZ4 code includes additional algorithms and parameters.

To start the investigation, I ran preliminary experiments with the Silesia corpus, and there seems to be an improvement in throughput for both compression and decompression in the latest release compared with r43 (I haven't done enough analysis to conclude anything statistically, but it looks good).
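
To put rough numbers on that impression, the throughput figures from the raw output in this issue can be reduced to per-file speedup ratios (r98 divided by r43). The following is only a minimal sketch: the names `R43`, `R98` and `speedups` are made up for illustration, the values are copied by hand from the logs, and the r98 compression figure for nci is printed as "584.9 MB" in the log, so MB/s is assumed there. Note also that the two runs were produced by different tools (the banners read "Compression CLI" for r43 and "Full LZ4 speed analyzer" for r98), so they may not measure exactly the same thing and the ratios should be read as indicative only.

```python
# Throughput in MB/s, copied by hand from the raw output in this issue.
# Each entry maps a Silesia file to (compress MB/s, decompress MB/s).
R43 = {
    "dickens": (138.86, 486.01),
    "mozilla": (195.39, 407.06),
    "mr": (237.72, 475.43),
    "nci": (399.99, 533.32),
}
R98 = {
    "dickens": (172.3, 676.0),
    "mozilla": (281.7, 1003.1),
    "mr": (268.3, 788.7),
    # The r98 log prints "584.9 MB" for nci compression; MB/s is assumed.
    "nci": (584.9, 1208.3),
}


def speedups(old, new):
    """Return {file: (compress ratio, decompress ratio)}, computed as new/old."""
    return {
        name: (new[name][0] / old[name][0], new[name][1] / old[name][1])
        for name in old
    }


if __name__ == "__main__":
    for name, (comp, decomp) in speedups(R43, R98).items():
        print(f"{name:8s} compress x{comp:.2f}  decompress x{decomp:.2f}")
```

With only four files and a single run each, these ratios say nothing about variance; repeated runs with the same harness for both revisions would be needed before drawing conclusions.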

Here is the raw output using LZ4 from r43 with a SUBSET of the Silesia corpus (http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia):

File: silesia/dickens
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 10192446 bytes into 6433123 bytes ==> 63.12%
Done in 0.07 s ==> 138.86 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 10192446 bytes
Done in 0.02 s ==> 486.01 MB/s

File: silesia/mozilla
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 51220480 bytes into 26379814 bytes ==> 51.50%
Done in 0.25 s ==> 195.39 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 51220480 bytes
Done in 0.12 s ==> 407.06 MB/s

File: silesia/mr
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 9970564 bytes into 5669268 bytes ==> 56.86%
Done in 0.04 s ==> 237.72 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 9970564 bytes
Done in 0.02 s ==> 475.43 MB/s

File: silesia/nci
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 33553445 bytes into 5880292 bytes ==> 17.53%
Done in 0.08 s ==> 399.99 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 33553445 bytes
Done in 0.06 s ==> 533.32 MB/s

And here is the raw output of LZ4 from the latest release, r98:

File: silesia/dickens
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/dickens...
1-LZ4_compress : 10192446 -> 6434313 (63.13%), 172.3 MB/s
LZ4_decompress_fast : 10192446 -> 676.0 MB/s

File: silesia/mozilla
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/mozilla...
1-LZ4_compress : 51220480 -> 26382113 (51.51%), 281.7 MB/s
LZ4_decompress_fast : 51220480 -> 1003.1 MB/s

File: silesia/mr
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/mr...
1-LZ4_compress : 9970564 -> 5669255 (56.86%), 268.3 MB/s
LZ4_decompress_fast : 9970564 -> 788.7 MB/s

File: silesia/nci
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/nci...
1-LZ4_compress : 33553445 -> 5883923 (17.54%), 584.9 MB
LZ4_decompress_fast : 33553445 -> 1208.3 MB/s

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira