> Though, I'm also wondering about the performance difference between
> the two. Since they both use native implementations, theoretically they
> can be close in performance.
ZlibCompressor block compression was extremely slow due to the non-JNI bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>

When I last benchmarked after that issue was fixed, 86% of CPU samples were spent inside zlib.so in the perf traces - irrespective of which mode it was used in.

The results of those profiles went into making ORC fit into Zlib better and avoid doing compression work twice - ORC already did its own versions of dictionary + RLE + bit-packing. <http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>

For instance, bit-packing 127-valued data into 7 bits and then compressing it offered less compression (& cost more CPU) than leaving it at 8 bits without reduction. LZ77 worked much better on the byte-aligned data, and the Huffman stage compressed it down by bit-packing anyway. The impact was more visible at higher bit-counts (27 bits is way worse than 32 bits).

And then there was turning off the bits of Zlib not necessary for some encoding patterns - Z_FILTERED for numeric sequences, Z_TEXT for the string dicts, etc.

Purely from a performance standpoint, I'm getting more interested in Zstd, because it brings a whole new way of fast bit-packing. <https://issues.apache.org/jira/browse/ORC-45>

Cheers,
Gopal
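[Editor's note: a minimal sketch of the two effects described above, using only java.util.zip.Deflater - it is not ORC's writer code, and the class/helper names and the synthetic data generator are made up for illustration. It deflates the same 7-bit values once byte-aligned and once bit-packed, and also tries the FILTERED strategy on the byte-aligned stream, printing the output sizes for comparison.]

```java
import java.util.Random;
import java.util.zip.Deflater;

// Illustrative only: compares DEFLATE output sizes for byte-aligned values
// vs. the same values bit-packed to 7 bits, and shows switching strategy.
public class ZlibPackingSketch {

  // Pack values (each < 128) into 7 bits apiece, LSB-first.
  static byte[] pack7(int[] values) {
    byte[] out = new byte[(values.length * 7 + 7) / 8];
    int bitPos = 0;
    for (int v : values) {
      for (int b = 0; b < 7; b++) {
        if (((v >> b) & 1) != 0) {
          out[bitPos >> 3] |= (byte) (1 << (bitPos & 7));
        }
        bitPos++;
      }
    }
    return out;
  }

  // Deflate the input with the given strategy and return the compressed size.
  static int deflatedSize(byte[] input, int strategy) {
    Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION);
    d.setStrategy(strategy);
    d.setInput(input);
    d.finish();
    byte[] buf = new byte[64 * 1024];
    int total = 0;
    while (!d.finished()) {
      total += d.deflate(buf);
    }
    d.end();
    return total;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int n = 1 << 20;
    int[] values = new int[n];
    byte[] aligned = new byte[n];
    for (int i = 0; i < n; i++) {
      // mostly-repetitive 7-bit data, so LZ77/Huffman have patterns to find
      values[i] = (rnd.nextInt(10) < 8) ? (i % 7) : rnd.nextInt(128);
      aligned[i] = (byte) values[i];
    }
    byte[] packed = pack7(values);

    System.out.println("byte-aligned, default strategy : "
        + deflatedSize(aligned, Deflater.DEFAULT_STRATEGY));
    System.out.println("7-bit packed, default strategy : "
        + deflatedSize(packed, Deflater.DEFAULT_STRATEGY));
    System.out.println("byte-aligned, FILTERED strategy: "
        + deflatedSize(aligned, Deflater.FILTERED));
  }
}
```

(java.util.zip exposes only DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY; the Z_TEXT hint mentioned above is a raw-zlib detail and is not reachable through this API.)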