Hi,

Great! Thanks to Rao and Tatu :) I will test them and let you know what I find.

Regards,
Cao Jiguang
-------------------------------------------------------------
From: Tatu Saloranta
Date: 2010-04-02 01:08:52
To: u...@cassandra.apache.org
Cc:
Subject: Re: compression

On Thu, Apr 1, 2010 at 8:27 AM, Rao Venugopal <ven...@gmail.com> wrote:
> To Cao Jiguang
>
> I was watching this presentation on Bigtable yesterday:
> http://video.google.com/videoplay?docid=7278544055668715642#
>
> Jeff mentioned that they compared three different compression libraries:
> BMDiff, LZO, and gzip. Apparently gzip was the most CPU-intensive, and they
> ended up going with BMDiff.
> I didn't find any open-source/free implementation of BMDiff, but I did find
> LZO:
> http://www.oberhumer.com/opensource/lzo/

Another IMO good alternative is LZF -- it has characteristics similar to LZO's. Gzip (i.e. deflate) is a two-phase compressor: the usual Lempel-Ziv pass first, then Huffman coding (the oldest statistical encoding). LZO, LZF, and most other newer, simpler (but less thoroughly compressing) variants usually do only the Lempel-Ziv pass.

Why LZF? Because there are simple, free and open Java implementations: H2 has a codec, I ported it to Voldemort, and I think there was talk of generalizing the one from H2 as a stand-alone codec for reuse. Possibly others have ported it for other libraries/frameworks too (there were multiple JIRA issues for adding some of these to Hadoop).

The block format itself is simple, and adjacent blocks can be decoded independently: encoded blocks can be skipped without decoding them, which allows some level of random access (seek to a block, decode it, then access data inside the block).

Performance-wise, the simpler codecs are fast enough to add less overhead than even the fastest parsing of textual formats (JSON, XML); more importantly, they are MUCH faster at writing (once again, not much more overhead than the format encoding itself). It is compression speed that really kills gzip, especially since it is often the server that has to do it, in the common small-request/large-response case.

-+ Tatu +-
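The two-phase point above (Lempel-Ziv matching, then Huffman coding) is exactly what the JDK's own deflate codec does. A minimal sketch of a deflate round trip using the standard java.util.zip.Deflater/Inflater classes; the class name DeflateDemo and the sample text are mine, not from the thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {

    // Compress with deflate (LZ77 pass + Huffman pass), then decompress,
    // and report whether the round trip both shrank the data and restored it.
    static boolean roundTrip(byte[] input) throws Exception {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] comp = new byte[input.length + 64]; // headroom for incompressible input
        int clen = deflater.deflate(comp);
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(comp, 0, clen);
        byte[] out = new byte[input.length];
        int dlen = inflater.inflate(out);
        inflater.end();

        return clen < input.length && dlen == input.length && Arrays.equals(out, input);
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "small requests, large responses, small requests, large responses"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrip(sample)); // repetitive text compresses well → true
    }
}
```

LZ-only codecs such as LZF skip the second (Huffman) phase, which is a big part of why they encode so much faster at the cost of a worse compression ratio.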
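Tatu's block-skipping idea can be sketched with a length-prefixed container: since each block records its encoded length, a reader can hop over blocks it does not need without decompressing them. The [4-byte length][payload] framing and the BlockSkip class below are illustrative assumptions, not the real LZF container format, and the JDK's deflate stands in for the codec since no LZF implementation ships with the JDK:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockSkip {

    // Compress one block (deflate used as a stand-in codec here).
    static byte[] compress(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        byte[] buf = new byte[raw.length + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    // Frame each compressed block as [4-byte big-endian length][payload].
    static byte[] writeBlocks(byte[][] rawBlocks) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] raw : rawBlocks) {
            byte[] comp = compress(raw);
            out.write(ByteBuffer.allocate(4).putInt(comp.length).array());
            out.write(comp);
        }
        return out.toByteArray();
    }

    // Random access: skip earlier blocks using only their length prefixes,
    // then decode just the requested block.
    static byte[] readBlock(byte[] container, int index, int maxRaw) throws Exception {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            int len = ByteBuffer.wrap(container, pos, 4).getInt();
            pos += 4 + len; // skip without decoding
        }
        int len = ByteBuffer.wrap(container, pos, 4).getInt();
        Inflater inf = new Inflater();
        inf.setInput(container, pos + 4, len);
        byte[] out = new byte[maxRaw];
        int n = inf.inflate(out);
        inf.end();
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) throws Exception {
        byte[][] blocks = { "alpha".getBytes(), "bravo bravo".getBytes(), "charlie".getBytes() };
        byte[] container = writeBlocks(blocks);
        // Decode only block 1; blocks 0 and 2 are never decompressed.
        System.out.println(new String(readBlock(container, 1, 64))); // prints "bravo bravo"
    }
}
```

This is the "access random block, decode it, access something inside the block" pattern from the email: per-block decode cost instead of decompressing the whole stream up to the point of interest.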