Our tests showed Snappy output being about 2x the size of Gzip's but taking about half the time. Zstd ended up about the same size as Gzip and as fast as Snappy. That said, memory usage was way up with Zstd.
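For anyone who wants to reproduce this kind of size-versus-time comparison, here is a minimal sketch of a timing harness. It is not the setup used for the numbers above. Snappy, Zstd, and LZ4 need third-party packages, so this uses Python standard-library codecs (zlib, bz2, lzma) as stand-ins, and the helper names (make_payload, measure, run_benchmark) and the synthetic data are assumptions made up for illustration.

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs from the standard library only. These are NOT the codecs
# discussed in the thread (snappy, zstd, lz4); swap in real bindings to compare those.
CODECS = {
    "zlib-1": lambda data: zlib.compress(data, 1),  # fast, lower ratio
    "zlib-9": lambda data: zlib.compress(data, 9),  # slower, higher ratio
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def make_payload(rows=20000):
    """Build a synthetic, repetitive CSV-like byte string (hypothetical data)."""
    lines = [
        f"{i},user_{i % 97},{(i * 31) % 1000},event_{i % 7}\n" for i in range(rows)
    ]
    return "".join(lines).encode("utf-8")


def measure(name, compress, payload):
    """Time one compression call and report output size and ratio."""
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    return {
        "codec": name,
        "seconds": elapsed,
        "size": len(compressed),
        "ratio": len(payload) / len(compressed),
    }


def run_benchmark(payload):
    """Run every stand-in codec once over the same payload."""
    return [measure(name, fn, payload) for name, fn in CODECS.items()]


if __name__ == "__main__":
    for row in run_benchmark(make_payload()):
        print(
            f"{row['codec']:8s} {row['seconds']:.4f}s "
            f"size={row['size']} ratio={row['ratio']:.2f}"
        )
```

A single run on synthetic data is noisy and does not capture memory usage, so treat the output as a rough indication only and repeat the measurement on real table data before choosing a codec.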
> On Jul 1, 2021, at 4:16 PM, Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
>
> Will include Zstd as well, thank you.
> However, we are also interested in compression speed, not just ratio.
>
> On Thu, Jul 1, 2021 at 2:01 PM Ryan Blue <b...@tabular.io> wrote:
> You should probably try Zstd while you're at it. We had great results with
> Zstd as well. My conclusion was that Zstd is probably the right choice when
> you want higher compression ratios and LZ4 was the right choice when you
> didn't need great compression but wanted fast compression and decompression
> speeds. Zstd pretty much replaces gzip and LZ4 replaces snappy.
>
> On Thu, Jul 1, 2021 at 1:59 PM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
> Slick, thanks @Ryan Blue. We will add LZ4 to our mix
> and report back if we find anything different.
>
> On Thu, Jul 1, 2021 at 1:50 PM Ryan Blue <b...@tabular.io> wrote:
> The default should probably be LZ4. In our testing, LZ4 beat snappy for every
> dataset for read time, write time, and compression ratio. I believe it also
> typically got a better compression ratio than gzip. Gzip was the previous
> default because it does a better job on compression ratio than snappy.
>
> Ryan
>
> On Thu, Jul 1, 2021 at 1:48 PM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
> Hello Iceberg devs!
>
> Do any of you folks use Parquet + Snappy as the underlying file format?
> Iceberg configures this by default as Parquet + gzip
> (write.parquet.compression-codec).
> Is there any specific reason for this choice?
>
> In our preliminary tests we found better numbers with Parquet + Snappy than
> with gzip.
> Operation = compress and write to local disk
> File Size = 524.3MB (about the same with both the compression codecs)
> row group size = 64 MB
>
> gzip     snappy
> 8.304    5.478
>
> We are still in the process of our full benchmarking (for reads), but want
> to understand if there is a whole different angle to this that we are not
> thinking through.
>
> Truly appreciate any inputs,
> Sreeram
>
> --
> Ryan Blue
> Tabular