Hello again! I've been running various compression parameters through the cod dataset.

It looks like the best compression level in terms of speed is either 1 or 2. The default for Zstd seems to be 3, which almost always performs worse. For best performance a dictionary of 1024 bytes is optimal; for better compression one might choose larger dictionaries. 6k looks good, but I will also run a few benchmarks on larger dicts. Unfortunately, Zstd crashes if the sample size is set to more than 16k entries (I should probably probe the maximum buffer size at which problems begin).

I'm attaching two charts which show what we've got. Compression rate is the compressed size as a fraction of the original record size. Time to run is the wall-clock time of the test run. Reasonable compression will roughly double the run time of a program that only parses text records -> creates objects -> serializes them to binary -> compresses -> decompresses.

Notation: s{number of binary objects used for training}-d{dictionary length in bytes}-l{compression level}.

<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png>

The second chart is basically a zoom-in on the first.

<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png>

I think that in addition to dictionary compression we should also have dictionary-less compression. On typical data of small records it shows a compression rate of 0.8 to 0.65, but I can imagine that with larger unstructured records it could be as good as dictionary-based compression and much less of a hassle dictionary-processing-wise. WDYT?

Sorry for the fine print. I hope my charts will be visible.

You can see the updated code as a pull request: https://github.com/apache/ignite/pull/4673

Regards,

--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
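P.S. To make the dictionary vs dictionary-less trade-off concrete, here is a minimal sketch. It is not the Ignite/Zstd code from the PR; it uses Python's stdlib zlib, whose preset-dictionary API is analogous (the main difference is that Zstd's trainer builds the dictionary from sample records, while zlib takes the shared bytes verbatim). The record layout and dictionary contents below are made up for illustration.

```python
import zlib

# Byte sequences common to many small records. With Zstd this would be
# produced by the dictionary trainer from the s{...} sample objects;
# zlib simply uses the shared bytes as-is.
dictionary = b'{"user":"","action":"","timestamp":""}'

record = b'{"user":"alice","action":"login","timestamp":"2018-09-07"}'

def compress(data, level=1, zdict=None):
    # Level 1-2 was the speed sweet spot in the benchmarks above.
    c = zlib.compressobj(level=level, zdict=zdict) if zdict \
        else zlib.compressobj(level=level)
    return c.compress(data) + c.flush()

def decompress(data, zdict=None):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(data) + d.flush()

plain = compress(record)                     # dictionary-less
dicted = compress(record, zdict=dictionary)  # dictionary-based

# Both round-trip; decompression needs the same dictionary.
assert decompress(plain) == record
assert decompress(dicted, zdict=dictionary) == record
```

For tiny records the shared dictionary usually wins because each record carries too little internal redundancy on its own; the gap should shrink for larger unstructured records, which is the case where dictionary-less compression becomes attractive.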
It looks like the best compression level in terms of speed is either 1 or 2. The default for Zstd seems to be 3 which would almost always perform worse. For best performance a dictionary of 1024 is optimal, for better compression one might choose larger dictionaries, 6k looks good but I will also run a few benchmarks on larger dicts. Unfortunately, Zstd crashes if sample size is set to more than 16k entries (I guess I should probe the max buffer size where problems begin). I'm attaching two charts which show what's we've got. Compression rate is a fraction of original records size. Time to run is wall clock time the test run. Reasonable compression will increase the run time twofold (of a program that only does text record parsing -> creates objects -> binarylizes them -> compresses -> decompresses). Notation: s{number of bin objects used to train}-d{dictionary length in bytes}-l{compression level}. <http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png> Second one is basically a zoom in on the first. <http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png> I think that in additional to dictionary compression we should have dictionary-less compression. On typical data of small records it shows compression rate of 0.8 ~ 0.65, but I can imagine that with larger unstructured records it can be as good as dict-based and much less of a hassle dictionary-processing-wise. WDYT? Sorry for the fine prints. I hope my charts will visible. You can see the updated code as pull request: https://github.com/apache/ignite/pull/4673 Regards, -- Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/