Hello again!

I've been running various compression parameter combinations against the cod dataset.

It looks like the best compression level in terms of speed is either 1 or 2.
The default level for Zstd seems to be 3, which would almost always perform worse.
For best performance a dictionary of 1024 bytes is optimal; for better compression
one might choose a larger dictionary. 6k looks good, but I will also run a
few benchmarks on larger dicts. Unfortunately, Zstd crashes if the sample size
is set to more than 16k entries (I guess I should probe for the buffer size
at which the problems begin).
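One way to probe for that limit is a plain binary search over the sample size. The sketch below is only an illustration: `first_failing_size` and `fake_training_run` are hypothetical names, the 16_384 threshold is made up, and it assumes that failures are monotonic (once a size fails, every larger size fails too). If the crash takes the whole process down, each probe would have to run in a child process instead of a simple function call.

```python
from typing import Callable


def first_failing_size(works: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which works(n) is False.

    Assumes works(lo) is True, works(hi) is False, and that failures are
    monotonic: once a size fails, every larger size fails as well.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid   # mid still works, the limit is above it
        else:
            hi = mid   # mid fails, the limit is at or below it
    return hi


# Stand-in for a real dictionary training run; the limit here is invented.
def fake_training_run(sample_count: int) -> bool:
    return sample_count <= 16_384


print(first_failing_size(fake_training_run, lo=1_024, hi=65_536))
```

With the invented limit above, the search needs 16 probes to narrow 1_024..65_536 down to a single value.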

I'm attaching two charts which show what we've got. Compression rate is the
compressed size as a fraction of the original record size. Time to run is the
wall-clock time of the test run. Reasonable compression will increase the run
time twofold (for a program that only parses text records -> creates objects ->
binarylizes them -> compresses -> decompresses). Notation: s{number of bin
objects used to train}-d{dictionary length in bytes}-l{compression level}.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png> 
Second one is basically a zoom in on the first.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png> 
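For anyone post-processing the results, the label notation and the compression rate definition above can be sketched as follows. The names `RunLabel`, `parse_label` and `compression_rate` are hypothetical, and the example label is made up; I am assuming the numbers in the labels are written as plain integers.

```python
import re
from typing import NamedTuple


class RunLabel(NamedTuple):
    samples: int     # number of binary objects used to train the dictionary
    dict_bytes: int  # dictionary length in bytes
    level: int       # compression level


_LABEL = re.compile(r"s(\d+)-d(\d+)-l(\d+)")


def parse_label(label: str) -> RunLabel:
    """Split an s{samples}-d{dict bytes}-l{level} label into its parts."""
    m = _LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"not an s-d-l label: {label!r}")
    return RunLabel(*(int(g) for g in m.groups()))


def compression_rate(original_bytes: int, compressed_bytes: int) -> float:
    """Compressed size as a fraction of the original size (lower is better)."""
    return compressed_bytes / original_bytes


print(parse_label("s16384-d1024-l2"))  # made-up label, for illustration only
print(compression_rate(1000, 650))
```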
I think that in addition to dictionary compression we should have
dictionary-less compression. On typical data of small records it shows
a compression rate of 0.8 ~ 0.65, but I can imagine that with larger
unstructured records it can be as good as dict-based and much less of a
hassle dictionary-processing-wise. WDYT?

Sorry for the fine print. I hope my charts will be visible.
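To illustrate the dictionary vs. dictionary-less trade-off on small records, here is a minimal sketch. Note that it uses zlib's preset dictionary (`zdict`) from the Python standard library as a stand-in, not Zstd, and the records and the hand-made dictionary are invented, so the numbers it prints say nothing about the Zstd results in the charts. It only shows the mechanism: each small record is compressed on its own, with or without shared context.

```python
import zlib


def rate(records, zdict=None, level=6):
    """Total compressed size / total original size, one record at a time."""
    original = compressed = 0
    for rec in records:
        if zdict is not None:
            c = zlib.compressobj(level, zdict=zdict)  # preset dictionary
        else:
            c = zlib.compressobj(level)               # dictionary-less
        out = c.compress(rec) + c.flush()
        original += len(rec)
        compressed += len(out)
    return compressed / original


# Invented small records that share most of their structure.
records = [
    f'{{"id":{i},"name":"user{i}","city":"Springfield","status":"active"}}'.encode()
    for i in range(200)
]

# Hand-made dictionary: just the text the records have in common.
shared = b'{"id":,"name":"user","city":"Springfield","status":"active"}'

print("without dictionary:", round(rate(records), 2))
print("with dictionary:   ", round(rate(records, zdict=shared), 2))
```

The decompressing side needs the same dictionary (`zlib.decompressobj(zdict=shared)`), which is the dictionary-processing hassle that the dictionary-less mode avoids.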

You can see the updated code as a pull request:
https://github.com/apache/ignite/pull/4673

Regards,


