On Wednesday, 28 August 2019 12:57:33 UTC+2, Nigel Tao wrote:
>
> On Wed, Aug 28, 2019 at 7:11 PM Klaus Post <klau...@gmail.com> wrote:
> > TLDR; LZ4 is typically between the default and "better" mode of s2.
>
> Nice!
>
> Just a suggestion: rename "better" to either "betterSize / smaller"
> (i.e. better compression ratio, worse throughput) or "betterSpeed /
> faster", otherwise it's not immediately obvious on which axis "better"
> is better in. (Or if it's better in both axes, why not just always
> turn it on).
>
> Also, from
> https://github.com/klauspost/compress/tree/master/s2#format-extensions:
>
> > Format Extensions
> > ...
> > Framed compressed blocks can be up to 4MB (up from 64KB).
>
> Do you know how much of the size or speed gains come from bumping 64K
> to 4M? Put another way, do you still see good size/speed gains if you
> do everything else other than bump this number? One of the original
> design goals was to limit the amount of memory needed for
> decompressing an arbitrary snappy stream.
>
Then it depends more on the content. Machine-generated content in particular suffers a lot. Here are the compression size and speed with S2 restricted to a 64KB block size:

| file                        | out    | level | insize      | outsize    | millis | mb/s    |
|-----------------------------|--------|-------|-------------|------------|--------|---------|
| rawstudio-mint14.tar        | snappy | 1     | 8558382592  | 4796307031 | 17814  | 458.17  |
| rawstudio-mint14.tar        | s2     | 1     | 8558382592  | 4741757590 | 13157  | 620.3   |
| rawstudio-mint14.tar        | s2     | 2     | 8558382592  | 4568241215 | 25962  | 314.37  |
| github-june-2days-2019.json | snappy | 1     | 6273951764  | 1525176492 | 12085  | 495.07  |
| github-june-2days-2019.json | s2     | 1     | 6273951764  | 1416959675 | 8496   | 704.17  |
| github-june-2days-2019.json | s2     | 2     | 6273951764  | 1344106237 | 13165  | 454.45  |
| github-ranks-backup.bin     | snappy | 1     | 1862623243  | 589837541  | 3785   | 469.21  |
| github-ranks-backup.bin     | s2     | 1     | 1862623243  | 637075877  | 2869   | 619.01  |
| github-ranks-backup.bin     | s2     | 2     | 1862623243  | 579284488  | 4411   | 402.61  |
| consensus.db.10gb           | snappy | 1     | 10737418240 | 5412897703 | 22273  | 459.75  |
| consensus.db.10gb           | s2     | 1     | 10737418240 | 5303306037 | 15025  | 681.5   |
| consensus.db.10gb           | s2     | 2     | 10737418240 | 5377169131 | 71860  | 142.5   |
| adresser.json               | snappy | 1     | 7983034785  | 809697891  | 7997   | 951.91  |
| adresser.json               | s2     | 1     | 7983034785  | 568975289  | 4944   | 1539.86 |
| adresser.json               | s2     | 2     | 7983034785  | 527205659  | 9675   | 786.88  |
| gob-stream                  | snappy | 1     | 1911399616  | 457984585  | 3668   | 496.85  |
| gob-stream                  | s2     | 1     | 1911399616  | 440972884  | 2528   | 720.91  |
| gob-stream                  | s2     | 2     | 1911399616  | 426735950  | 3982   | 457.67  |
| 10gb.tar                    | snappy | 1     | 10065157632 | 6056946612 | 23905  | 401.54  |
| 10gb.tar                    | s2     | 1     | 10065157632 | 6170962298 | 22430  | 427.95  |
| 10gb.tar                    | s2     | 2     | 10065157632 | 5775224369 | 31923  | 300.69  |
| sharnd.out.2gb              | snappy | 1     | 2147483647  | 2147745801 | 327    | 6261.61 |
| sharnd.out.2gb              | s2     | 1     | 2147483647  | 2147745801 | 356    | 5751.51 |
| sharnd.out.2gb              | s2     | 2     | 2147483647  | 2147745801 | 485    | 4221.73 |
| enwik9                      | snappy | 1     | 1000000000  | 508028601  | 4098   | 232.66  |
| enwik9                      | s2     | 1     | 1000000000  | 523651446  | 2736   | 348.49  |
| enwik9                      | s2     | 2     | 1000000000  | 461872591  | 4689   | 203.38  |
| silesia.tar                 | snappy | 1     | 211947520   | 103008711  | 634    | 318.74  |
| silesia.tar                 | s2     | 1     | 211947520   | 103019781  | 463    | 436.47  |
| silesia.tar                 | s2     | 2     | 211947520   | 95522535   | 792    | 255.16  |

S2, GOMAXPROCS=1, Snappy no asm. S2 is faster in 9/10 cases, and its compression is better in 7/10. Using 'better' always gives better compression than Snappy, but it is also slower.

> https://github.com/google/snappy/blob/master/framing_format.txt says:
> "the uncompressed data in a chunk must be no longer than 65536 bytes.
> This allows consumers to easily use small fixed-size buffers".
>

Yeah. The size impact is quite large, so IMO it doesn't make sense to have that restriction by default. But it is configurable in S2, so you can just set it to whatever you want.

There is really only one place in the decoder where it affects allocations: the allocation of the input buffer. That could easily be adjusted to allocate space only as it is needed, at the cost of a few more allocations. I considered adding the max block size to the stream, but eventually dropped it since it had so little practical impact and I wanted to keep the changes to an absolute minimum.
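To illustrate the "allocate only as it needs it" idea for the input buffer, here is a minimal sketch. It is not the S2 decoder's code: `lazyBuf`, `readChunk` and `zeroReader` are hypothetical names made up for this example, and the 4MB limit is simply the maximum framed block size mentioned earlier in the thread.

```go
package main

import (
	"fmt"
	"io"
)

// maxBlockSize is the largest chunk this sketch accepts
// (4MB, assumed here to mirror the S2 framing limit).
const maxBlockSize = 4 << 20

// lazyBuf is a hypothetical input buffer that starts empty and is
// reallocated only when a chunk exceeds its current capacity.
type lazyBuf struct {
	buf    []byte
	allocs int // number of allocations so far, for illustration only
}

// readChunk reads exactly n bytes from r into the buffer,
// growing the buffer first if it is too small.
func (b *lazyBuf) readChunk(r io.Reader, n int) ([]byte, error) {
	if n < 0 || n > maxBlockSize {
		return nil, fmt.Errorf("chunk size %d out of range", n)
	}
	if n > cap(b.buf) {
		b.buf = make([]byte, n)
		b.allocs++
	}
	chunk := b.buf[:n]
	if _, err := io.ReadFull(r, chunk); err != nil {
		return nil, err
	}
	return chunk, nil
}

// zeroReader is a stand-in input that yields an endless stream of zero bytes.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

func main() {
	var b lazyBuf
	// Chunk sizes as they might appear in a stream written with small blocks.
	for _, n := range []int{64 << 10, 32 << 10, 100 << 10} {
		chunk, err := b.readChunk(zeroReader{}, n)
		if err != nil {
			panic(err)
		}
		fmt.Printf("read %d bytes, capacity %d, allocations %d\n",
			len(chunk), cap(b.buf), b.allocs)
	}
}
```

With this approach a stream that only ever contains 64KB blocks never pays for a 4MB buffer; the price is an extra allocation each time a larger chunk shows up.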
In most environments allocating 4MB isn't a big deal for reading a stream, especially since the decoder can be reused. Except for maybe embedded systems I don't see this coming much into play.

Blocks already allocate the full space needed for ~2x the input, so there is no difference to Snappy there - except that the max block size overhead is smaller for S2.

/Klaus