On Thu, Jul 14, 2016 at 6:16 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> How about some really random data?
>
> py> import string
> py> data = ''.join(random.choice(string.ascii_letters) for i in range(21000))
> py> len(codecs.encode(data, 'bz2'))
> 15220
>
> That's actually better than I expected: it's found some redundancy and
> saved about a quarter of the space.
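A quick back-of-envelope check (my own, not in the quoted post): with only 52
of the 256 possible byte values ever appearing, each byte carries at most
log2(52) ≈ 5.70 bits of information, so under the (idealised) assumption that
the letters are equiprobable and independent, 21000 such bytes can't compress
below roughly:

```python
import math

# Ideal lower bound: 21000 bytes * log2(52) bits per byte, divided by
# 8 bits per output byte. Assumes equiprobable, independent letters.
ideal = 21000 * math.log2(52) / 8
print(round(ideal))  # 14964 - in the same ballpark as bz2's 15220
```

So bz2 is getting fairly close to the theoretical floor here.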
What it found was an imbalance in the frequencies of byte values - you used
52 values lots of times, and the other 204 never. Huffman coding means those
52 values will get fairly short codes, and if you happened to have just one
or two other byte values, they'd be represented by longer codes.

It's like Morse code - by encoding some letters with very short sequences
(dot followed by end-of-letter for E, dash followed by end-of-letter for T)
and others with much longer sequences (dash-dot-dot-dash-EOL for X), it
manages a fairly compact representation of typical English text. The average
Morse sequence length for a letter is 3.19, but on real-world data... well,
I used the body of your email as sample text (yes, I'm aware it's not all
English), and calculated a weighted average of 2.60. (Non-alphabetics are
ignored, and the text is case-folded.) Using the entire text of Gilbert &
Sullivan's famous operettas, or the text of "The Beauty Stone", or the
wikitext source of the Wikipedia article on Morse code, gave similar results
(ranging from 2.56 to 2.60); interestingly, a large slab of Lorem Ipsum
skewed the numbers slightly lower (2.52), not higher as I'd feared, despite
being more 'random'.

Further example: os.urandom() returns arbitrary byte values, and (in theory,
at least) has equal probability of returning every possible value. Base 64
encoding that data makes three bytes come out as four. Check this out:

>>> data = os.urandom(21000)
>>> len(base64.b64encode(data))  # just to make sure
28000
>>> len(codecs.encode(data, 'bz2'))
21458
>>> len(codecs.encode(base64.b64encode(data), 'bz2'))
21290

When you remove the redundancy in b64-encoded data, you basically... get
back what you started with. (Curiously, several repeated os.urandommings
showed results consistent with the above - 214xx for direct 'compression'
vs 212xx for b64-then-compress. But in both cases, it's larger than the
21000 bytes of input.)
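For reference, the weighted-average calculation described above can be
sketched like this (the MORSE table and helper name are illustrative, not
from the original post - standard International Morse sequences, with
non-alphabetics ignored and the text case-folded, as stated):

```python
# Standard International Morse sequences for the 26 letters.
MORSE = {
    'a': '.-',    'b': '-...',  'c': '-.-.',  'd': '-..',
    'e': '.',     'f': '..-.',  'g': '--.',   'h': '....',
    'i': '..',    'j': '.---',  'k': '-.-',   'l': '.-..',
    'm': '--',    'n': '-.',    'o': '---',   'p': '.--.',
    'q': '--.-',  'r': '.-.',   's': '...',   't': '-',
    'u': '..-',   'v': '...-',  'w': '.--',   'x': '-..-',
    'y': '-.--',  'z': '--..',
}

def avg_morse_length(text):
    """Weighted average Morse sequence length over the letters of text.

    Non-alphabetics are ignored; the text is case-folded first.
    """
    letters = [c for c in text.lower() if c in MORSE]
    return sum(len(MORSE[c]) for c in letters) / len(letters)
```

Run over ordinary English prose, this lands in the 2.5-2.6 range quoted
above, because common letters like E and T dominate with short codes.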
ChrisA
--
https://mail.python.org/mailman/listinfo/python-list