I thought I'd experiment with some of Python's compression utilities. First, let's try compressing some extremely non-random data:
py> import codecs
py> data = "something non-random."*1000
py> len(data)
21000
py> len(codecs.encode(data, 'bz2'))
93
py> len(codecs.encode(data, 'zip'))
99

Those are really good results. Both the bz2 and zlib compressors (the 'zip' codec is an alias for zlib) have been able to squeeze out nearly all of the redundancy in the data. What if we shuffle the data so it is more random?

py> import random
py> data = list(data)
py> random.shuffle(data)
py> data = ''.join(data)
py> len(data); len(codecs.encode(data, 'bz2'))
21000
10494

How about some really random data?

py> import string
py> data = ''.join(random.choice(string.ascii_letters) for i in range(21000))
py> len(codecs.encode(data, 'bz2'))
15220

That's actually better than I expected: it found some redundancy and saved about a quarter of the space. What if we try compressing data which has already been compressed?

py> cdata = codecs.encode(data, 'bz2')
py> len(cdata); len(codecs.encode(cdata, 'bz2'))
15220
15688

There's no shrinkage at all; compression has actually increased the size. What if we use some data which is random, but heavily biased?

py> values = string.ascii_letters + ("AAAAAABB")*100
py> data = ''.join(random.choice(values) for i in range(21000))
py> len(data); len(codecs.encode(data, 'bz2'))
21000
5034

So we can see that the bz2 compressor is capable of making use of deviations from uniformity, but the more random the initial data is, the less effective it will be.
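How much a compressor can exploit that bias is quantifiable: random.choice picks uniformly from `values`, so we can compute the Shannon entropy of that distribution. A sketch (the calculation is my addition, in Python 3, not part of the session above):

import math
import string
from collections import Counter

values = string.ascii_letters + "AAAAAABB" * 100

# random.choice draws uniformly from `values`, so each character's
# probability is its share of that 852-character string.
counts = Counter(values)
total = len(values)

# Shannon entropy in bits per character.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy)              # roughly 1.4 bits per character
print(21000 * entropy / 8)  # a floor of roughly 3700 bytes for 21000 draws

bz2's 5034 bytes is above that floor, but in the same ballpark.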
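The same argument explains the quarter saved on the uniformly random letters: ascii_letters uses only 52 of the 256 possible byte values, about log2(52) ≈ 5.7 bits per character, giving a floor of roughly 21000 * 5.7 / 8 ≈ 15000 bytes, so bz2's 15220 is close to optimal. By contrast, truly uniform bytes carry the full 8 bits each, and (another sketch of my own) bz2 shouldn't be able to shrink them at all:

import codecs
import os

# 21000 uniformly random bytes: ~8 bits of entropy each, no bias to exploit.
data = os.urandom(21000)
print(len(codecs.encode(data, 'bz2_codec')))  # a little over 21000: pure overhead

which matches the behaviour of the already-compressed data above.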
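One last porting note: the session above passes str to the codecs, so it's Python 2. On Python 3 these codecs are bytes-to-bytes, so you need bytes input; the '_codec' spellings are the portable names. A minimal sketch of the first experiment, assuming Python 3:

import codecs

data = b"something non-random." * 1000
print(len(data))                               # 21000
print(len(codecs.encode(data, 'bz2_codec')))   # 93, matching the session above
print(len(codecs.encode(data, 'zlib_codec')))  # 99 ('zip' is an alias for zlib)

-- 
Steve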