On 2017-10-24 22:30, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
> On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:
>
>> On 2017-10-23 04:21, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
>>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>>
>>> If the probabilities of certain codes (either single codes, or
>>> sequences of codes) are non-equal, then you can take advantage of that
>>> by encoding the common cases into a short representation, and the
>>> uncommon and rare cases into a longer representation. As you say:
>>>
>>>> Otherwise, if ( 0, 0 ) is much more frequent,
>>>> we can encode ( 0, 0 ) by "0" and
>>>>
>>>> ( 0, 1 ) by "101",
>>>> ( 1, 0 ) by "110", and
>>>> ( 1, 1 ) by "111".
>>>>
>>>> And we could then use /less/ than two bits on the
>>>> average.
>>>
>>> That's incorrect. On average you use 2.5 bits.
>>>
>>> (1*1 bit + 3*3 bits divided by four possible outcomes makes 2.5 bits.)
>>
>> I disagree. If the distribution is not equal, then the average needs to
>> take the different probabilities into account.
>
> I think I would call that the *weighted* average rather than the average.
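Call it what you like; it is the figure that matters here. To put a
(made-up) number on Stefan's "much more frequent", suppose the pair
( 0, 0 ) turns up 85% of the time and each of the other three pairs 5%
of the time:

# Expected length of Stefan's code under an assumed, purely illustrative
# distribution in which ( 0, 0 ) dominates.
lengths = {(0, 0): 1, (0, 1): 3, (1, 0): 3, (1, 1): 3}
probs   = {(0, 0): 0.85, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.05}

weighted = sum(probs[pair] * lengths[pair] for pair in lengths)
print(round(weighted, 3))   # 1.3 bits per pair -- less than 2

# With all four pairs equally likely you get Steve's figure instead:
uniform = sum(0.25 * bits for bits in lengths.values())
print(uniform)              # 2.5 bits per pair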
If there are 4 guys who are 180 cm tall and one who is 2 metres tall, would
you say their average height is 184 cm ((180 * 4 + 200 * 1)/(4 + 1)) or
190 cm ((180 + 200)/2)? For me that's the same situation.

> Regardless of what we call it, of course both you and Stefan are right in
> how to calculate it, and such a variable-length scheme can be used to
> compress the data.
>
> E.g. given the sequence 00000011 which would take 8 bits in the obvious
> encoding, we can encode it as "000111" which takes only 6 bits.
>
> But the cost of this encoding scheme is that *some* bit sequences expand,
> e.g. the 8 bit sequence 11111100 is encoded as "1111111110" which
> requires 10 bits.

Right. This is most obvious in Huffman encoding, where each symbol is
replaced by a sequence of bits whose length depends on the frequency of
that symbol. So the letter 'e' might be encoded in 3 or 4 bits (in a
corpus of text I happen to have lying around, about 1 in 11 characters is
an 'e', and ld(11) ≈ 3.46), while the tilde (about 1 character in 21000)
would be encoded in 14 or 15 bits.

That's basically how all lossless compression algorithms work: they
replace more frequent bit sequences with shorter ones and less frequent
sequences with longer ones. (Although most operate on variable-length byte
sequences rather than on individual bytes. I find the LZW algorithm (as
used in Unix compress and GIF images) a particularly instructive example.)

> The end result is that averaged over all possible bit sequences (of a
> certain size), this encoding scheme requires MORE space than the obvious
> 0/1 bits.
>
> But in practice we don't care much, because the data sequences we care
> about are usually not "all possible bit sequences", but a heavily
> restricted subset where there are lots of 00 pairs and fewer 01, 10, and
> 11 pairs.

Right.

        hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | h...@hjp.at        | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
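PS: Here is a quick sketch of the pair code above, in case anyone wants to
check Steve's numbers. The table is Stefan's; the function around it is
just my five-minute illustration, not production code:

# Stefan's variable-length code for bit pairs.
CODE = {"00": "0", "01": "101", "10": "110", "11": "111"}

def encode(bits):
    """Encode a string of bits (of even length) pair by pair."""
    return "".join(CODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode("00000011"))   # 000111      -- 6 bits instead of 8
print(encode("11111100"))   # 1111111110  -- 10 bits instead of 8

Decoding is unambiguous, by the way, because none of the codewords "0",
"101", "110", "111" is a prefix of another, so the encoded stream can be
read back strictly left to right.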
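And since I praised LZW above, here is a minimal sketch of its compression
side, just to show how the dictionary accumulates longer and longer
entries as the input repeats itself. It emits integer codes and skips the
variable-width bit packing that Unix compress and GIF actually do, so
treat it as an illustration of the idea, not the real thing:

# LZW compression, reduced to its core: start with all single bytes in
# the dictionary and keep adding longest-match-plus-one-byte entries.
def lzw_compress(data):
    table = {bytes([i]): i for i in range(256)}
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                    # keep extending the current match
        else:
            out.append(table[w])      # emit the code for the match so far
            table[wc] = len(table)    # and remember the longer sequence
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

print(lzw_compress(b"abababababab"))
# [97, 98, 256, 258, 257, 260] -- 6 codes for 12 input bytes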