On 02/19/2015 10:45 AM, janhein.vanderb...@gmail.com wrote:
On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote:
I'm not necessarily doubting it, just challenging you to provide a data
sample that actually shows it.  And of course, I'm not claiming that
7bit is in any way optimal.  You cannot define optimal without first
defining the distribution.
Weird results.
In what way weird?

For a character size 2 the growth processes are shown below.
I listed the decimal representations, the difficult representation, a stop bit 
encoding, and the number of characters they differ in length:
0:  00                          00                              0
1:  01                          01                              0
2:  10, 00                      10, 00                          0
3:  10, 01                      10, 01                          0
4:  10, 10                      11, 00                          0
5:  10, 11                      11, 01                          0
6:  11, 00.00                   11, 10, 00                      0
7:  11, 00.01                   11, 10, 01                      0
8:  11, 00.10                   11, 11, 00                      0
9:  11, 00.11                   11, 11, 01                      0
10: 11, 01.00                   11, 11, 10, 00                  1
11: 11, 01.01                   11, 11, 10, 01                  1
12: 11, 01.10                   11, 11, 11, 00                  1
13: 11, 01.11                   11, 11, 11, 01                  1
14: 11, 10.00, 00               11, 11, 11, 10, 00              1
15: 11, 10.00, 01               11, 11, 11, 10, 01              1
16: 11, 10.00, 10               11, 11, 11, 11, 00              1
17: 11, 10.00, 11               11, 11, 11, 11, 01              1
18: 11, 10.01, 00.00            11, 11, 11, 11, 10, 00          1
19: 11, 10.01, 00.01            11, 11, 11, 11, 10, 01          1
20: 11, 10.01, 00.10            11, 11, 11, 11, 11, 00          1
21: 11, 10.01, 00.11            11, 11, 11, 11, 11, 01          1
22: 11, 10.01, 01.00            11, 11, 11, 11, 11, 10, 00      2
23: 11, 10.01, 01.01            11, 11, 11, 11, 11, 10, 01      2
24: 11, 10.01, 01.10            11, 11, 11, 11, 11, 11, 00      2
25: 11, 10.01, 01.11            11, 11, 11, 11, 11, 11, 01      2
26: 11, 10.01, 10.00            11, 11, 11, 11, 11, 11, 10, 00  3

I didn't take the time to prove it mathematically, but these results suggest to 
me that the complicated encoding beats the stop bit encoding.

Clearly, one wouldn't use what you call "stop bit encoding" with a 2bit 
character.  And since the Python code you posted uses an 8bit character, 
I can't validate the "difficult" column either.
I wrote the following pair of functions:

def seven_code(n):
    acc = bytearray()
    if n == 0:
        acc.append(0)
    while n > 0:
        quotient, remainder = divmod(n, 128)
        acc.append(remainder)
        n = quotient
    acc[-1] |= 0x80            #add a stop bit to last byte
    return acc

def seven_decode(sequ):
    acc = 0
    multiplier = 1
    for by in sequ:
        acc += (by & 0x7f) * multiplier
        if by > 0x7f:
            break
        multiplier *= 128
    return acc

Here's a couple of ranges of output, showing that the 7bit scheme does better for values between 384 and 16379.
382 2 80fe --- 2 7e82
383 2 80ff --- 2 7f82
384 3 810000 --- 2 0083
384  jan grew 3 810000
385 3 810001 --- 2 0183
386 3 810002 --- 2 0283
387 3 810003 --- 2 0383
388 3 810004 --- 2 0483
389 3 810005 --- 2 0583

16380 3 813e7c --- 2 7cff
16380  jan grew 3 813e7c
16380 7bit grew 2 7cff
16381 3 813e7d --- 2 7dff
16382 3 813e7e --- 2 7eff
16383 3 813e7f --- 2 7fff
16384 3 813e80 --- 3 000081
16384 7bit grew 3 000081
16385 3 813e81 --- 3 010081
16386 3 813e82 --- 3 020081
16387 3 813e83 --- 3 030081
16388 3 813e84 --- 3 040081
16389 3 813e85 --- 3 050081

In all my experimenting, I haven't found any values where the 7bit scheme does worse. It seems likely that for extremely large integers, it will, but if those are to be the intended distribution, the 7bit scheme could be replaced by something else, like just encoding a length at the beginning, and using raw bytes after that.

--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to