On Thu, 28 Apr 2005 05:07:34 GMT, [EMAIL PROTECTED] (Bengt Richter) wrote: ... some not quite correct code ;-/ (I copy/pasted and created an illusion. My code dict has no EOS, so I decode pad zero bits as code that a single zero stands for ('a' in this case) so that was an oversight. I should have extended the coding some more or reserved perhaps 'F' as EOString, and tested for EOS in the decode stream.
> This is revised, sorry. You can make a 'way more efficient decoder as an exercise ;-) Hint: if you make a dict of all the 2**width integers as keys where width is your widest code (4 here for 2**4), then you can do a single mask of 2**width-1 and look up the translation to (char, codewidth) directly, according to the codewidth least significant bits, if you make all the states of the other bits into keys that map to the same (char,codewidth). So for example all the even values of out 2**16 would have to map to ('a', 1) and all the values with 2 LSBs of 01 (of which there are 4) would map to ('b',2) and so forth. Thus the incrementing of width and trying one thing after another is not necessary. I think that will work ... Well, now Andrea has something to do ;-) >----< trivcodec.py >------------------------------------------------ #trivcodec.py EOS = '' import itertools def encode(stream): outchar = count = 0 for char in itertools.chain(stream, [EOS]): bits, width = codes[char] outchar |= (bits<<count) count += width while count >= 8: yield chr(outchar&0xff) outchar >>= 8 count -= 8 if count: yield chr(outchar) def decode(stream): codebuf = count = 0 width = 1; char = None for codebyte in stream: codebyte = ord(codebyte) codebuf |= (codebyte<<count) count += 8 if width>count: continue while width <= count: code = (codebuf&((1<<width)-1), width) if code not in chars: width += 1 continue char = chars[code] if char == EOS: break yield char codebuf >>= width count -= width width = 1 if char == EOS: break width = 1 #trivial encoding dict: a=0 b=01 C=0011 D=0111 E=1011 EOS=1111 codes = {'a':(0x00,1), 'b':(0x01, 2), 'C':(0x3+0x4*0,4), 'D':(0x3+0x4*1,4), 'E':(0x3+0x4*2,4), EOS:(0x3+0x4*3,4)} chars = dict([(v,k) for k,v in codes.items()]) # reverse def test(*tests): if not tests: tests = ['abaDECbaabbD'] for charstream in tests: print codestream = ''.join(list(encode(charstream))) print '%r [%s] -(encode)-> %r [%s]' % ( charstream, len(charstream), codestream, len(codestream)) recovered = ''.join(list(decode(codestream))) print '%r [%s] -(decode)-> %r [%s]' % ( codestream, len(codestream), recovered, len(recovered)) if __name__ == '__main__': import sys test(*sys.argv[1:]) >-------------------------------------------------------------------- > >Result: Not really. Tack enough a's on the recovered chars to account for zero fill of the last encoded byte ;-/ > >[22:02] C:\pywk\clp>py24 trivcodec.py > >'abaFECbaabbDF' [13] -(encode)-> '\xf2;Q\xf7' [4] >'\xf2;Q\xf7' [4] -(decode)-> 'abaFECbaabbDF' [13] Fine, because the [4] bytes were totally filled. > >[22:02] C:\pywk\clp>py24 trivcodec.py a b C D E F > >'a' [1] -(encode)-> '\x00' [1] >'\x00' [1] -(decode)-> 'a' [1] Not so fine. There is no character serving as EOS mark. > >'b' [1] -(encode)-> '\x01' [1] >'\x01' [1] -(decode)-> 'b' [1] > >'C' [1] -(encode)-> '\x03' [1] >'\x03' [1] -(decode)-> 'C' [1] > >'D' [1] -(encode)-> '\x07' [1] >'\x07' [1] -(decode)-> 'D' [1] > >'E' [1] -(encode)-> '\x0b' [1] >'\x0b' [1] -(decode)-> 'E' [1] > >'F' [1] -(encode)-> '\x0f' [1] >'\x0f' [1] -(decode)-> 'F' [1] That really produced: [ 8:03] C:\pywk\clp>py24 trivcodec.py a b C D E F 'a' [1] -(encode)-> '\x00' [1] '\x00' [1] -(decode)-> 'aaaaaaaa' [8] 'b' [1] -(encode)-> '\x01' [1] '\x01' [1] -(decode)-> 'baaaaaa' [7] 'C' [1] -(encode)-> '\x03' [1] '\x03' [1] -(decode)-> 'Caaaa' [5] 'D' [1] -(encode)-> '\x07' [1] '\x07' [1] -(decode)-> 'Daaaa' [5] 'E' [1] -(encode)-> '\x0b' [1] '\x0b' [1] -(decode)-> 'Eaaaa' [5] 'F' [1] -(encode)-> '\x0f' [1] '\x0f' [1] -(decode)-> 'Faaaa' [5] So we need to assign a code as the EOS. And probably translate that to '' as the return char, and end on that. So now it does (having given up 1111 for EOS instead of F): [10:22] C:\pywk\clp>py24 trivcodec.py 'abaDECbaabbD' [12] -(encode)-> 'r;Q\xf7' [4] 'r;Q\xf7' [4] -(decode)-> 'abaDECbaabbD' [12] [10:22] C:\pywk\clp>py24 trivcodec.py a b C D E 'a' [1] -(encode)-> '\x1e' [1] '\x1e' [1] -(decode)-> 'a' [1] 'b' [1] -(encode)-> '=' [1] '=' [1] -(decode)-> 'b' [1] 'C' [1] -(encode)-> '\xf3' [1] '\xf3' [1] -(decode)-> 'C' [1] 'D' [1] -(encode)-> '\xf7' [1] '\xf7' [1] -(decode)-> 'D' [1] 'E' [1] -(encode)-> '\xfb' [1] '\xfb' [1] -(decode)-> 'E' [1] Goes to show you, eh? Slice and dice similar lines carefully ;-) Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list