[issue2857] Add "java modified utf-8" codec

Serhiy Storchaka Tue, 24 Apr 2012 04:01:46 -0700

Serhiy Storchaka <storch...@gmail.com> added the comment:

As far as I understand, this codec can be implemented in Python. There is no 
need to modify the interpreter core.


import re

def decode_cesu8(b):
    # Decode with surrogates allowed, then join each surrogate pair.
    return re.sub('[\uD800-\uDBFF][\uDC00-\uDFFF]',
                  lambda m: chr(0x10000
                                + ((ord(m.group()[0]) & 0x3FF) << 10)
                                + (ord(m.group()[1]) & 0x3FF)),
                  b.decode('utf-8', 'surrogatepass'))

def encode_cesu8(s):
    # Split each non-BMP character into a surrogate pair, then encode.
    return re.sub('[\U00010000-\U0010FFFF]',
                  lambda m: chr(0xD800 | ((ord(m.group()) - 0x10000) >> 10))
                            + chr(0xDC00 | (ord(m.group()) & 0x3FF)),
                  s).encode('utf-8', 'surrogatepass')

def decode_mutf8(b):
    return decode_cesu8(b.replace(b'\xC0\x80', b'\x00'))

def encode_mutf8(s):
    return encode_cesu8(s).replace(b'\x00', b'\xC0\x80')
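
To make the "no need to modify the interpreter core" point concrete, here is a minimal sketch of exposing such functions through the standard codecs registry. The codec name 'xmutf8' and every helper name below are made up for illustration, not anything proposed on the tracker, and the errors argument is accepted but ignored. The surrogate arithmetic uses the usual subtract/add of 0x10000.

```python
import codecs
import re

# Hypothetical helper names; only codecs.register/CodecInfo are stdlib API.
_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')
_ASTRAL = re.compile('[\U00010000-\U0010ffff]')

def _join(m):
    # Combine a high/low surrogate pair into one non-BMP character.
    hi, lo = m.group()
    return chr(0x10000 + ((ord(hi) & 0x3FF) << 10) + (ord(lo) & 0x3FF))

def _split(m):
    # Split one non-BMP character into a high/low surrogate pair.
    v = ord(m.group()) - 0x10000
    return chr(0xD800 | (v >> 10)) + chr(0xDC00 | (v & 0x3FF))

def mutf8_encode(s, errors='strict'):
    # errors is ignored in this sketch.
    data = _ASTRAL.sub(_split, s).encode('utf-8', 'surrogatepass')
    return data.replace(b'\x00', b'\xc0\x80'), len(s)

def mutf8_decode(b, errors='strict'):
    # errors is ignored in this sketch.
    b = bytes(b)
    text = b.replace(b'\xc0\x80', b'\x00').decode('utf-8', 'surrogatepass')
    return _PAIR.sub(_join, text), len(b)

def _search(name):
    # Search function: answer only for the made-up name 'xmutf8'.
    if name == 'xmutf8':
        return codecs.CodecInfo(mutf8_encode, mutf8_decode, name='xmutf8')
    return None

codecs.register(_search)
```

After registration, 'A\x00'.encode('xmutf8') and the matching bytes.decode('xmutf8') go through the ordinary str/bytes API. Incremental and stream variants are left out here.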

----------
nosy: +storchaka

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2857>
_______________________________________