MonkeeSage wrote:
> On Dec 3, 1:31 am, MonkeeSage <[EMAIL PROTECTED]> wrote:
>> On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
>>
>>> Michael Goerz wrote:
>>>> Hi,
>>>> I am writing unicode strings into a special text file that requires
>>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
>>>> For example, the letter "Í" (latin capital I with acute, code point 205)
>>>> would come out as "\303\215".
>>>> I will also have to read back from the file later on and convert the
>>>> escaped characters back into a unicode string.
>>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
>>>> vice versa?
>>> Perhaps something along the lines of:
>>>
>>>  >>> def encode(source):
>>>  ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
>>>  ...
>>>  >>> def decode(encoded):
>>>  ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
>>>  ...     return bytes.decode('utf8')
>>>  ...
>>>  >>> encode(u"Í")
>>>  '\\303\\215'
>>>  >>> print decode(_)
>>>  Í
>>> HTH
>>> Michael
>> Nice one. :) If I might suggest a slight variation to handle cases
>> where the "encoded" string contains plain text as well as octal
>> escapes...
>>
>> def decode(encoded):
>>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>>     return encoded.decode('utf8')
>>
>> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
>> as well as "adf\\303\\215adf".
>>
>> Regards,
>> Jordan
>
> err...
>
> def decode(encoded):
>     for octc in re.findall(r'\\(\d{3})', encoded):
>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return encoded.decode('utf8')

Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ascii and non-printable characters,
and it takes and returns unicode strings. Low ascii values are now also
encoded as 3-digit octal sequences, so that decode can catch them
properly.
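Thanks a lot,
Michael

____________

# -*- coding: utf-8 -*-
import re

def encode(source):
    # Octal-escape the UTF-8 bytes of control and non-ascii characters;
    # printable ascii is copied through unchanged.
    encoded = ""
    for character in source:
        # 127 (DEL) is not printable, and 128 is already non-ascii,
        # so everything above 126 has to be escaped.
        if (ord(character) < 32) or (ord(character) > 126):
            for byte in character.encode('utf8'):
                encoded += ("\\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    # Replace every backslash + 3 digits with the byte it stands for,
    # then interpret the resulting byte string as UTF-8.
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec

--
http://mail.python.org/mailman/listinfo/python-list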
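
For readers on Python 3, the same idea might look roughly like the sketch below. It is not part of the original post: the type hints, the `_OCTAL_ESCAPE` name, and the switch from a `replace` loop to a single `re.sub` pass over bytes are all assumptions made for illustration. The escaping rule is meant to match the description above (control characters and everything above printable ascii become 3-digit octal escapes of their UTF-8 bytes).

```python
import re


def encode(source: str) -> str:
    """Octal-escape the UTF-8 bytes of control and non-ASCII characters."""
    pieces = []
    for character in source:
        if 32 <= ord(character) < 127:
            # Printable ASCII is copied through unchanged.
            pieces.append(character)
        else:
            # Iterating over a bytes object yields ints in Python 3,
            # so no ord() call is needed here.
            pieces.extend("\\%03o" % byte for byte in character.encode("utf-8"))
    return "".join(pieces)


# A backslash followed by three octal digits no larger than 377 (one byte).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode(encoded: str) -> str:
    """Turn each octal escape back into one byte, then decode as UTF-8."""
    raw = encoded.encode("utf-8")
    # One pass with re.sub: bytes produced by a replacement are never
    # rescanned, unlike repeated str.replace calls.
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


if __name__ == "__main__":
    orig = "blaÍblub\n"
    enc = encode(orig)
    print(enc)
    print(decode(enc) == orig)
```

One caveat applies to this sketch and to the code above alike: a literal backslash followed by three octal digits in the input is indistinguishable from an escape after encoding, so such text will not round-trip. Escaping the backslash itself (as \134) in encode would be one way around that.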