On Oct 23, 3:37 am, kettle <[EMAIL PROTECTED]> wrote:
> Hi,
> I am rather new to python, and am currently struggling with some
> encoding issues. I have some utf-8-encoded text which I need to
> encode as iso-2022-jp before sending it out to the world. I am using
> python's encode functions:
> --
> var = var.encode("iso-2022-jp", "replace")
> print var
> --
>
> I am using the 'replace' argument because there seem to be a couple
> of utf-8 japanese characters which python can't correctly convert to
> iso-2022-jp. The output looks like this:
> ↓東京???日比谷線?北千住行
>
> However if use perl's encode module to re-encode the exact same bit
> of text:
> --
> $var = encode("iso-2022-jp", decode("utf8", $var))
> print $var
> --
>
> I get proper output (no unsightly question-marks):
> ↓東京メトロ日比谷線・北千住行
>
> So, what's the deal?

Luckily my crystal ball is working. I can see clearly that the fourth
character of the input is 'HALFWIDTH KATAKANA LETTER ME' (U+FF92), which
is not present in ISO-2022-JP as defined by RFC 1468, so python converts
it into a question mark, as you requested. Meanwhile perl, as usual, tries
to guess what you want and silently converts that character into
'KATAKANA LETTER ME' (U+30E1), which is present in ISO-2022-JP.

> Why can't python properly encode some of these
> characters?

Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about the width of characters? What about
full-width " (U+FF02)? Python doesn't know the answers to these
questions, so it doesn't do anything with your input. You have to do it
yourself. Assuming you don't care about roundtripping and width, here is
an example demonstrating how to deal with narrow characters:

from unicodedata import normalize
# map each code point in FF61-FFDF to its NFKC (compatibility) form
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.

Note, this is just an example; I *don't* claim it does what you want for
every character in the FF61-FFDF range. You may want to carefully review
the whole unicode block: http://www.unicode.org/charts/PDF/UFF00.pdf

  -- Leo.
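
For anyone reading this on Python 3 (where unichr is spelled chr and every str is unicode), the same idea can be sketched roughly as below. The names HALFWIDTH_TO_NFKC and to_iso2022jp are made up for illustration. The extra NFC pass is my own assumption, not part of the example above: a halfwidth voiced sound mark (U+FF9E) maps to a combining mark, and NFC recombines it with the preceding katakana so the result has a chance of being encodable. The same caveat applies: this is a sketch, not a claim that it does the right thing for the whole FF61-FFDF range.

```python
from unicodedata import normalize

# Map each code point in the halfwidth range FF61-FFDF to its NFKC form.
# (Hypothetical name; the range is the one used in the example above.)
HALFWIDTH_TO_NFKC = {
    cp: normalize('NFKC', chr(cp)) for cp in range(0xFF61, 0xFFE0)
}


def to_iso2022jp(text):
    """Widen halfwidth characters, then encode as ISO-2022-JP.

    Anything still unencodable becomes '?', exactly as with a plain
    .encode("iso-2022-jp", "replace").
    """
    widened = text.translate(HALFWIDTH_TO_NFKC)
    # Assumption: recompose base katakana + combining (semi-)voiced mark.
    composed = normalize('NFC', widened)
    return composed.encode('iso-2022-jp', 'replace')


# HALFWIDTH KATAKANA LETTER ME round-trips as KATAKANA LETTER ME.
me = to_iso2022jp('\uFF92').decode('iso-2022-jp')

# Halfwidth KA followed by a halfwidth voiced sound mark becomes GA.
ga = to_iso2022jp('\uFF76\uFF9E').decode('iso-2022-jp')

print(ascii(me), ascii(ga))
```

This loses the width distinction for good, so it only makes sense if you really don't care about roundtripping.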