Thanks Leo, and everyone else; these were very helpful replies. The issue was exactly as Leo described, and I apologize for not being aware of it and thus not reporting it quite correctly.
At the moment I don't care about round-tripping between half-width and full-width kana; rather, I only need to be able to rely on any particular kana character being translated correctly to its half-width or full-width equivalent, and I need the Japanese I send out to be readable.

I appreciate the 'implicit versus explicit' point, and have read about it on a few different Python mailing lists. In this instance it seems that Perl perhaps ought to emit a warning about what it is doing, but as this conversion between half-width and full-width characters is by far the most logical one available, it also seems reasonable that Python might include such a capability by default, just as it currently includes the 'replace' option for mapping missed characters generically to '?'.

I still haven't worked out the entire mapping routine, but Leo's hint is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe

> Thanks that I have my crystal ball working. I can see clearly that the
> fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME' (U+FF92),
> which is not present in ISO-2022-JP as defined by RFC 1468, so python
> converts it into a question mark as you requested. Meanwhile perl, as usual,
> is trying to guess what you want and silently converts that character into
> 'KATAKANA LETTER ME' (U+30E1), which is present in ISO-2022-JP.
>
> > Why can't python properly encode some of these characters?
>
> Because "Explicit is better than implicit". Do you care about
> roundtripping? Do you care about width of characters? What about
> full-width " (U+FF02)? Python doesn't know answers to these questions, so
> it doesn't do anything with your input. You have to do it yourself.
>
> Assuming you don't care about roundtripping and width, here is an example
> demonstrating how to deal with narrow characters:
>
>     from unicodedata import normalize
>     iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
>                              for i in range(0xFF61, 0xFFE0))
>     print repr(u'\uFF92'.translate(iso2022_squeezing))
>
> It prints u'\u30e1'. Feel free to ask questions if something is not clear.
>
> Note, this is just an example; I *don't* claim it does what you want for
> any character in the FF61-FFDF range. You may want to carefully review the
> whole Unicode block: http://www.unicode.org/charts/PDF/UFF00.pdf
>
> -- Leo.

--
http://mail.python.org/mailman/listinfo/python-list
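
For anyone finding this thread later, here is a rough Python 3 sketch of the same idea, in case it helps. It is not the finished mapping routine, only one way it might look. The names `HALFWIDTH_TO_FULLWIDTH` and `widen_halfwidth_kana` are made up for illustration, and I have assumed that only the half-width punctuation and katakana range (U+FF61 to U+FF9F) needs widening, which is narrower than the FF61-FFDF range in Leo's example. The extra NFC pass is also my assumption: it recombines a kana followed by a separate voiced sound mark into one character, but it will also compose any other decomposed sequences in the text.

```python
import unicodedata

# Assumption: only half-width CJK punctuation and katakana (U+FF61..U+FF9F)
# need widening. Each code point maps to its NFKC (compatibility) form.
HALFWIDTH_TO_FULLWIDTH = {
    cp: unicodedata.normalize("NFKC", chr(cp))
    for cp in range(0xFF61, 0xFFA0)
}


def widen_halfwidth_kana(text):
    """Replace half-width kana with full-width equivalents (hypothetical helper)."""
    widened = text.translate(HALFWIDTH_TO_FULLWIDTH)
    # Half-width voiced marks become combining marks (U+3099 / U+309A);
    # NFC folds "KA + voiced mark" into the single character GA.
    return unicodedata.normalize("NFC", widened)


sample = "\uff92"  # HALFWIDTH KATAKANA LETTER ME

# Without widening, the character is not encodable and 'replace' substitutes it.
print(sample.encode("iso2022_jp", errors="replace"))

# After widening, the text encodes without needing an error handler.
print(widen_halfwidth_kana(sample).encode("iso2022_jp"))
```

Whether this is the right behaviour still depends on the questions Leo raised: the conversion is lossy, so text that has been widened cannot be converted back to its original half-width form.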