Just a tip for those who are only just cutting their teeth on Python 3.0 and might have encountered the same problem as I did:

When a Python (3.x) program is run on a terminal that only supports a legacy character encoding - such as Latin 1 or Codepage 437 - all characters printed to stdout will be automatically converted from the interpreter's internal Unicode representation to this legacy character set.

This is a nice feature to have, of course, but if the original Unicode string contains characters for which there is no equivalent in the terminal's legacy character set, you will get the dreaded "UnicodeEncodeError" exception.
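
For example, on a terminal whose encoding is detected as Latin 1, something as innocent as this blows up:

--- 8< ---

>>> print('\u2026')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

--- 8< ---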

The reason is that both the "sys.stdout" and "sys.stderr" streams have been hardwired to do their character encoding magic using, by default, the 'strict' error handling scheme:

--- 8< ---

>>> import sys
>>> sys.stdout.errors
'strict'
>>> sys.stderr.errors
'strict'

--- 8< ---

So, essentially, printing anything but ASCII to stdout is not really safe in Python 3 unless you know beforehand, for sure, which characters the terminal supports - which, at least in my mind, kind of defeats the whole purpose of those automatic, implicit conversions.

Now, I have written a more flexible custom error handler myself and registered it with Python's codec system, using the codecs.register_error() function. When the handler encounters a problematic code point, it either suggests a "similar-enough" Latin 1 or ASCII substitution for it or, if there is none available in its internal conversion table, simply prints it out using the U+xxxx notation. With this handler in place, the "UnicodeEncodeError" exception can never occur.
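
For illustration, here is a stripped-down sketch of what such a handler can look like. (The handler name and the tiny substitution table are just examples made up for this post, not my actual code.)

--- 8< ---

import codecs

SUBSTITUTES = {
    '\u2018': "'",    # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",    # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',    # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',    # RIGHT DOUBLE QUOTATION MARK
    '\u2013': '-',    # EN DASH
    '\u2026': '...',  # HORIZONTAL ELLIPSIS
}

def substitute(error):
    # An encode error handler receives a UnicodeEncodeError and must
    # return a (replacement string, resume position) tuple; the
    # replacement is then encoded in place of the offending characters.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    parts = []
    for char in error.object[error.start:error.end]:
        # Use a "similar-enough" substitute if we know one; otherwise
        # fall back to the U+xxxx notation.
        parts.append(SUBSTITUTES.get(char, 'U+%04X' % ord(char)))
    return (''.join(parts), error.end)

codecs.register_error('substitute', substitute)

--- 8< ---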

Instead of creating a custom error handler from scratch one could also make use of one of Python's built-in, less restrictive error handlers, such as 'ignore', 'replace', 'xmlcharrefreplace', or 'backslashreplace'.
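
Their effects are easy to compare straight from the interactive prompt:

--- 8< ---

>>> 'naïve'.encode('ascii', 'ignore')
b'nave'
>>> 'naïve'.encode('ascii', 'replace')
b'na?ve'
>>> 'naïve'.encode('ascii', 'xmlcharrefreplace')
b'na&#239;ve'
>>> 'naïve'.encode('ascii', 'backslashreplace')
b'na\\xefve'

--- 8< ---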

But in order to make things work as transparently and smoothly as possible, I needed a way to make both the "sys.stdout" and "sys.stderr" streams actually _use_ my custom error handler, instead of the default one.

Unfortunately, the current implementation of io.TextIOWrapper (in Python 3.0b2, at least) does not yet offer a public, documented interface for changing the codec error handler - or, indeed, the target encoding itself - of an already-opened stream, and this means you can't "officially" change it for the "stdout" or "stderr" streams, either. (The need for this functionality is acknowledged in PEP 3116, but it has apparently not been implemented yet. [1])

So, after examining io.py and scratching my head a bit, here's how one can currently hack one's way around this limitation:

--- 8< ---

import sys

# TextIOWrapper caches its encoder object, so setting the private
# "_errors" attribute by itself is not enough; calling the (equally
# private) "_get_encoder()" method makes the wrapper rebuild its
# cached encoder with the new error handling scheme.
sys.stdout._errors = 'backslashreplace'
sys.stdout._get_encoder()
sys.stderr._errors = 'backslashreplace'
sys.stderr._get_encoder()

--- 8< ---

Issuing these commands makes printing Unicode strings to a legacy terminal a safe procedure again, and you're not going to get unexpected "UnicodeEncodeError" exceptions thrown in your face any longer. (Note: 'backslashreplace' is just an example here; you could substitute the error handler of your choice for it.)
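
On a Latin 1 terminal, for example, the offending characters now degrade gracefully instead of killing the program:

--- 8< ---

>>> print('déjà vu \u2026')
déjà vu \u2026

--- 8< ---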

The downside of this solution is, of course, that it will break down if the private implementation of io.TextIOWrapper in io.py changes in the future. But as a workaround, I feel it is sufficient for now, while waiting for the "real" support to appear in the library.
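
(One stopgap I have been eyeing is to sidestep the private attributes altogether and replace the wrapper objects wholesale, using nothing but the documented io.TextIOWrapper constructor. The sketch below has had no more than a quick smoke test from me, and it has warts of its own: any references to the original stream objects that other code may have saved will go stale, and the old wrappers must be kept alive, since they close the underlying binary buffers when they are garbage-collected.)

--- 8< ---

import io
import sys

# Keep the old wrappers alive; letting them be garbage-collected
# would close the underlying binary buffers out from under us.
_old_stdout, _old_stderr = sys.stdout, sys.stderr

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='backslashreplace',
                              line_buffering=True)
sys.stderr = io.TextIOWrapper(sys.stderr.buffer,
                              encoding=sys.stderr.encoding,
                              errors='backslashreplace',
                              line_buffering=True)

--- 8< ---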

(If there's a cleaner and more future-proof way of doing the same thing right now, I'd of course love to hear about it...)

_____

1. http://mail.python.org/pipermail/python-3000/2008-April/013366.html

--
znark
