OK, for those interested in this sort of thing, this is what I now think is necessary to work with Unicode in Python. Thanks to those who gave feedback, and to Cliff in particular (but any remaining misconceptions are my own!). Here are the results of my attempts to come to grips with this. Comments/corrections welcome...
(Note this is not about which characters one expects to match with \w etc. when compiling regexes with Python's re.UNICODE flag. It's about the encoding of one's source strings when building regexes, in order to match against strings read from files of various encodings.)

I/O: READ TO/WRITE FROM UNICODE STRING OBJECTS. Always use codecs to decode from a specific encoding into a Python Unicode string when reading, and use codecs to encode back to a specific encoding when writing the processed data. A file opened with codecs.open() delivers Unicode strings, decoded from a specific encoding, when you read(), and its write() encodes your Unicode strings into a specific encoding on the way out (which may itself be an encoding of Unicode such as UTF-8).

SOURCE: Save the source as UTF-8 in your editor, tell Python with "# -*- coding: utf-8 -*-", and construct all strings with u'' (or ur'' instead of r''). Then, when you're concatenating strings constructed in your source with strings read with codecs, you needn't worry about conversion issues. (When concatenating byte strings from your source with Unicode strings, Python will, without an explicit decode, assume the byte string is ASCII, which is a strict subset of Unicode; bytes outside the ASCII range, e.g. from ISO-8859-1, will raise a UnicodeDecodeError instead.)

Even if you save the source as UTF-8, tell Python with "# -*- coding: utf-8 -*-", and say myString = "blah", myString is a byte string. To construct a Unicode string you must say myString = u"blah" or myString = unicode("blah"), even if your source is UTF-8. Typing 'u' when constructing all strings isn't too arduous, and it's less effort than passing selected non-ASCII source strings to unicode() and needing to remember where to do it. (You could easily slip a non-ASCII char into a byte string in your code, because most editors and default system encodings will allow this.) Doing everything in Unicode simplifies life.
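A minimal sketch of that round trip (the file names "in.txt" and "out.txt" are just examples): read a file as ISO-8859-1 into a Unicode string, then write it back out as UTF-8. Under Python 2 the value read is a unicode object; the same calls work unchanged on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: decode on read, encode on write, via codecs.open().
import codecs

# Create some ISO-8859-1 sample data to read back.
with codecs.open("in.txt", "w", encoding="iso-8859-1") as f:
    f.write(u"caf\u00e9")

# read() delivers a Unicode string, decoded from ISO-8859-1.
with codecs.open("in.txt", "r", encoding="iso-8859-1") as f:
    text = f.read()

# write() encodes the Unicode string to UTF-8 on the way out.
with codecs.open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

The point is that between the read and the write, everything in the program is a Unicode string; the encodings only exist at the file boundaries.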
Since the source is now UTF-8, and given Unicode support in the editor, it doesn't matter whether you use Unicode escape sequences or literal Unicode characters when constructing strings, since:

>>> u"á" == u"\u00E1"
True

REGEXES: I'm a bit less certain about regexes, but this is how I think it's going to work: now that my regexes are constructed from Unicode strings, and those regexes will be compiled to match against Unicode strings read with codecs, any potential problems with encoding conversion disappear. If I put an en-dash into a regex built using u'', and I happen to have read the file in the ASCII encoding (which doesn't support en-dashes), the regex simply won't match, because the pattern doesn't exist in the /Unicode/ string served up by codecs. There's no actual problem with my string encoding handling; it just means I'm looking for the wrong chars in a Unicode string read from a file not saved in a Unicode encoding.

tIM
--
http://mail.python.org/mailman/listinfo/python-list