On Nov 27, 12:27 am, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > * When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
> > the source file as UTF-8, do I still need to prefix all the strings
> > constructed in the source with u as in myStr = u"blah", even when
> > those strings contain only ASCII or ISO-8859-1 chars? (It would be a
> > bother for me to do this for the complete source I'm working on, where
> > I rarely need chars outside the ISO-8859-1 range.)
>
> Depends on what you want to achieve. If you don't prefix your strings
> with u, they will stay byte string objects, and won't become Unicode
> strings. That should be fine for strings that are pure ASCII; for
> ISO-8859-1 strings, I recommend it is safer to only use Unicode
> objects to represent such strings.
>
> In Py3k, that will change - string literals will automatically be
> Unicode objects.
>
> > * Will python figure it out if I use different encodings in different
> > modules -- say a main source file which is "# -*- coding: utf-8 -*-"
> > and an imported module which doesn't say this (for which python will
> > presumably use a default encoding)?
>
> Yes, it will. The encoding declaration is per-module.
>
> > * If I want to use a Unicode char in a regex -- say an en-dash, U+2013
> > -- in an ASCII- or ISO-8859-1-encoded source file, can I say
>
> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
>
> > then read the source file into a unicode string with codecs.read(),
> > then expect re to match against the unicode string using either of
> > those regexes if the string contains the relevant chars? Or do I need
> > to do make all my regex patterns unicode strings, with u""?
>
> It will work fine if the regular expression restricts itself to ASCII,
> and doesn't rely on any of the locale-specific character classes (such
> as \w). If it's beyond ASCII, or does use such escapes, you better make
> it a Unicode expression.
>
> I'm not actually sure what precisely the semantics is when you match
> an expression compiled from a byte string against a Unicode string,
> or vice versa. I believe it operates on the internal representation,
> so \xf6 in a byte string expression matches with \u00f6 in a Unicode
> string; it won't try to convert one into the other.
>
> Regards,
> Martin
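
For reference, here is a minimal sketch of the regex advice above, assuming every pattern is written as a Unicode literal (the safe option when a pattern goes beyond ASCII). The sample text and the variable names are made up for illustration. The bytes are decoded in memory to stand in for reading a file, which in practice would be something like codecs.open(path, encoding="utf-8").read(), since the codecs module has no read() function. The u"" prefix is accepted by Python 2 and by Python 3.3 and later, so this should run on either.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample data standing in for the contents of a UTF-8 file.
raw_bytes = u"Pages 10\u201320, see Appendix B".encode("utf-8")

# Decode once at the boundary so everything downstream is a Unicode string.
text = raw_bytes.decode("utf-8")

# ASCII-only pattern that avoids locale-dependent classes such as \w.
ascii_regex = re.compile(u"[A-Z]")

# Pattern containing a non-ASCII character (en dash, U+2013).
dash_regex = re.compile(u"\u2013")

# Both patterns are matched against the decoded Unicode text.
capitals = ascii_regex.findall(text)
dash_index = dash_regex.search(text).start()

print(capitals)
print(dash_index)
```

Keeping both the patterns and the subject text as Unicode avoids relying on whatever happens when a byte string pattern meets a Unicode string, which is the case Martin says he is unsure about.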
Thanks Martin, that's a very helpful response to what I was concerned
might be an overly long query.

Yes, I'd read that in Py3k the distinction between byte strings and
Unicode strings would disappear -- I look forward to that...

Tim