On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <ga...@cs.tu-berlin.de> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simple
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII. And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without knowing
something about the encoding. Say you encounter a byte greater than
127. Is it a single non-ASCII character, or is it the leading byte of
a multi-byte character? If the next byte is less than 128, is it an
ASCII character, or a continuation of the previous character? For
UTF-8 you could safely assume it is ASCII, because every byte of a
multi-byte UTF-8 sequence is 128 or greater, but without knowing the
encoding, there is no way to be sure. If you just assume it's ASCII
and manipulate it as such, you could be messing up non-ASCII
characters.

Cheers,
Ian
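
P.S. To make the point about bytes below 128 concrete, here is a minimal sketch in Python 3. The helper name has_ascii_range_bytes and the sample strings are only illustrations that I made up; Shift-JIS is just one example of an encoding where a trailing byte can fall in the ASCII range.

```python
def has_ascii_range_bytes(text, encoding):
    """Return True if any non-ASCII character in `text` encodes to a byte
    sequence that contains a value below 0x80 (hypothetical helper)."""
    for ch in text:
        if ord(ch) > 127 and any(b < 0x80 for b in ch.encode(encoding)):
            return True
    return False


# Katakana "so" in Shift-JIS is two bytes, and the second one is 0x5C,
# which is the ASCII backslash.
sjis = "ソ".encode("shift_jis")
print(sjis)                                    # the trailing byte looks like a backslash
print(has_ascii_range_bytes("ソ", "shift_jis"))
print(has_ascii_range_bytes("ソ", "utf-8"))

# The same two high bytes mean different things in different encodings.
raw = b"caf\xc3\xa9"
print(raw.decode("utf-8"))
print(raw.decode("latin-1"))
```

So code that, say, splits or escapes on backslashes at the byte level would be safe on UTF-8 data but could cut a Shift-JIS character in half.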