On 3/28/2012 1:43 PM, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and since I am not familiar with either one, I thought it would be advisory to go for python 3.x.
I strongly agree with that unless you have reason to use 2.7. Python 3.3 (.0a1 is nearly out) has an improved Unicode implementation, among other things.
> The biggest problem that I am facing is, that I am often dealing with data, that is basically text, but it can contain 8-bit bytes. In this case, I can not safely assume any given encoding, but I actually also don't need to know - for my purposes, it would be perfectly good enough to deal with the ascii portions and keep anything else unchanged.
You are assuming, or must assume, that the text is in an ascii-compatible encoding, meaning that bytes 0-127 really represent ascii chars. Otherwise, you cannot reliably interpret anything, let alone change it.
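To make "ascii-compatible" concrete, here is a small sketch. The sample string and the particular encodings are only illustrations I picked, not anything from your data: it checks which codecs encode pure-ascii text to the same bytes that ascii does.

```python
# Illustrative only: the sample text and the codec names are assumptions
# chosen for the demonstration.
sample = "key=value"
ascii_bytes = sample.encode("ascii")

# Encodings in which bytes 0-127 mean the same thing as in ascii.
compatible = [enc for enc in ("latin-1", "utf-8", "cp1252")
              if sample.encode(enc) == ascii_bytes]

# Encodings in which they do not: UTF-16 interleaves zero bytes,
# and cp037 (an EBCDIC code page) maps letters to different byte values.
incompatible = [enc for enc in ("utf-16-le", "cp037")
                if sample.encode(enc) != ascii_bytes]

print(compatible)
print(incompatible)
```

If your data could be in something from the second group, none of the approaches below is safe.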
This problem of knowing that much but not the specific encoding is unfortunately common. It has been discussed among core developers and others over the last few months. Different people prefer one of the following approaches.
1. Keep the bytes as bytes and use bytes literals and bytes functions as needed. The danger, as you noticed, is forgetting the 'b' prefix.
2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1' chars. When done, encode back to 'latin-1' and the non-ascii chars will be as they originally were. The danger is forgetting the pretense, and perhaps passing on the string (as a string, not bytes) to other modules that will not know the pretense.
3. Decode using encoding='ascii', errors='surrogateescape'. This reversibly maps each unknown non-ascii byte to an 'illegal' non-char: a lone low surrogate in the range U+DC80 to U+DCFF (the second-half code units of surrogate pairs). This is probably the safest in that invalid operations on the non-chars, such as encoding them with a strict codec, should raise an exception. Re-encoding with the same settings will reproduce the original high-bit bytes. The main danger is passing the illegal strings out of your local sandbox.
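The three approaches can be sketched side by side as follows. This is only a minimal illustration: the sample bytes and the ascii-only edit (replacing "name" with "NAME") are made up for the example.

```python
# Hypothetical input: ascii text plus two high-bit bytes of unknown encoding.
raw = b"name=caf\xe9 \xff\n"

# 1. Stay in bytes. Every literal needs the b prefix.
as_bytes = raw.replace(b"name", b"NAME")

# 2. Pretend the data is latin-1; every byte maps to exactly one char.
latin1_text = raw.decode("latin-1")
via_latin1 = latin1_text.replace("name", "NAME").encode("latin-1")

# 3. Decode as ascii, escaping each non-ascii byte as a lone surrogate.
escaped_text = raw.decode("ascii", errors="surrogateescape")
via_surrogates = escaped_text.replace("name", "NAME").encode(
    "ascii", errors="surrogateescape")

# All three leave the non-ascii bytes exactly as they were.
assert as_bytes == via_latin1 == via_surrogates == b"NAME=caf\xe9 \xff\n"

# With approach 3, a strict encode of the escaped string fails loudly
# instead of silently writing out wrong data.
try:
    escaped_text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

Note the difference in failure modes: the latin-1 string from approach 2 encodes to UTF-8 without complaint (producing different bytes than the input had), whereas the surrogate-escaped string from approach 3 refuses.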
-- Terry Jan Reedy