On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, at the beginning of my quest, I ran into a decoding
exception when trying to read data as "latin1" (which was more or less
what I had expected anyway, because byte values between 128 and 160 are
not defined there).

Obviously, I must have misinterpreted something there; I just ran a
little test:

    # all 256 possible byte values
    l=[i for i in range(256)]; b=bytes(l)
    # decode, encode and decode again as latin1
    s=b.decode('latin1'); b=s.encode('latin1'); s=b.decode('latin1')
    # print the code points, 16 per row
    for c in s:
        print(hex(ord(c)), end=' ')
        if (ord(c)+1) % 16 ==0: print("")
    print()

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into then were pretty real, though ;-)

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.

Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in the
documentation are not really helpful, because the non-decodable bytes
will be lost. With some trying, I got it to work, too (the option is
named "surrogateescape", without the "_", and in Python 3.1 it exists,
but not as a keyword argument:
"s=b.decode('utf-8','surrogateescape')" ...)

Thank you very much for your constructive advice!

Regards,
Peter
--
http://mail.python.org/mailman/listinfo/python-list
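
For reference, a minimal sketch of the "surrogateescape" round trip discussed in this message. The sample bytes and all variable names are made up for illustration and are not taken from the original data; the error handler is passed positionally, as in the call quoted above.

```python
# Hypothetical sample: ASCII text containing two bytes that are not valid UTF-8.
raw = b"name=\xe4\xf6;size=42"

# Error handler passed positionally, as in the message above.
text = raw.decode("utf-8", "surrogateescape")

# Each undecodable byte 0xNN is represented by the lone surrogate U+DC00 + 0xNN.
escaped = [hex(ord(ch)) for ch in text if "\udc80" <= ch <= "\udcff"]

# An edit that touches only the ASCII part leaves the escaped bytes alone.
edited = text.replace("size=42", "size=43")

# Encoding with the same handler turns the surrogates back into the original bytes.
restored = edited.encode("utf-8", "surrogateescape")

# Without the handler, a strict encode rejects the lone surrogates.
try:
    edited.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(escaped)
print(restored)
print(strict_ok)
```

The strict encode failing at the end is the "main danger" mentioned in the quoted advice: a string carrying these surrogates raises an error as soon as it is encoded without the same handler.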