On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously),
> and since I am not familiar with either one, I thought it would be
> advisable to go for python 3.x. The biggest problem that I am facing
> is that I am often dealing with data that is basically text, but it
> can contain 8-bit bytes.
All bytes are 8-bit, at least on modern hardware. I think you have to
go back to the 1960s or '70s to find 12-bit machines like the PDP-8.

> In this case, I can not safely assume any given encoding, but I
> actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep
> anything else unchanged.

Well, you can't do that, because *by definition* encoding changes a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?

You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:

    data = text.encode('ascii', 'replace')
    data = text.encode('ascii', 'ignore')

When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail,
because every one of the 256 possible byte values maps directly to a
code point:

    text = data.decode('latin1')

although the non-ASCII characters that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you
don't care too much, this may be a simple approach. (There is a short
sketch of the round trip at the end of this post.)

But in a nutshell, it is physically impossible to map the more than a
million possible Unicode characters to just 256 possible byte values
without either throwing some characters away or performing an
encoding.

> As it seems, this would be far easier with python 2.x.

It only seems that way until you try.

--
Steven
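To make the round trip concrete, here is a minimal sketch of the
approach described above, assuming Python 3. The sample bytes and the
variable names are illustrative, not from the original post:

    # Bytes that are mostly ASCII plus one non-ASCII byte (0xE9).
    # The sample data is made up for demonstration purposes.
    raw = b'caf\xe9 au lait'

    # Decoding with Latin-1 cannot raise UnicodeDecodeError, because
    # every byte value 0-255 maps to the code point with that number.
    text = raw.decode('latin1')               # 'café au lait'

    # Encoding back to ASCII with each error handler:
    print(text.encode('ascii', 'replace'))    # b'caf? au lait'
    print(text.encode('ascii', 'ignore'))     # b'caf au lait'

    # The Latin-1 round trip is lossless:
    assert text.encode('latin1') == raw

The round trip through Latin-1 is lossless precisely because Latin-1's
256 characters are the first 256 Unicode code points, so decoding and
re-encoding map each byte straight back to itself.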