John Machin wrote:
> Terminology disambiguation: what I call "users" wouldn't know what
> 'cp1252' and 'iso-8859-1' were. They're not expected to know. They
> just type in whatever characters they can see on their keyboard or
> find in the charmap utility. It's what I'd call 'admins' and
> 'developers' who should know better, but often don't.

I was talking about 'users' of Python, so they are 'developers'. They
often don't know what cp1252 is.

> 1. When exchanging data across systems, should not utf-8 be
> preferred???

It depends on the data, of course. People writing UTF-8 into text
files often find that their editors don't display them correctly, in
which case UTF-8 might not be the best choice. For example, the Python
source code in CVS is required to be iso-8859-1, primarily because this
is what interoperates best across all development platforms.

For data in XHTML, the answer would be different: every XML processor
is supposed to support UTF-8.

> 2. If the Windows *users* have been using characters that are in
> cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
> will cause an exception.

Correct.

> I find it a bit hard to imagine that the euro sign wouldn't get a fair
> bit of usage in Swedish data processing even if it's not their own
> currency.

Yes, so the question is how to represent it. It all depends on the
application, but it is safer to assume only iso-8859-1 for the moment,
unless it is guaranteed that all code that reads the file really knows
what cp1252 is, and what \x80 means in that charset.

> 3. How portable is a character set that doesn't include the euro sign?

Well, how portable is ASCII? It doesn't support certain characters,
sure. If you don't need these characters, this is not a problem. If
you do need the extra characters, you need to think carefully about
which encoding best meets your needs. I was merely suggesting that
cp1252 is often used without that thought, causing mojibake later.

If representation of the euro sign is an issue, the choices are
iso-8859-15, cp1252, and UTF-8. Of those three, I would pick cp1252
last if at all possible, because it is specific to a vendor (i.e.
non-standard).

Regards,
Martin
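
P.S. To make the point about \x80 and the euro sign concrete, here is a
minimal sketch, assuming a current Python 3 interpreter. The names EURO,
try_encode, and results are made up for illustration; only the standard
str.encode and bytes.decode calls are used.

```python
# Illustrative sketch: how the euro sign fares in the encodings discussed.
EURO = "\u20ac"  # EURO SIGN


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# iso-8859-1 has no euro sign, so encoding fails there; the other three
# succeed, each with a different byte sequence.
results = {
    enc: try_encode(EURO, enc)
    for enc in ("iso-8859-1", "iso-8859-15", "cp1252", "utf-8")
}

# The same byte means different things depending on the assumed charset:
# cp1252 reads 0x80 as the euro sign, iso-8859-1 reads it as a C1 control.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("iso-8859-1")

for enc, encoded in results.items():
    print(enc, encoded)
print(repr(as_cp1252), repr(as_latin1))
```

A reader that assumes iso-8859-1 will not fail on a cp1252 file
containing \x80; it will silently produce the wrong character, which is
how the mojibake arises.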