On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton <l...@openend.se> wrote:
> Ah, yes, you are right about that. I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough. CP-1252 is still a possibility, but the difference can be
dealt with later.

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well. We
> already understand the data -- its a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :) The
> notion that Python normally only understands the subset of the
> characters in your native language than English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk.

Now, you have a huge leg up on me, in that you actually recognize the
*words* in that piece of text. That means you can have MUCH greater
confidence in stating that it's Latin-1 than I can. But that's precisely
what I mean by "understand the data". If you, being a native French
speaker, pick up a file written in (say) Polish, and encoded Latin-2,
you'll recognize by the ASCII characters that it's not French text, and
probably you'd be able to spot that it ought to be Latin-2 rather than
Latin-1. That's understanding the data, that's having more information
than just the byte patterns. A computer can't reliably do that (just look
up the "Bush hid the facts" bug if you don't believe me), but a human
often can.

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know.

:) Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)

ChrisA
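
P.S. To make the "byte patterns aren't enough" point concrete, here is a
minimal Python 3 sketch. The helper name decode_candidates and the sample
word are purely illustrative assumptions of mine (nothing here comes from a
library beyond the standard bytes.decode). It takes one byte string and
tries several encodings; the eight-bit ones all "succeed", so only someone
who knows what the text ought to say can pick the right one.

```python
def decode_candidates(raw, encodings=("latin-1", "cp1252", "iso8859-2", "utf-8")):
    """Decode the same bytes under each encoding; None means it failed."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: a Polish place name saved as Latin-2 (ISO 8859-2).
polish = "łódź".encode("iso8859-2")
guesses = decode_candidates(polish)

for name, text in guesses.items():
    print(name, repr(text))

# Latin-1 and CP-1252 agree on these bytes but both give mojibake ('³ód¼');
# only Latin-2 gives the intended word, and UTF-8 rejects the bytes outright.

# Latin-1 and CP-1252 differ only in the 0x80-0x9F range:
euro_byte = b"\x80"
print(repr(euro_byte.decode("latin-1")))  # a C1 control character
print(repr(euro_byte.decode("cp1252")))   # the euro sign

# Saving as UTF-8 sidesteps the guessing: the round trip is unambiguous.
utf8_round_trip = "łódź".encode("utf-8").decode("utf-8")
```

Note that no exception is raised for the wrong eight-bit guesses; the
result is just silently wrong, which is exactly why assuming your
favourite encoding is risky.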