On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:

> On 1 June 2014 12:26, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>
>> "with cross-platform behavior preferred over system-dependent one" --
>> It's not clear how cross-platform behaviour has anything to do with the
>> Internet age. Python has preferred cross-platform behaviour forever,
>> except for those features and modules which are explicitly intended to
>> be interfaces to system-dependent features. (E.g. a lot of functions in
>> the os module are thin wrappers around OS features. Hence the name of
>> the module.)
>
> There is the behaviour of defaulting input and output to the system
> encoding.
That's a tricky one, but I think on balance that is a case where defaulting
to the system encoding is the right thing to do. Input and output occur on
the local system you are running on, which by definition isn't
cross-platform. (Non-local I/O is possible, but requires work -- it doesn't
just happen.)

> I personally think we would all be better off if Python (and
> Java, and many other languages) defaulted to UTF-8. This hopefully would
> eventually have the effect of producers changing to output UTF-8 by
> default, and consumers learning to manually specify an encoding when
> it's not UTF-8 (due to invalid codepoints).

UTF-8 everywhere should be our ultimate aim. Then we can forget about
legacy encodings except when digging out ancient documents from archived
floppy disks :-)

> I'm currently working on a product that interacts with lots of other
> products. These other products can be using any encoding - but most of
> the functions that interact with I/O assume the system default encoding
> of the machine that is collecting the data. The product has been in
> production for nearly a decade, so there's a lot of pushback against
> changes deep in the code for fear that it will break working systems.
> The fact that they are working largely by accident appears to escape
> them ...
>
> FWIW, changing to use iso-latin-1 by default would be the most sensible
> option (effectively treating everything as bytes), with the option for
> another encoding if/when more information is known (e.g. there's often a
> call to return the encoding, and the output of that call is guaranteed
> to be ASCII).

Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
mojibake, while Python 3 gets it right:

[steve@ando ~]$ python2.7 -c "print u'δжç'"
Î´Ð¶Ã§
[steve@ando ~]$ python3.3 -c "print(u'δжç')"
δжç

Latin-1 is one of those legacy encodings which needs to die, not to be
entrenched as the default.
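The failure mode in that transcript can be reproduced directly in Python 3.
Here's a minimal sketch (using the same sample string "δжç" as above):
encoding text as UTF-8 and then decoding the resulting bytes as Latin-1
produces mojibake, and because Latin-1 maps every one of the 256 byte values
to a character, the mistake is at least reversible:

```python
# Encode text as UTF-8, then decode the bytes with the wrong codec (Latin-1).
text = 'δжç'
utf8_bytes = text.encode('utf-8')        # b'\xce\xb4\xd0\xb6\xc3\xa7'
garbled = utf8_bytes.decode('latin-1')   # each byte becomes one Latin-1 char
print(garbled)                           # Î´Ð¶Ã§ -- classic mojibake

# Because Latin-1 is a 1:1 byte-to-character mapping, the original text
# can be recovered by reversing the mistake.
restored = garbled.encode('latin-1').decode('utf-8')
assert restored == text
```

This is also why "treat everything as Latin-1" schemes appear to work for so
long: no byte sequence ever raises a decode error, so the corruption goes
unnoticed until someone actually reads the text.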
My terminal uses UTF-8 by default (as it should), and if I use the terminal
to input "δжç", Python ought to see what I input, not Latin-1 mojibake.

If I were to use Windows with a legacy code page, then I couldn't even
enter "δжç" on the command line, since none of the legacy encodings
supports that set of characters at the same time. I don't know exactly what
I would get if I tried (say, by copying and pasting text from a
Unicode-aware application), but I'd see that it was weird *in the shell*
before it even reached Python.

On the other hand, if I were to input something supported by the legacy
encoding -- let's say I entered "αβγ" while using ISO-8859-7 (Greek) --
then Python ought to see "αβγ" and not mojibake:

py> b = "αβγ".encode('iso-8859-7')  # what the shell generates
py> b.decode('latin-1')             # what Python interprets those bytes as
'áâã'

Defaulting to the system encoding means that Python input and output just
work, to the degree that input and output on your system just work. If your
system is crippled by the use of a legacy encoding, then Python will at
least be *no worse* than your system.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
--
https://mail.python.org/mailman/listinfo/python-list