On 2 June 2014 11:14, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
>
> > I'm currently working on a product that interacts with lots of other
> > products. These other products can be using any encoding - but most of
> > the functions that interact with I/O assume the system default encoding
> > of the machine that is collecting the data. The product has been in
> > production for nearly a decade, so there's a lot of pushback against
> > changes deep in the code for fear that it will break working systems.
> > The fact that they are working largely by accident appears to escape
> > them ...
> >
> > FWIW, changing to use iso-latin-1 by default would be the most sensible
> > option (effectively treating everything as bytes), with the option for
> > another encoding if/when more information is known (e.g. there's often a
> > call to return the encoding, and the output of that call is guaranteed
> > to be ASCII).
>
> Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
> moji-bake, while Python 3 gets it right:
>

The purpose of my example was to show a case where no thought was put into
encodings - the assumption was that the system encoding and the remote
system encoding would be the same. This is most definitely not the case a
lot of the time.

I also should have been more clear that *in the particular situation I was
talking about* iso-latin-1 as default would be the right thing to do, not in
the general case. Quite often we won't know the correct encoding until we've
executed a command via ssh - iso-latin-1 will allow us to extract the info
we need (which will generally be 7-bit ASCII) without the possibility of an
invalid encoding. Sure we may get mojibake, but that's better than the
alternative when we don't yet know the correct encoding.

> Latin-1 is one of those legacy encodings which needs to die, not to be
> entrenched as the default. My terminal uses UTF-8 by default (as it
> should), and if I use the terminal to input "δжç", Python ought to see
> what I input, not Latin-1 moji-bake.
>

For some purposes, there needs to be a way to treat an arbitrary stream of
bytes as an arbitrary stream of 8-bit characters. iso-latin-1 is a
convenient way to do that. It's not the only way, but settling on it and
being consistent is better than not having a way.

Tim Delaney
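
P.S. To illustrate the point about iso-latin-1 above, here is a minimal Python 3 sketch. It is not code from the product in question: the sample bytes and the "encoding=" line format are made up for the example. It relies only on the fact that Latin-1 maps each byte value 0-255 to the code point of the same number, so decoding can never fail and re-encoding returns the original bytes.

```python
# Hypothetical raw output captured from a remote command: some non-ASCII
# text in an encoding we don't know yet, plus an ASCII line naming it.
raw = "δжç\nencoding=UTF-8\n".encode("utf-8")

# Decoding as Latin-1 cannot raise, whatever the bytes are, and it is
# lossless: encoding back to Latin-1 reproduces the original bytes.
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw

# The non-ASCII part is mojibake at this point, but the 7-bit ASCII
# part is readable, so we can pull out the declared encoding.
declared = None
for line in text.split("\n"):
    if line.startswith("encoding="):
        declared = line.split("=", 1)[1]

# Now that the real encoding is known, recover the bytes and decode
# them properly.
decoded = text.encode("latin-1").decode(declared)
print(decoded)
```

The intermediate str should be treated as an opaque carrier for bytes, not as displayable text; only the final decode gives real characters.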