On Mon, 13 Jun 2016 11:36 pm, Random832 wrote:

> On Mon, Jun 13, 2016, at 06:35, Steven D'Aprano wrote:
>> But this is a Python forum, and Python 3 is a language that tries
>> very, very hard to keep a clean separation between bytes and text,
>
> Yes, but that doesn't mean that you're right

As you already know, but others might not, I asked on the Python-Dev
list why b64encode has the behaviour it has:

https://mail.python.org/pipermail/python-dev/2016-June/145166.html

**Even if** your interpretation of RFC-989 etc. is correct, Python is
not bound to follow their interpretation. The RFC is a network protocol,
Python is a programming language, and our libraries can do whatever
makes sense for *programming*. And the people who migrated the Python 2
base64 lib to Python 3 thought that it made more sense to have the
functions operate on bytes and return bytes.

Other languages have made other choices:

Microsoft's base64 library in C#, C++, F# and VB takes an array of bytes
as input, and outputs a UTF-16 string:

https://msdn.microsoft.com/en-us/library/dhx0d524%28v=vs.110%29.aspx

Java's base64 encoder takes and returns bytes:

https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Encoder.html

JavaScript's Base64 encoder takes input as UTF-16 encoded text and
returns the same:

https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding

RFC 989 says that their unnamed "Encode to Printable Form" uses
implementation-independent characters:

    The bits resulting from the encryption operation are encoded
    into characters which are universally representable at all
    sites, though not necessarily with the same bit patterns (e.g.,
    although the character "E" is represented in an ASCII-based
    system as hexadecimal 45 and as hexadecimal C5 in an EBCDIC-based
    system, the local significance of the two representations is
    equivalent).

https://tools.ietf.org/html/rfc989

But I'm not sure how RFC 989 intends this to work in practice. If you
encrypt and encode a message on an EBCDIC machine, and the output
contains an "E" (i.e. byte 0xC5), and you transmit it unchanged to an
ASCII machine where you try to decode it, that byte will be interpreted
as an eight-bit non-ASCII character, *not* as "E".

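To make the mismatch concrete, here is a minimal Python 3 sketch. It
assumes cp037 as a stand-in for "an EBCDIC machine" (the RFC doesn't
name a code page), and the one-byte payload is arbitrary, chosen only
because its Base64 form starts with "E":

```python
import base64

# Python 3's b64encode: bytes in, bytes out. On an ASCII-based
# system the Base64 character "E" is the single byte 0x45.
encoded = base64.b64encode(b"\x10")
assert isinstance(encoded, bytes)
assert encoded[:1] == "E".encode("ascii") == b"\x45"

# The same character "E" as an EBCDIC byte (cp037 is an assumption).
ebcdic_e = "E".encode("cp037")
assert ebcdic_e == b"\xc5"

# An ASCII receiver cannot read that byte as "E" at all...
try:
    ebcdic_e.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ...and a Latin-1 receiver silently gets a different character.
as_latin1 = ebcdic_e.decode("latin-1")
```

So the byte that means "E" to the sender is either an error or some
other character to the receiver, depending on how the receiver decodes.
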
In order for this to work, you need an additional step that translates
byte 0xC5 (EBCDIC "E") into byte 0x45 (ASCII "E"), otherwise you get
junk. That's okay for email, since email is sent in US-ASCII[1], so any
EBCDIC machine wanting to send email must convert the headers and bodies
into US-ASCII, including any Base64 attachments. But the relevance of
this to Python is pretty low.

> At
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html

Python's base64 module is not a re-implementation of the POSIX utility
uuencode. The uuencode utility is an application, not a library. It has
its own reasons for writing text files encoded using the local
environment's default encoding, and it explicitly states that when
moving such files to another system, they must be translated:

[quote]
If it was transmitted over a mail system or sent to a machine with a
different codeset, it is assumed that, as for every other text file,
some translation mechanism would convert it (by the time it reached a
user on the other system) into an appropriate codeset.
[end quote]

In any case, the POSIX utility uuencode is free to implement whatever
high-level behaviour its authors like, just as programming language
designers are free to design their Base64 libraries to work how they
like.

[1] With a few exceptions, such as binary attachments, although not all
mail servers can deal with them.

-- 
Steven