On Mon, 13 Jun 2016 01:20 pm, Random832 wrote:

> On Sun, Jun 12, 2016, at 22:22, Steven D'Aprano wrote:
>> That's because base64 is a bytes-to-bytes transformation. It has
>> nothing to do with unicode encodings.
>
> Nonsense. base64 is a binary-to-text encoding scheme. The output range
> is specifically chosen to be safe to transmit in text protocols.

"Safe to transmit in text protocols" surely should mean "any Unicode code point", since all of Unicode is text. What's so special about the base64 ones? Well, that depends on your context. For somebody who cares about sending bits over a physical wire, their idea of "text" is not Unicode, but a subset of ASCII *bytes*. The end result is that after you've base64ed your "binary" data, to get "text" data, what are you going to do with is? Treat it as Unicode code points? Probably not. Squirt it down a wire as bytes? Almost certainly. Looking at this from the high-level perspective of Python, that makes it conceptually bytes not text. Yes, I know that there's a terminology clash between communication engineers and the programmers who work in their world, and the rest of us. We use "text" to mean Unicode[1], they use "text" to mean "roughly 100 of the 128 bytes with the high-bit cleared, interpreted as ASCII". But those folks are unlikely to be asking why base64 encoding a bunch of bytes returns bytes. They *want* it to return bytes, because that's what they're going to squirt down the wire. If you gave them Unicode, encoded using (say) UTF-16 or UTF-32, they're likely to say "WTF are you giving me this binary data for? Look at all these NUL bytes, what am I supposed to do with them?!?!". (If they could cope with arbitrary bytes, they wouldn't have base64 encoded it.) And if you gave them UTF-8, well, how would anyone know? With base64 encoded data, it's all a subset of ASCII. Python defines a nice clean separation between text (Unicode) and binary data (bytes). Under that model, base64 is a transformation between unrestricted bytes 0...255 to a restricted subset of bytes that matches some ASCII encoded text. It shouldn't return a Unicode string, because that's an abstract text format and we can't make any assumptions about the implementation. 
Say you base64 encode some binary data:

py> base64.b64encode(b'\x01A\x11\x16')
b'AUERFg=='

Suppose instead it returned the Unicode string 'AUERFg=='. That's all well and good, but what are you going to do with it? You can't transmit it over a serial cable, because that almost surely is going to expect bytes, so you have to encode it. You can't embed it in an email, because that also expects bytes.

You could write it to a file. If the file is opened in binary mode, you have to encode the Unicode string to bytes before you can write it. If the file is opened in text mode, Python will accept your Unicode string and encode it for you, which could introduce non-base64 characters into the file. Consider if the file was opened using UTF-16:

\x00A\x00U\x00E\x00R\x00F\x00g\x00=\x00=

hardly counts as base64 in any meaningful sense.

So while I completely accept your comment about "text protocols" in the context of the networking world, we're not in the networking world. We're in the high-level programming language world of Python, where text does not mean a subset of ASCII bytes, it means Unicode. And in *our* world, having base64 return text is a mistake.


[1] Or at least we should, since the idea that only American English[2] counts as text cannot possibly survive in the 21st Century when we're connected to the entire world of different languages. Although I'd allow TRON as well, if you can actually find any TRON users outside of Japan.[3]

[2] And only a subset of American English at that.

[3] Or inside Japan for that matter.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list