On Mon, Jan 6, 2014 at 3:24 PM, Roy Smith <r...@panix.com> wrote:
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
>
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python". What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?
*deletes the two paragraphs that used to be here*

Turns out Python 3 _does_ have an EBCDIC decoder... but it's not
called EBCDIC.

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500")
'Python'

This sounds like a good one for getting an alias, either "ebcdic" or
"EBCDIC". I didn't know that this was possible till I googled the
problem and saw someone else's solution.

To print that out as UTF-8, just decode and then encode:

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500").encode("utf-8")
b'Python'

In the specific case of files on the disk, you could open them with
encodings specified, in which case you don't need to worry about the
details.

with open("data", encoding="cp500") as infile:
    with open("data_utf8", "w", encoding="utf-8") as outfile:
        outfile.write(infile.read())

Of course, this is assuming that Unicode has a perfect mapping for
every EBCDIC character. I'm not familiar enough with EBCDIC to be sure
that that's true, but I strongly suspect it is. And if it's not,
you'll get an exception somewhere along the way, so you'll know
something's gone wrong. (In theory, a "transcode" function might be
able to give you a warning before it even sees your data -
transcode("utf-8", "iso-8859-3") could alert you to the possibility
that not everything in the source character set can be encoded. But
that's a pretty esoteric requirement.)

> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii. How would I turn that into
> cleartext Unicode in Python 3?

That's one of the points that's under dispute. Is rot13 a bytes<->bytes
encoding, or is it str<->str, or is it bytes<->str? The issue isn't
clear. Personally, I think it makes good sense as a str<->str
translation, which would mean that the process would be somewhat thus:

>>> rot13 = {}
>>> for i in range(13):
        rot13[65+i] = 65+i+13
        rot13[65+i+13] = 65+i
        rot13[97+i] = 97+i+13
        rot13[97+i+13] = 97+i

>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?
>>> data.decode().translate(rot13)
'Python'

This is treating rot13 as a translation of Unicode codepoints to other
Unicode codepoints, which is different from an encode operation (which
takes abstract Unicode data and produces concrete bytes) or a decode
operation (which does the reverse). But this is definitely a grey area.
It's common for cryptographic algorithms to work with bytes, meaning
that their "decoded" text is still bytes. (Or even less than bytes. The
famous Enigma machines from World War II worked with the 26 letters as
their domain and range.) Should the Python codecs module restrict
itself to the job of translating between bytes and str, or is it a tidy
place to put those other translations as well?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
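
(A rough sketch of that hypothetical transcode() warning, covering only
the single-byte case. transcode_check is a made-up name, not anything
in the codecs module; it just asks whether every character the source
encoding can produce is representable in the target encoding.)

def transcode_check(source_enc, target_enc):
    """Return the characters of source_enc that target_enc can't encode."""
    gaps = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(source_enc)
        except UnicodeDecodeError:
            continue  # this byte isn't defined in the source encoding
        try:
            char.encode(target_enc)
        except UnicodeEncodeError:
            gaps.append(char)
    return gaps

print(transcode_check("cp500", "utf-8"))      # [] - UTF-8 covers everything cp500 can decode
print(transcode_check("iso8859-5", "ascii"))  # the Cyrillic letters, which ASCII can't encode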