On 2018-07-15 15:52, Steven D'Aprano wrote:
> On Sun, 15 Jul 2018 14:17:51 +0300, Marko Rauhamaa wrote:
>
>> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>>
>>> On Sun, 15 Jul 2018 11:43:14 +0300, Marko Rauhamaa wrote:
>>>> Paul Rubin <no.email@nospam.invalid>:
>>>>> I don't think Go is the answer either, but it probably got strings
>>>>> right. What is the answer?
>>>
>>> Go strings aren't text strings. They're byte strings. When you say that
>>> Go got them right, that depends on your definition of success.
>>>
>>> If your definition of "success" is:
>>>
>>> - fail to be able to support 80% + of the world's languages
>>>   and a majority of the world's text;
>>
>> Of course byte strings can support at least as many languages as
>> Python3's code point strings and at least equally well.
>
> You cannot possibly be serious.
>
> There are 256 possible byte values. China alone has over 10,000 different
> characters. You can't represent 10,000+ characters using only 256
> distinct code points.
>
> You can't even represent the world's languages using 16-bit word-strings
> instead of byte strings.
I think you're tearing down a straw man here. (So is Marko.)

The byte-string-only argument is to use byte strings containing encoded text. This does always work. It's just very easy to make mistakes like double-encoding.

The "do what Python 3 does" argument is, as I see it, that it's better to deal with text independently of its encoding, and to convert explicitly to and from byte representations. I'm very much in favour, not particularly because it prevents errors (though it does), but because it saves me from having to manage irrelevant details like the encoding of the text in question.

Imagine if people made the same argument, "byte strings are better than a representation-independent type", about, say, integers.

Using byte strings instead of integers is great! You can roundtrip any integer and not care how it's encoded! You can print it to a terminal or a file or anything without having to pointlessly re-encode it! Okay, so things get a bit hairy if someone uses hex instead of the obviously-superior decimal, but nobody does that. And when they do, you can just bytes.decode('int-hex'). Just remember not to do it more than once, a famously easy problem in programming that has never bitten anyone ever, and you're golden. Look at all the problems this solves!

Now we can even parse a file format with integers in it and emit them again without having to know what encoding the integers are in, which doesn't actually save us from any encoding headaches because we need to figure out the encoding to work with those integers at all, but will make for good ammunition against those ridiculous integer zealots.

On a more serious note, I think this particular aspect of Python 3 causes quite a lot of difficulty for Python 2 programs that make heavy use of the bytes-text duality, and quite a lot of peace of mind for every other case.

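To make the double-encoding hazard and the integer analogy above concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and the "accidentally treated as Latin-1" step is just one assumed way the mistake can happen, not the only one.

```python
# Illustrative sketch only: the sample text and names are arbitrary.
text = "naïve café"

# Python 3 style: text is a str, and an encoding is chosen only at the boundary.
wire = text.encode("utf-8")
assert wire.decode("utf-8") == text

# Byte-strings-only style gone wrong: the already-encoded bytes are
# mistaken for Latin-1 text and encoded a second time.
double_encoded = wire.decode("latin-1").encode("utf-8")

# Decoding once no longer gives back the original text.
garbled = double_encoded.decode("utf-8")
print(garbled)

# The damage can be undone only by knowing exactly which mistake was made.
repaired = garbled.encode("latin-1").decode("utf-8")

# The integer analogy: the same number in two textual "encodings".
# You still need to know the base before you can compute with it.
as_decimal = int(b"255".decode("ascii"))
as_hex = int(b"ff".decode("ascii"), 16)
```

With a str in hand, none of the bookkeeping in the second half is needed until the text actually leaves the program.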
So, Marko, I don't know what code you work on, but I think it's unfair to attack Python 3's unicode handling too hard if you haven't written a new project with it.