On Sun, 15 Jul 2018 17:39:55 -0700, Jim Lee wrote:

> On 07/15/18 17:18, Steven D'Aprano wrote:
>> On Sun, 15 Jul 2018 16:08:15 -0700, Jim Lee wrote:
>>
>>> Python3 is intrinsically tied to Unicode for string handling.
>>> Therefore, the Python programmer is forced to deal with it (in all but
>>> trivial cases), rather than given a choice. So I don't understand how
>>> I can illustrate my point with Python code since Python won't let me
>>> deal with strings without also dealing with Unicode.
>> Nonsense.
>>
>> b"Look ma, a Python 2 style ASCII string."
>>
>>
> As I said, all but trivial cases.
>
> Do you consider separating Unicode strings from byte strings, having to
> decode and encode from one to the other,

If you use nothing but byte strings, you don't need to separate the
non-existent text strings from the byte strings, nor do you need to
decode or encode.

> and knowing which
> functions/methods accept one, the other, or both as arguments,

That's certainly a real complication, if I may stretch the meaning of
the word "complication" beyond breaking point. Surely you are already
having to read the documentation of the function to learn what arguments
it takes, and what types they are (int or float, list or iterator, 'r'
or 'a', etc). If someone can't deal with the question of "unicode or
bytes" as well, then perhaps they ought to consider a career change to
something less demanding, like politics.

If, as you insinuate, all your data is 100% ASCII, then you have nothing
to fear. Just treat

    str(bytes_obj, 'ASCII')
    bytes(str_obj, 'ASCII')

as the equivalent of a cast or coercion, and you won't go wrong. (Of
course, in 2018, the number of applications that can truly say all their
data is pure ASCII is vanishingly small.)

Or use Latin-1, if you want to do the most simple-minded thing that you
can to make errors go away, without caring about correctness.

But the thing is, that complexity is *inherent in the domain*. You can
try to deal with it without Unicode, and as soon as you have users
expecting to use more than one code page, you're doomed.

> as "not dealing with Unicode"? I don't.

Frankly, I do.

Dealing with all the vagaries of human text *is* complicated, that's the
nature of the beast. Dealing with the complexities of Unicode can be as
complex as dealing with the complexities of floating point arithmetic.
(But neither of those are even in the same ballpark as dealing with the
complexities of *not* using Unicode: legacy code pages and encodings are
a nightmare to deal with.)

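To make the "cast" idea from the pure-ASCII paragraph above concrete,
here is a minimal sketch, assuming Python 3. The variable names and the
sample "caf\u00e9" string are illustrative and not part of the original
exchange. It shows the ASCII round trip, what happens when the data is
not really ASCII, and why Latin-1 makes the error go away without making
the result correct.

```python
# Illustrative sample data; all names here are hypothetical.
raw = b"Look ma, a Python 2 style ASCII string."

# bytes -> str, treated like a cast (same as raw.decode('ascii')).
text = str(raw, 'ASCII')

# str -> bytes, the reverse cast (same as text.encode('ascii')).
round_tripped = bytes(text, 'ASCII')

# Data that is not really ASCII: UTF-8 bytes for "caf\u00e9".
not_ascii = "caf\u00e9".encode("utf-8")

# The ASCII "cast" refuses to guess and raises an error instead.
try:
    str(not_ascii, 'ASCII')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 maps every byte to a code point, so decoding never fails,
# but the UTF-8 bytes are misread: the text comes out as mojibake.
latin1_text = str(not_ascii, 'latin-1')
```

With pure-ASCII data the round trip is lossless. With anything else,
the ASCII codec fails loudly while Latin-1 quietly yields the wrong
text, which is the trade-off described above.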
Nevertheless, just as casual users can go a very, very long way just
treating floats as the real numbers we learn about in school, and trust
that IEEE-754 semantics will mean your answers are "close enough", so
the casual user can go a very long way ignoring the complexities of
Unicode, so long as they control their own data and know what it is.

If you don't know what your data is, then you're doomed, Unicode or no
Unicode.

(If you don't think that's a problem, if you think that "just treat text
as octets" works, then people like you are the reason there is so much
mojibake in the world, screwing it up for the rest of us.)

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it
everywhere." -- Jon Ronson

--
https://mail.python.org/mailman/listinfo/python-list