On Sun, Jul 15, 2018 at 9:17 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>> - to have a language where text strings are a second-class
>>   data type, not available in the language itself, only in
>>   the libraries;
>
> Unicode code point strings *ought* to be a second-class data type. They
> were a valiant idea but in the end turned out to be a mistake.

So let's see. Suppose I go to a web site that asks me to type in a title; I enter something and hit "Save". That title goes through JavaScript, gets sent to the back-end API via AJAX and JSON, received by a Python web app, and saved into a database. Later, it gets retrieved from that database and displayed to me on the same web page, where I click on it, and it gets put into an input field. I then submit it using standard form fill-out (no JS), it is received by the web app, and then gets sent to the Twitch.tv API to become my stream title. I look at my stream, and the title is the exact string that I entered originally.

During this time, I consider that string to be text. Always text. But if Unicode strings are second-class data, then that title changed from being text (in the input box) to UTF-16 (in JS) to UTF-8 (in JSON) to UTF-32 (in Python) to UTF-8 (in the database) to UTF-32 (retrieved into Python later) to ASCII with "\uXXXX" escapes (being sent to the web page) to text (in the input box). Then it gets converted to URL-encoded UTF-8 (form submission), then UTF-8, and UTF-32 (retrieval in Python), then UTF-8 (Twitch API), and finally back to text (displayed on the screen).

Remind me how it's such a mistake to treat that string as text all the way through?

>> - to have a language where text characters are *literally*
>>   32-bit integers ("rune" is an alias to int32);
>>
>>   (you can multiply a linefeed by a grave accent and get pi)
>
> Again, that has barely anything to do with the topic at hand. I don't
> think there's any unproblematic way to capture a true text character,
> period. Python3 certainly hasn't been able to capture it.

Python's Unicode type is an accurate representation of a Unicode text string, just as Python's float type is an accurate representation of IEEE 754 floating-point. Just as floats are not reals, so too is Unicode not perfectly able to represent all human text, and has to mess around with things like combining characters.
It's not 100% perfect (https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ point #11), but it's about as close as you'll ever get inside a computer.

>>> That's the ten-billion-dollar question, isn't it?!
>>
>> No. The real ten billion dollar question is how people in 2018 can
>> stick their head in the sand and take seriously the position that
>> Latin-1 (let alone ASCII) is enough for text strings.
>
> Here's the deal: text strings are irrelevant for most modern programming
> needs. Most software is middleware between the human and the terminal
> device. Carrying opaque octet strings from end to end is often the most
> correct and least problematic thing to do.

Uhh, so the human uses byte/octet strings? You can argue that the terminal device is fundamentally byte-oriented, but if you do, I'm going to dispute the use of the definite article, and say that *many* terminal devices are byte-oriented as of today. There's no fundamental reason for that to remain the case, and even today, we have fundamentally text-oriented terminal devices. I know this because I maintain one (okay, it's called a "MUD client" rather than a "terminal device", but it's basically the same thing).

> On the other hand, Python3's code point strings mess things up for no
> added value. You still can't upcase or downcase strings.

Not entirely sure what the .upper() and .lower() methods do, then. Case conversion of arbitrary text strings is hard, but Python definitely gives you as good as you'll ever get without actually stipulating, not just the language, but the context.

> You still can't sort strings.

Strings are intrinsically totally ordered in a mostly-sane way. If you want anything more than that, you have to stipulate the language. Python offers this in the 'locale' module, with strcoll and strxfrm.

> You still can't perform random access on strings.

Say what?

> You still don't know how long your string is.

How long is a piece of string?

1) Do you count code points? len(x)
2) Do you count code units? len(x.encode("..."))
3) Do you count base characters, ignoring combining characters?
4) Do you count pixels of display width?
5) Do you count advancement (like pixels, but negative for RTL text)?

Two of them are easy. Two require font metrics (so they're the job of a display engine). Only #3 is moderately hard, and you could do that with a one-liner by checking the Unicode categories. But it isn't very useful except to "prove" that Python sucks.

> You still don't know where you can break a string safely.

Impossible without language-based and font-based information. For instance, in the string "python", you cannot break the string between the "t" and the "h", because they are parts of one phonogram. Splitting the string "اطلقي سرك" anywhere other than at the space will result in the two halves displaying differently from the combined whole, because of the way Arabic text is written. Python lets you split the string between any two code points, a massive step up from exposing UTF-8 or UTF-16 code units, so that's about as safe as it gets.

> You still don't know how to normalize a string.

You mean unicodedata.normalize? Yeah, you're right, I don't know how to do it. I can never remember whether it's normalize(str, "NFC") or normalize("NFC", str).

> You still don't know if two strings are equal or not.

Do an NFD or NFKD normalization on both strings, then compare.

> You still don't know how to concatenate strings.

Uhh.... s1 + s2?

I'm fairly sure you have no clue about Unicode or Python, but I'll give you the benefit of the doubt and assume you're merely trolling.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
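
A minimal sketch of the length, normalization, and equality points above, using only the standard library. The sample string and the helper names count_base_chars and equivalent are illustrative assumptions, not anything from the thread, and counting non-mark code points is only a rough stand-in for "base characters".

```python
import unicodedata

# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
s = "e\u0301a"

# 1) Code points.
code_points = len(s)

# 2) Code units, which depend on the encoding chosen.
utf8_units = len(s.encode("utf-8"))             # one unit per byte
utf16_units = len(s.encode("utf-16-le")) // 2   # two bytes per unit, no BOM


# 3) Base characters: skip code points whose general category is a mark
#    (Mn, Mc, Me). This is an approximation, not full grapheme segmentation.
def count_base_chars(text):
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))


# The normalization form comes first, the string second.
composed = unicodedata.normalize("NFC", s)


# Equality up to canonical equivalence: normalize both sides, then compare.
def equivalent(a, b):
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(code_points, utf8_units, utf16_units, count_base_chars(s))
    print(s == composed, equivalent(s, composed))
```

Swapping "NFD" for "NFKD" in equivalent would additionally treat compatibility variants (ligatures, full-width forms and the like) as equal, which may or may not be what a given application wants.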