On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Unicode was supposed to get us out of the 8-bit locale hole. Now it
> seems the Unicode hole is far deeper and we haven't reached the bottom
> of it yet. I wonder if the hole even has a bottom.
>
> We now have:
>
>  - an encoding: a sequence of bytes
>
>  - a string: a sequence of integers (code points)
>
>  - "a snippet of text": a sequence of characters

Before Unicode, we had exactly the same thing, only with more encodings.

> Assuming "a sequence of characters" is the final word, and Python wants
> to be involved in that business, one must question the usefulness of
> strings, which are neither here nor there.
>
> When people use Unicode, they are expecting to be able to deal in real
> characters. I would expect:
>
>    len(text)               to give me the length in characters
>    text[-1]                to evaluate to the last character
>    re.match("a.c", text)   to match a character between a and c
>
> So the question is, should we have a third type for text. Or should the
> semantics of strings be changed to be based on characters?

What is the length of a string? How often do you actually care about
the number of grapheme clusters - and not, for example, about the
pixel width? (To columnate text, for instance, you need to know about
its width in pixels or millimeters, not the number of characters in
the line.) And if you're going to group code points together because
some of them are combining characters, would you also group them
together because there's a zero-width joiner in the middle? The answer
will sometimes be "yes of course" and sometimes "of course not". These
kinds of linguistic considerations shouldn't be codified into the core
of the language.

IMO the Python str type is adequate as a core data type. What we may
need, though, is additional utility functions, eg:

* unicodedata.grapheme_clusters(str) - split str into a sequence of
  grapheme clusters
* pango.get_text_extents(str) - measure the pixel dimensions of a line
  of text
* platform.punish_user() - issue a platform-dependent response (such as
  an electric shock, a whack with a 2x4, or a dropped anvil) on someone
  who has just misunderstood Unicode again
* socket.punish_user() - as above, but to the user at the opposite end
  of a socket

ChrisA
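
For anyone who wants to poke at the three levels from the quoted list
(bytes, code points, "characters"), here is a rough sketch. The helper
naive_clusters() is a made-up name standing in for the suggested
unicodedata.grapheme_clusters(); it only glues combining marks onto the
preceding code point, and deliberately does nothing about zero-width
joiners or the other segmentation rules, which is exactly the "sometimes
yes, sometimes no" part.

```python
import unicodedata


def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper for illustration only: it ignores zero-width
    joiners, Hangul jamo, regional indicators and everything else that
    real grapheme segmentation has to deal with.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


text = "cafe\u0301"  # "café" spelled with a separate combining acute accent
encoded = text.encode("utf-8")

print(len(encoded))               # length of the encoding, in bytes
print(len(text))                  # length of the str, in code points
print(len(naive_clusters(text)))  # length in (naively grouped) characters
print(ascii(text[-1]))            # the bare combining accent, not "é"
```

The three lengths come out different for the same text, and text[-1]
is the lone accent rather than the accented letter. Whether the last
number is the "right" one still depends on what you want it for.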