[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

Steven D'Aprano Sat, 26 Oct 2019 09:43:48 -0700

On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:

> Nothing good can come of decomposing strings into Unicode code points.


Sure it can. In Python, it's the fastest way to calculate the digit 
sum of an integer. It's also useful for implementing classical 
encryption algorithms, like Playfair.
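
A minimal sketch of the digit-sum idiom, next to an arithmetic version for comparison. The function names are made up for illustration, and the speed claim is worth checking with timeit on your own interpreter rather than taking on faith.

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of n by iterating over the characters of str(n)."""
    # abs() drops the sign so that '-' never reaches int().
    return sum(int(c) for c in str(abs(n)))


def digit_sum_arithmetic(n: int) -> int:
    """Same result using repeated divmod, with no string involved."""
    n = abs(n)
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit
    return total
```

Both functions return the same value for any integer; only the first relies on a string being iterable one code point at a time.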

There's introspection too: if I want to know whether a string contains 
any surrogates, I can do this:

    any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral 
characters", in which case it isn't safe to pass to a JavaScript or 
Tcl script which doesn't handle them correctly:

    any(c > '\uFFFF' for c in s)
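
The two checks wrap up naturally as small helpers. This is only a sketch: the function names are hypothetical, and utf16_length is an extra added here to show how many UTF-16 code units a string would occupy on the JavaScript side.

```python
def has_surrogates(s: str) -> bool:
    """True if s contains any code point in the surrogate range U+D800..U+DFFF."""
    return any('\ud800' <= c <= '\udfff' for c in s)


def has_astral(s: str) -> bool:
    """True if s contains any code point outside the Basic Multilingual Plane."""
    return any(c > '\uffff' for c in s)


def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed for s.

    Raises UnicodeEncodeError if s contains lone surrogates, so check
    has_surrogates() first when the input is untrusted.
    """
    return len(s.encode('utf-16-le')) // 2
```

For a string with no astral characters, utf16_length(s) equals len(s); each astral character adds one extra code unit.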

How about education? One of the things I can do with strings is:

    import unicodedata
    for c in string:
        print(unicodedata.name(c))

or possibly even just 

    # what is that weird symbol in position five?
    print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and 
find out what it means. Knowing stuff is good, right?
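
One wrinkle: unicodedata.name() raises ValueError for code points that have no name, such as control characters, unless you pass a default. A slightly more defensive sketch follows; the describe helper and the "<unnamed>" placeholder are made up for illustration.

```python
import unicodedata


def describe(s: str) -> list:
    """Return (index, 'U+XXXX', name) for every code point in s."""
    return [
        # The second argument to name() is returned instead of raising
        # ValueError when the code point has no name.
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
        for i, c in enumerate(s)
    ]
```

Looping over describe(string) and printing each tuple gives the position, the code point and its name on one line.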

Or do you think the world would be better off if it was really hard 
and "ugly" (your word) for people like me to find out what code points 
are called and what their meaning is?


Rather than just telling us that we shouldn't be allowed to access code 
points in strings, would you please be explicit about *why* this access 
is a bad thing?

And if code points are "bad", then what should we be allowed to do with 
strings? If code points are too low level, then what is an appropriate 
level?

I guess you're probably going to mention grapheme clusters. (If you 
aren't, then I have no idea what your objection is based on.)


Grapheme clusters are a hard problem to solve, since they are dependent 
on the language and the locale. There's a Unicode algorithm for 
splitting on graphemes, but it ignores the locale differences.

Processing on graphemes is more expensive than on code points. There is, 
as far as I can tell, no O(1) access to graphemes in a string without 
pre-processing them and keeping a list of their indices.
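
To make the cost concrete, here is a sketch of the "pre-process and keep a list of indices" approach. Big assumption: a cluster is approximated as one code point followed by any code points in a Mark category (Mn, Mc, Me). That is NOT the Unicode segmentation algorithm; it ignores ZWJ sequences, Hangul jamo, regional indicators and more. The helper names are hypothetical.

```python
import unicodedata


def cluster_starts(s: str) -> list:
    """One O(n) pass: indices where an (approximate) cluster begins."""
    starts = []
    for i, c in enumerate(s):
        # A Mark (category M*) extends the previous cluster; anything
        # else, or the very first code point, starts a new one.
        if i == 0 or not unicodedata.category(c).startswith("M"):
            starts.append(i)
    return starts


def cluster_at(s: str, starts: list, k: int) -> str:
    """Return the k-th cluster; finding its bounds is two list lookups."""
    end = starts[k + 1] if k + 1 < len(starts) else len(s)
    return s[starts[k]:end]
```

The index list costs one extra integer per cluster, which is exactly the memory overhead at issue for text where nearly every cluster is a single code point.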

For many people, and for many purposes, paying that extra cost in either 
time or memory is just a total waste, since they're hardly ever going to 
come across a grapheme cluster. Few people have to process completely 
arbitrary strings: their data tends to come from a particular subset of 
natural language strings, and for some such languages, you might go a 
whole lifetime without coming across a grapheme cluster of more than one 
code point.

(This may be slowly changing, even for American English, driven in part 
by the use of emoji and variation selectors.)
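
A small illustration of where multi-code-point clusters show up. The sample strings are chosen here for the example: an accented letter in composed and decomposed form, and a heart followed by a variation selector.

```python
import unicodedata

composed = "\u00e9"                                  # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' plus U+0301
heart = "\u2764\ufe0f"                               # heart plus variation selector-16


def code_points(s: str) -> list:
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]
```

Here len(composed) is 1 while len(decomposed) and len(heart) are both 2, even though each of the three displays as a single character.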

If Python came with a grapheme processing API, I would probably use it. 
But in the meantime, the code point API is "good enough" for most things 
I do with strings. And for the rest, graphemes are too low-level: I need 
things like sentences, clauses, words, word stems, prefixes and 
suffixes, syllables etc.

But even if Python had an excellent, fast grapheme API, I would still 
want a nice, clean, fast interface that operates on code-points.


-- 
Steven