Steven D'Aprano <steve+pyt...@pearwood.info> added the comment:
> I think it would be a mistake to make the stdlib use this for most
> notions of what a "character" is, as I said this notion is also
> inaccurate. Having an iterator library somewhere that you can use and
> compose is great, changing the internal workings of string operations
> would be a major change, and not entirely productive.

Agreed. I won't pretend to be able to predict what Python 5.0 will bring
*wink* but there's too much history around the "code point = character"
notion for the language to change now.

If the language can expose a grapheme iterator, then people can
experiment with grapheme-based APIs in libraries.

(By grapheme I mean "extended grapheme cluster", but that's a mouthful.
Sorry linguists.)

What do you think of these as a set of grapheme primitives?

(1) is_grapheme_break(string, i)

Return True if a grapheme break would occur *before* string[i].

(2) graphemes(string, start=0, end=len(string))

Iterate over graphemes in string[start:end].

(3) graphemes_reversed(string, start=0, end=len(string))

Iterate over graphemes in reverse order.

I *think* is_grapheme_break would be enough for people to implement
their own versions of graphemes and graphemes_reversed. Here's an
untested version:

    def graphemes(string, start, end):
        cluster = []
        for i in range(start, end):
            c = string[i]
            if is_grapheme_break(string, i):
                if i != start:
                    # don't yield the empty cluster at Start Of Text
                    yield ''.join(cluster)
                cluster = [c]
            else:
                cluster.append(c)
        if cluster:
            yield ''.join(cluster)

Regarding is_grapheme_break, if I understand the note here:

https://www.unicode.org/reports/tr29/#Testing

one never needs to look at more than two adjacent code points to tell
whether or not a grapheme break will occur between them, so this ought
to be pretty efficient. At worst, it needs to look at string[i-1] and
string[i], if they exist.
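
To make the three primitives concrete, here is a minimal sketch of how
they could fit together. It is an illustration only, not a UAX #29
implementation: the helpers _is_extend and _is_control are hypothetical
names that approximate the Grapheme_Cluster_Break property from general
categories in unicodedata, and the check looks only at string[i-1] and
string[i]. Rules that need more context (regional-indicator pairs, emoji
ZWJ sequences, Hangul syllables) are deliberately left out, so those
cases will be split incorrectly. The end=None default stands in for
end=len(string), which cannot be written as a real default argument.

```python
import unicodedata

ZWJ = "\u200d"


def _is_extend(ch):
    # Hypothetical helper: rough stand-in for GCB=Extend/SpacingMark/ZWJ,
    # approximated by the combining-mark general categories plus ZWJ.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")


def _is_control(ch):
    # Hypothetical helper: rough stand-in for GCB=Control, CR and LF.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def is_grapheme_break(string, i):
    """Return True if a break would occur before string[i] (simplified)."""
    if i <= 0 or i >= len(string):
        return True                  # start or end of text
    prev, cur = string[i - 1], string[i]
    if prev == "\r" and cur == "\n":
        return False                 # keep CR LF together
    if _is_control(prev) or _is_control(cur):
        return True                  # break around controls
    if _is_extend(cur):
        return False                 # no break before a mark or ZWJ
    return True                      # otherwise, break everywhere


def graphemes(string, start=0, end=None):
    """Yield clusters of string[start:end] in forward order."""
    if end is None:
        end = len(string)
    cluster_start = start
    for i in range(start + 1, end):
        if is_grapheme_break(string, i):
            yield string[cluster_start:i]
            cluster_start = i
    if cluster_start < end:
        yield string[cluster_start:end]


def graphemes_reversed(string, start=0, end=None):
    """Yield clusters of string[start:end] from the end backwards."""
    if end is None:
        end = len(string)
    cluster_end = end
    for i in range(end - 1, start, -1):
        if is_grapheme_break(string, i):
            yield string[i:cluster_end]
            cluster_end = i
    if start < cluster_end:
        yield string[start:cluster_end]
```

With this sketch, list(graphemes("e\u0301a\r\nb")) gives four clusters:
the "e" with its combining acute accent, "a", the CR LF pair, and "b".
graphemes_reversed yields the same clusters in the opposite order.
Slicing by index instead of joining a list of characters is a design
choice of the sketch, not part of the proposal.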
----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue30717>
_______________________________________