On Thu, Jul 20, 2017 at 1:45 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> So let's assume we will expand str to accommodate the requirements of
> grapheme clusters.
>
> All existing code would still produce only traditional strings. The only
> way to introduce the new "super code points" is by invoking the
> str.canonical() method:
>
>     text = "hyvää yötä".canonical()
>
> In this case text would still be a fully traditional string because both
> ä and ö are represented by a single code point in NFC. However:
>
>     >>> q = unicodedata.normalize("NFC", "aq̈u")
>     >>> len(q)
>     4
>     >>> text = q.canonical()
>     >>> len(text)
>     3
>     >>> text[0]
>     "a"
>     >>> text[1]
>     "q̈"
>     >>> text[2]
>     "u"
>     >>> q2 = unicodedata.normalize("NFC", text)
>     >>> len(q2)
>     4
>     >>> text.encode()
>     b'aq\xcc\x88u'
>     >>> q.encode()
>     b'aq\xcc\x88u'
Ahh, I see what you're looking at. This is fundamentally very similar to what was suggested a few hundred posts ago: a function in the unicodedata module which yields a string's combined characters as units. That way, you only see the clusters when you actually want them, and producing them is just a form of iterating over the string. This could easily be done as a class or function in unicodedata, without any language-level support. It might even already exist on PyPI.
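Something along these lines ought to do it. Untested sketch, and the name combined_chars is just mine; it leans on unicodedata.combining(), which only catches marks with a nonzero canonical combining class, so it's an approximation of the real UAX #29 grapheme cluster rules, not a full implementation:

import unicodedata

def combined_chars(s):
    """Yield each base character together with any combining
    marks that follow it (a rough stand-in for grapheme clusters)."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            # Nonzero combining class: attach the mark to the
            # current base character.
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

Run against the example above, it gives the grouping Marko wants without touching str itself:

>>> q = unicodedata.normalize("NFC", "aq\u0308u")
>>> len(q)
4
>>> list(combined_chars(q))
['a', 'q̈', 'u']
>>> len(list(combined_chars(q)))
3

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list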