Ben Finney <ben+pyt...@benfinney.id.au>:

> Steve D'Aprano <steve+pyt...@pearwood.info> writes:
>> From time to time, people discover that Python's string algorithms
>> work on code points rather than "real characters", which can lead to
>> anomalies
>
> [...]
>
>>>> [unicodedata.name(c) for c in reversed(s1)]
> ['LATIN SMALL LETTER X',
>  'LATIN SMALL LETTER E',
>  'LATIN SMALL LETTER A WITH DIAERESIS',
>  'LATIN SMALL LETTER X']
>>>> "".join(reversed(s1))
> 'xeäx'
>>>> [unicodedata.name(c) for c in reversed(s2)]
> ['LATIN SMALL LETTER X',
>  'LATIN SMALL LETTER E',
>  'COMBINING DIAERESIS',
>  'LATIN SMALL LETTER A',
>  'LATIN SMALL LETTER X']
>>>> "".join(reversed(s2))
> 'xëax'
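For this particular example, NFC normalization happens to hide the
problem, because "a" plus COMBINING DIAERESIS has a precomposed form.
A sketch only; it doesn't help for the many clusters that have no
precomposed code point:

>>> import unicodedata
>>> s2 = "xa\u0308ex"                       # x, a, COMBINING DIAERESIS, e, x
>>> s2 = unicodedata.normalize("NFC", s2)   # composes a + diaeresis into ä
>>> [unicodedata.name(c) for c in s2]
['LATIN SMALL LETTER X',
 'LATIN SMALL LETTER A WITH DIAERESIS',
 'LATIN SMALL LETTER E',
 'LATIN SMALL LETTER X']
>>> "".join(reversed(s2))
'xeäx'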
Unicode was supposed to get us out of the 8-bit locale hole. Now it
seems the Unicode hole is far deeper, and we haven't reached the bottom
of it yet. I wonder if the hole even has a bottom.

We now have:

 - an encoding: a sequence of bytes

 - a string: a sequence of integers (code points)

 - "a snippet of text": a sequence of characters

Assuming "a sequence of characters" is the final word, and Python wants
to be involved in that business, one must question the usefulness of
strings, which are neither here nor there. When people use Unicode,
they are expecting to be able to deal in real characters. I would
expect (see the sketch in the P.S. below):

 - len(text) to give me the length in characters

 - text[-1] to evaluate to the last character

 - re.match("a.c", text) to match any single character between "a"
   and "c"

So the question is: should we have a third type for text? Or should the
semantics of strings be changed to be based on characters?

Marko
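P.S. A rough sketch of the character-level behavior I have in mind,
using unicodedata.combining() to glue combining marks onto their base
character. This is an approximation only: real grapheme cluster
segmentation is defined by Unicode UAX #29 (the third-party "regex"
module implements it as \X), and this sketch misses ZWJ emoji
sequences, Hangul jamo and other multi-code-point clusters:

import unicodedata

def graphemes(text):
    """Split text into simplified grapheme clusters: a base
    character plus any directly following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

With that, the expectations above hold even for the decomposed string:

>>> s2 = "xa\u0308ex"
>>> len(graphemes(s2))               # length in characters
4
>>> graphemes(s2)[-1]                # the last character
'x'
>>> "".join(reversed(graphemes(s2)))
'xeäx'

--
https://mail.python.org/mailman/listinfo/python-list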