On Wed, 15 Jan 2014 12:00:51 +0000, Robin Becker wrote:

> so two 'characters' are 3 (or 2 or more) codepoints.

Yes.

> If I want to isolate so called graphemes I need an algorithm even
> for python's unicode

Correct. Graphemes are language dependent, e.g. in Dutch "ij" is usually
a single grapheme, in English it would be counted as two. Likewise, in
Czech, "ch" is a single grapheme. The Latin form of Serbo-Croatian has
three two-letter graphemes, Dž, Lj and Nj (it used to have four, but Dj
is now written as Đ).

Worse, linguists sometimes disagree as to what counts as a grapheme. For
instance, some authorities consider the English "sh" to be a separate
grapheme. As a native English speaker, I'm not sure about that. Certainly
it isn't a separate letter of the alphabet, but on the other hand I can't
think of any words containing "sh" that should be considered as two
graphemes "s" followed by "h". Wait, no, that's not true... compound
words such as "glasshouse" or "disheartened" are counterexamples.

> ie when it really matters, python3 str is just another encoding.

I'm not entirely sure how a programming language data type (str) can be
considered a transformation.

-- 
Steven
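
For what it's worth, here is a rough sketch of the kind of algorithm the
question above is asking about. It only glues combining marks onto the
preceding code point, which is a big simplification: it is not the full
Unicode text segmentation algorithm, and it knows nothing about
language-specific graphemes such as Dutch "ij". The function name
rough_clusters is made up for this example.

```python
import unicodedata

def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified illustration only: a code point with a non-zero canonical
    combining class is attached to the previous cluster; everything else
    starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' followed by COMBINING ACUTE ACCENT, then a plain 'a'
s = "e\u0301a"
print(len(s))                  # number of code points
print(len(rough_clusters(s)))  # number of rough clusters
```

With that string, len() counts three code points while the sketch
reports two clusters, which is the mismatch being discussed.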