There's a recent blog post complaining about the lousy support for Unicode text in most programming languages:
http://mortoray.com/2013/11/27/the-string-type-is-broken/

The author, Mortoray, gives nine basic tests to understand how well the string type in a language works. The first four involve "user-perceived characters", also known as grapheme clusters.

(1) Does the decomposed string "noe\u0308l" print correctly? Notice that the accented letter ë has been decomposed into a pair of code points, U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS). Python 3.3 passes this test:

py> print("noe\u0308l")
noël

although I expect that depends on the terminal you are running in.

(2) If you reverse that string, does it give "lëon"? The implication of this question is that strings should operate on grapheme clusters rather than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
l̈eon

Some terminals may display the umlaut over the l, or following the l.

I'm not completely sure it is fair to expect a string type to operate on grapheme clusters (collections of decomposed characters) as the author expects. I think that is going above and beyond what a basic string type should be expected to do. I would expect a solid Unicode implementation to include support for grapheme clusters, and in that regard Python is lacking functionality.

(3) What are the first three characters? The author suggests that the answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

but again I'm not convinced that slicing should operate across decomposed strings in this way. Surely the point of decomposing the string like that is to count the base character e and the accent "\u0308" separately?

(4) Likewise, what is the length of the decomposed string? The author expects 4, but Python gives 5:

py> len("noe\u0308l")
5

So far, Python passes only one of the four tests, but I'm not convinced that the three failed tests are fair for a string type. If strings operated on grapheme clusters, these would be good tests, but it is not a given that strings should.
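For what it's worth, here is a rough sketch of what grapheme-aware operations might look like today. The helper approx_graphemes is my own name, not anything in the standard library, and it only attaches combining marks to the preceding base character using unicodedata.combining(); real grapheme segmentation as described in UAX #29 is considerably more involved (Hangul jamo, regional indicators, and so on), but this is enough for the noël examples above:

py> import unicodedata
py> def approx_graphemes(text):
...     # Attach each combining mark to the preceding base character.
...     # Only a rough approximation of UAX #29 grapheme clusters.
...     clusters = []
...     for ch in text:
...         if clusters and unicodedata.combining(ch):
...             clusters[-1] += ch
...         else:
...             clusters.append(ch)
...     return clusters
...
py> len(approx_graphemes("noe\u0308l"))
4
py> print(''.join(reversed(approx_graphemes("noe\u0308l"))))
lëon

(As before, whether that last line displays correctly depends on your terminal.) A proper library would implement the full rules of UAX #29, linked again at the end of this post.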
The next few tests have to do with characters in the Supplementary Multilingual Planes, and this is where Python 3.3 shines. (In older versions, wide builds would also pass, but narrow builds would fail.)

(5) What is the length of "😸😾"? Both characters, U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E (POUTING CAT FACE), are outside the Basic Multilingual Plane, which means they require more than two bytes each. Most programming languages using UTF-16 encodings internally (including JavaScript and Java) fail this test. Python 3.3 passes:

py> s = '😸😾'
py> len(s)
2

(Older versions of Python distinguished between *narrow builds*, which used UTF-16 internally, and *wide builds*, which used UTF-32. Narrow builds would also fail this test.) This makes Python one of very few programming languages which can easily handle so-called "astral characters" from the Supplementary Multilingual Planes while still having O(1) indexing operations.

(6) What is the substring after the first character? The right answer is a single character, POUTING CAT FACE, and Python gets that correct:

py> unicodedata.name(s[1:])
'POUTING CAT FACE'

UTF-16 languages invariably end up with broken, invalid strings containing half of a surrogate pair.

(7) What is the reverse of the string? Python passes this test too:

py> print(s[::-1])
😾😸
py> for c in s[::-1]:
...     unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'

UTF-16 based languages typically break, again getting invalid strings containing surrogate pairs in the wrong order.

The next test involves ligatures. Ligatures are pairs, or triples, of characters which have been moved closer together in order to look better. Normally you would expect the type-setter to handle ligatures by adjusting the spacing between characters, but there are a few pairs (such as "fi" <=> "ﬁ") where type designers provided them as custom-designed single characters, and Unicode includes them as legacy characters.

(8) What's the uppercase of "baffle" spelled with an ffl ligature? Like most other languages, Python 3.2 fails:

py> 'baﬄe'.upper()
'BAﬄE'

but Python 3.3 passes:

py> 'baﬄe'.upper()
'BAFFLE'

Lastly, Mortoray returns to noël, and compares the composed and decomposed versions of the string:

(9) Does "noël" equal "noe\u0308l"? Python (correctly, in my opinion) reports that they do not:

py> "noël" == "noe\u0308l"
False

Again, one might argue whether a string type should report these as equal or not; I believe Python is doing the right thing here. As the author points out, any decent Unicode-aware language should at least offer the ability to convert between normalisation forms, and Python passes this test:

py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True

Out of the nine tests, Python 3.3 passes six, with the other three being failures or dubious. If you believe that the native string type should operate on code points, then you'll think that Python does the right thing. If you think it should operate on grapheme clusters, as the author of the blog post does, then you'll think Python fails those three tests.

A call to arms
==============

As the Unicode Consortium itself acknowledges, sometimes you want to operate on an array of code points, and sometimes on an array of graphemes ("user-perceived characters"). Python 3.3 is now halfway there, having excellent support for code points across the entire Unicode character set, not just the BMP. The next step is to provide either a data type, or a library, for working on grapheme clusters. The Unicode Consortium provides a detailed discussion of this issue here:

http://www.unicode.org/reports/tr29/

If anyone is looking for a meaty project to work on, providing support for grapheme clusters could be it. And if not, hopefully you've learned something about Unicode and the limitations of Python's Unicode support.

-- 
Steven