The major issue is that "Unicode character" doesn't have a good definition. The most likely definition is a "Unicode code point". However, Windows uses "Unicode character" to mean a UTF-16 code unit (a single 16-bit value), which means that any code point above the Basic Multilingual Plane is really composed of two "Unicode characters": the two halves of a surrogate pair.
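To make the surrogate-pair arithmetic concrete, here is a minimal JavaScript sketch of how a code point above the BMP splits into two 16-bit UTF-16 values. `toSurrogatePair` is a made-up helper name for illustration only; `String.fromCharCode` and `String.fromCodePoint` are the standard built-ins.

```javascript
// Sketch: split a code point above the BMP into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary-plane code point");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9); // U+1F4A9 PILE OF POO
const fromUnits = String.fromCharCode(high, low);
const fromCodePoint = String.fromCodePoint(0x1F4A9);

console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(fromUnits === fromCodePoint);         // true
console.log(fromUnits.length);                    // 2 (UTF-16 values)
console.log([...fromUnits].length);               // 1 (code points)
```

Spreading a string iterates by code point rather than by 16-bit value, which is why the last two counts differ.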
This confusion also extends to JavaScript, whose String type is made up of "characters" that are actually UTF-16 code units. You can see this with astral plane characters like emoji:

    > "💩".length
    2
    > "💩" == "\uD83D\uDCA9"
    true

As an example of a grapheme cluster without a precomposed, single-code-point form, look at the Regional Indicators, which were the politics-free way to add flag symbols to emoji. There are 26 code points, "A" through "Z", and when they are put next to each other to spell a two-letter country code, like "🇺🇸", it's expected that certain combinations will show up as flags, without Unicode explicitly defining which ones. But a sequence of regional indicator code points is entirely one grapheme cluster.

Go sidesteps the term "character" and opts for "rune" instead, which is just a 32-bit value holding a code point.

Swift has an even crazier "Character" type [0], which can hold an entire grapheme cluster, rather than just a single code point. This actually means that Swift's "Character" type is of potentially unbounded length, since Regional Indicator sequences aren't capped at a maximum of two code points.

Unicode is fun.

[0] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html#//apple_ref/doc/uid/TP40014097-CH7-ID285

On Thu, Mar 17, 2016 at 12:42 PM, Matthias Clasen <matthias.cla...@gmail.com> wrote:
> On Thu, Mar 17, 2016 at 2:26 PM, Jasper St. Pierre
> <jstpie...@mecheye.net> wrote:
>
>> I'll also ask what "character" means in this case, even though I know
>> glib also has the same confusion. Are you talking about the number of
>> Unicode code points in the string, or the number of grapheme clusters,
>> as defined by Unicode TR29 [0]? The number of code points isn't useful
>> for editing in all cases, even after NFC normalization. Some grapheme
>> clusters just don't have a single code-point representation.
>
> I don't think there is any confusion in glib about this, really.
> There is no mention of graphemes in GLib at all, it's all just
> characters. If you want graphemes, you need pango.

-- 
Jasper
_______________________________________________
gtk-devel-list mailing list
gtk-devel-list@gnome.org
https://mail.gnome.org/mailman/listinfo/gtk-devel-list