On Thu, 22 Jun 2017 11:57 pm, Chris Angelico wrote:

> On Thu, Jun 22, 2017 at 11:33 PM, Steve D'Aprano
> <steve+pyt...@pearwood.info> wrote:
>> and besides some Unicode code points are not
>> characters at all).
>>
>> http://www.unicode.org/faq/private_use.html#noncharacters
>
> AIUI, "noncharacters" are like the IEEE floating point value
> "not-a-number".

That's... kinda fair. Although, the Unicode Consortium thinks of them as more
like private use characters, only even more private, and not characters :-)

(If you ask me, I think the noncharacters exist because "it seemed like a good
idea at the time" -- the use-case for them seems particularly ill-defined. I
suspect that if we were to redo Unicode from scratch, they wouldn't be
included.)

> So a character count should normally *include* any noncharacters in the
> string.

That depends on what you mean by *character*.

If you mean "code point", then I agree it should be counted.

If you mean "a letter, a digit, an ideograph, emoji, ... " then probably not.
(Depends what's in the ellipsis :-)

If you mean a grapheme, then certainly not, because the 66 Unicode
noncharacters don't belong to any human language.

If you mean "a grapheme cluster, or a code point for things which aren't
characters from human languages" then I guess they should be counted, as will
control characters, formatting marks, surrogate code points, and anything else
which doesn't represent a natural language character.

What is a natural language character? Is IJ one or two characters? Depends on
whether you're Dutch or not ;-)

This is why the Unicode standard tries not to talk in terms of "characters".
They're not well-defined.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
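
To make the "depends what you mean by character" point concrete, here is a rough sketch of how the same string gives different counts under different definitions. The function names are made up for illustration (none of them are standard library APIs), and counting "letters" by general category is only one possible reading of the second definition above. Grapheme clusters are left out, since the standard library has no segmentation function for them.

```python
import unicodedata


def is_noncharacter(ch: str) -> bool:
    """Return True if ch is one of the 66 Unicode noncharacters."""
    cp = ord(ch)
    # U+FDD0..U+FDEF (32 code points), plus the last two code points
    # of each of the 17 planes (U+nFFFE and U+nFFFF, 34 in total).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_code_points(s: str) -> int:
    # In Python 3, len() of a str counts code points, noncharacters included.
    return len(s)


def count_without_noncharacters(s: str) -> int:
    # Same count, but skipping the 66 noncharacters.
    return sum(1 for ch in s if not is_noncharacter(ch))


def count_letters(s: str) -> int:
    # Assumption: "letter" means a general category starting with "L".
    return sum(1 for ch in s if unicodedata.category(ch).startswith("L"))


# "I", "J", the noncharacter U+FDD0, and U+0132 (the single-code-point IJ).
sample = "IJ\ufdd0\u0132"

print(count_code_points(sample))
print(count_without_noncharacters(sample))
print(count_letters(sample))

# The Dutch IJ question in miniature: one code point before NFKC
# normalization, two code points after.
print(len("\u0132"), len(unicodedata.normalize("NFKC", "\u0132")))
```

None of these counts is "the" character count; each one just answers a different question about the string.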