On 11 August 2013 10:09, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> The reason some accented letters have single code point forms is to
> support legacy charsets; the reason some only exist as combining
> characters is due to the combinatorial explosion. Some languages allow
> you to add up to five or six different accents on any of dozens of
> different letters. If each combination needed its own unique code point,
> there wouldn't be enough code points. For bonus points, if there are five
> accents that can be placed in any combination of zero or more on any of
> four characters, how many code points would be needed?

52?

> Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
> except that you have shown "e" in hex for some reason:
>
> py> b'\x65' == b'e'
> True

Yeah.. I did that because the linked post did it. I'm not sure why either ;).

> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
>>
>> So the solution is:
>>
>> >>> import unicodedata
>> >>> len(unicodedata.normalize("NFC", tweet))
>> 4
>
> In this particular case, this will reduce the tweet to the normalised
> form that Twitter uses.
>
> [...]
>> After further testing (I don't actually use Twitter) it seems the whole
>> thing was just smoke and mirrors. The linked article is a lie, at least
>> on the user's end.
>
> Which linked article? The one on dev.twitter.com seems to be okay to me.

That's the one.

> Of course, they might be lying when they say "Twitter counts the length
> of a Tweet using the Normalization Form C (NFC) version of the text", I
> have no idea. But they seem to have a good grasp of the issues involved,
> and assuming they do what they say, at least Western European users
> should be happy.

They *don't* seem to be doing what they say.

>> On Linux you can prove this by running:
>>
>> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>> >>> p.communicate(input=b"caf\x65\xCC\x81")
>> (None, None)
>>
>> "café" will be in your Copy-Paste buffer, and you can paste it in to
>> the tweet-box. It takes 5 characters. So much for testing ;).
>
> How do you know that it takes 5 characters? Is that some Javascript
> widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.

My only question here is, since you can't post after 140 non-normalised
characters, who cares if the server counts it as fewer?

> If this shows up in your application as café rather than café, it is a
> bug in the text rendering engine.
> Some applications do not deal with combining characters correctly.

Why the rendering engine?

> (It's a hard problem to solve, and really needs support from the font. In
> some languages, the same accent will appear in different places depending
> on the character they are attached to, or the other accents there as
> well. Or so I've been led to believe.)
>
>
>> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>
> Looks reasonable to me. No obvious errors to my eyes.

*Not sure whether talking about the link or my post*
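
For reference, here is a minimal sketch of the length difference discussed in this thread. It assumes Python 3, where len() on a str counts code points. The byte string is the one quoted above; the variable names (raw, decomposed, composed) are only illustrative. It shows what NFC does to the string locally and says nothing about what Twitter's servers or the tweet-box actually do.

```python
import unicodedata

# The byte string from the thread: "caf", then "e" (0x65), then
# U+0301 COMBINING ACUTE ACCENT encoded in UTF-8 (0xCC 0x81).
raw = b"caf\x65\xCC\x81"

# Decoding keeps the "e" and the combining accent as two code points.
decomposed = raw.decode("utf-8")

# NFC composes "e" + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))           # 5 code points before normalisation
print(len(composed))             # 4 code points after NFC
print(composed == "caf\u00e9")   # True
```

So the 5 shown by the tweet-box matches a count of the un-normalised code points, while 4 is what an NFC-based count would give.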