On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <jos...@landau.ws> wrote:
> Given tweet = b"caf\x65\xCC\x81".decode():
>
>     >>> tweet
>     'café'
>
> But:
>
>     >>> len(tweet)
>     5

You're now looking at the difference between glyphs and code points.
The é in your example is two code points, 'e' followed by U+0301
COMBINING ACUTE ACCENT, and len() counts both. Twitter counts combining
characters too, so when you build one "thing" out of several
separately-typed parts, it does count as more characters.
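
To see where the 5 comes from, here is a minimal sketch using only the
standard unicodedata module. The variable names are mine, and it makes no
claim about what any particular service does before counting; it only
shows how the decomposed and NFC-composed forms of the same text differ
in length.

```python
import unicodedata

# Same bytes as in the quoted example: "caf", then "e" (0x65), then
# U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes 0xCC 0x81).
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# len() counts code points, so the base letter and the accent each count.
decomposed_len = len(tweet)

# NFC normalization composes "e" + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", tweet)
composed_len = len(composed)

# Listing the Unicode names shows what is actually in the string.
names = [unicodedata.name(ch) for ch in tweet]

print(decomposed_len, composed_len)
print(names)
```

Both strings display as the same four glyphs; only the NFC form has
len() == 4.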

Read this article for some arguments on the subject, including a
number of references to Twitter itself:

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list
