On Tue, Apr 2, 2013 at 4:07 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>> It turns out, the problem is that the version of MySQL we're using
>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/static/multibyte.html
>
> :-)
>
> Postgresql has supported non-broken UTF-8 since at least version 8.1.
Not only that, but I *rely* on PostgreSQL to test-or-reject stuff that
comes from untrustworthy languages, like PHP. If it's malformed in any
way, it won't get past the database.

>> doesn't support non-BMP characters. Newer versions do (but you have
>> to declare the column to use the utf8mb4 character set). I could
>> upgrade to a newer MySQL version, but it's just not worth it.
>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers
> thinking, arbitrarily cutting out support for 50% of UTF-8?

Steven, you punctuated that wrongly. What, were the developers
*thinking*? Arbitrarily etc?

It really is brain-breaking. I could understand a naive UTF-8 codec
being too permissive (allowing over-long encodings, allowing codepoints
above what's allocated (e.g. FA 80 80 80 80, which would notionally
represent U+2000000), etc.), but why should it arbitrarily stop short?
There must have been some internal limitation - perhaps collation was
defined only within the BMP.

ChrisA
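
PS. To make the boundary concrete, a quick sketch in Python 3 (nothing
MySQL-specific here, just the byte counts the server objects to):

    # BMP characters encode to at most three UTF-8 bytes, which is all
    # MySQL's "utf8" charset accepts:
    print('\u20ac'.encode('utf-8'))      # b'\xe2\x82\xac' - EURO SIGN, 3 bytes

    # Non-BMP characters need four bytes - hence utf8mb4:
    print('\U0001F600'.encode('utf-8'))  # b'\xf0\x9f\x98\x80' - 4 bytes

    # And a strict codec rejects the notional five-byte sequence for
    # U+2000000 outright, since UTF-8 is defined only up to U+10FFFF:
    try:
        b'\xfa\x80\x80\x80\x80'.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)                         # invalid start byte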