On Sat, Jan 4, 2014 at 1:57 AM, Roy Smith <r...@panix.com> wrote:
> I was doing a project a while ago importing 20-something million records
> into a MySQL database. Little did I know that FOUR of those records
> contained astral characters (which MySQL, at least the version I was
> using, couldn't handle).
>
> My way of dealing with those records was to nuke them. Longer term we
> ended up switching to Postgres.

Look! Postgres means you don't lose data!!

Seriously though, that's a much better long-term solution than
destroying data. But MySQL does support the full Unicode range - just
not in its "UTF8" type. You have to specify "UTF8MB4" - that is,
"maximum bytes 4" rather than the default of 3. As I read [1], UTF8MB4
covers the same range as UTF-16 (the BMP plus the astral planes), and
UTF8 covers only what UCS-2 does (the BMP). And according to [2], it's
even possible to explicitly choose the mindblowing BMP-only behaviour of
UCS-2 for a data type that calls itself "UTF8", by asking for "UTF8MB3"
by name, so that a vague theoretical subsequent version of MySQL might
be able to make "UTF8" mean UTF-8, and people can choose to use the
other alias.

To my mind, this is a bug with backward-compatibility concerns. That
means it can't be fixed in a point release. Fine. But the behaviour
change is "this used to throw an error, now it works". Surely that can
be fixed in the next release. Or surely there could be a version or two
of deprecating "UTF8" in favour of the two "MB?" types (and never ever
returning "UTF8" from any query), followed by a reintroduction of "UTF8"
as an alias for MB4, and the deprecation of MB3.

Or am I spoiled by the quality of Python (and other) version numbering,
where I can (largely) depend on functionality not changing in point
releases?

ChrisA

[1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
[2] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html
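
P.S. For anyone wanting to see the 3-byte versus 4-byte distinction
concretely, here is a rough sketch. It assumes Python 3.3 or later, so
that each astral character is a single code point in a str. The helper
names astral_chars and fits_in_three_byte_utf8 are made up for
illustration and are not part of MySQL or of any database driver. It
only shows which characters need four bytes in UTF-8, which is the set
a 3-byte "UTF8" column has no room for.

```python
def astral_chars(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_three_byte_utf8(text):
    """True if every character encodes to at most 3 bytes of UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Sample strings chosen for illustration only.
bmp_sample = "caf\u00e9 \u20ac"       # e-acute and the euro sign, both in the BMP
astral_sample = "clef: \U0001D11E"    # U+1D11E MUSICAL SYMBOL G CLEF

print(fits_in_three_byte_utf8(bmp_sample))                   # True
print(fits_in_three_byte_utf8(astral_sample))                # False
print([hex(ord(ch)) for ch in astral_chars(astral_sample)])  # ['0x1d11e']
print(len("\U0001D11E".encode("utf-8")))                     # 4
```

Running something like astral_chars over the data before an import
would at least have found those FOUR records up front, rather than
after the fact.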