On 7/15/05, Bruce Dembecki <[EMAIL PROTECTED]> wrote:
> This process has worked for us taking our latin1 4.0 databases and
> turning them into utf8 4.1 databases. UTF8 data we had already put in
> our 4.0 database despite it's latin1 encoding was correctly exported
> out of 4.0 and correctly converted on it's way in to 4.1


Wow!  Really?  This part amazes me.

So the MySQL importing process seems to do the converting of the
special characters into the newly-defined encoding format?  (from
latin1 to utf-8 in my case)

See - we do webhosting for clients around the world, and right now our
default-encoded MySQL 4.0 databases have Swedish and Hebrew characters
in them.

I'm concerned that if I dumped them as latin1/default in 4.0, but then
imported as utf-8 in 4.1 that the non-ASCII characters would get
imported as the wrong encoding.

(Assuming, yes, that I would set our new 4.1 databases to do ALL utf-8
in the /etc/my.cnf and gladly take the small performance/size hit.)


OK, this is where we are at, and what it does for us... I can't speak to where you are at and what it will do for you, however, we have this:

4.0 data everything is encoded by MySQL as Latin1. However the web app and the JDBC both speak UTF8, so the data being given to the database is UTF8 data which the database then stores in it's Latin1 table. When the JDBC extracts the data from 4.0 and gives it to the web app it displays as we would expect in the language it was entered... Was very cool and worked great for us.

Then we went to 4.1... just changed the Binary, expecting things to work the way they did... wrong. Suddenly all our databases with strong non latin1 character sets were in trouble. Our Chinese boards were a mess, as were our German boards... even our Australian board which should have been fine was a mess because one of the areas was titled "Australia Café!" and the word was in there a lot. Took about three months and many hours with MYSQL folks to figure it out. The upshot however was if you went about the conversion by doing a dump and an import, and made sure to tell the import to treat the specific columns as UTF8, everything arrived in fine shape. (There was one set of data loss because it turned out one of the Chinese boards was set to Big8 and the JDBC was set to UTF8 and the database Latin1... Get's to a point where there's only so much encoding it can take)

That said. the reality is if your application is anything like most the majority of the columns are for the application to work with, very few actually deal with text that needs to be encoded. The rest are time stamps, id#s, references to help you track the data... there's likely not much if any benefit to having them UTF8 encoded. For us we have 80 databases with almost that many tables with many columns.. figuring out one by one which way is up on everything represents a challenge. We're going to go back and do it now. But if I had one or two databases, I'd start by leaving the database the default (Latin1) and doing the database create and importing the .sql files (which creates the tables), then go through and run ALTER TABLE a few times and set the specific columns to UTF8... do the import the same as in my script... Latin1 text encodes in UTF8 to... Latin1 text, so you shouldn't have any problem, but I'm guessing here of course. Dump your data to disk and bring it up on another server, try different things and fidn out what works best for you before you do the real import.

It SHOULD be pretty painless if you manage the import and use the mysql tools with the appropriate flag set for character sets on the way in... the only thing that will be painful about the process is if you just upgrade the binaries and expect the thing to work the way it did before, then you'll have some pain :-)

Best Regards, Bruce
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to