On Sat, Mar 24, 2018 at 1:46 AM, Tobiah <t...@tobiah.org> wrote:
> On 03/22/2018 12:46 PM, Tobiah wrote:
>>
>> I have some mailing information in a MySQL database that has
>> characters from various other countries. The table says that
>> it's using latin-1 encoding. I want to send this data out
>> as JSON.
>>
>> So I'm just taking each datum, doing 'name'.decode('latin-1'),
>> and adding the resulting Unicode value right into my JSON structure
>> before doing .dumps() on it. This seems to work, and I can consume
>> the JSON with another program; when I print values, they look nice,
>> special characters and all.
>>
>> I was reading, though, that JSON files must be encoded with UTF-8. So
>> should I be doing string.decode('latin-1').encode('utf-8')? Or does
>> the json module do that for me when I give it a unicode object?
>
> Thanks for all the discussion. A little more about our setup:
> we have used a LAMP stack system for almost 20 years to deploy
> hundreds of websites. The database tables are latin-1 only because
> at the time we didn't know how, or care, to change it.
>
> More and more, 'special' characters caused a problem. They would
> not come out correctly in a .csv file, or wouldn't print correctly.
> Lately, I noticed that a JSON file I was sending out was delivering
> unreadable characters. That's when I started to look into Unicode
> a bit more. From the discussion, and my own guesses, it looks
> as though all I have to do is string.decode('latin-1') and stuff
> that Unicode object right into the structure that gets handed to
> json.dumps().
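That last bit is right, yes. In Python 2 (which you must be on, since
you're calling .decode on a str), decoding each latin-1 byte string to
unicode and handing the result to json.dumps is all you need. A
minimal sketch, with made-up row contents:

    import json

    # A hypothetical row as fetched from the latin-1 table: Python 2
    # byte strings, where '\xe9' is latin-1 for e-acute.
    row = {'name': 'Jos\xe9', 'city': 'Z\xfcrich'}

    # Decode every latin-1 byte string into a unicode object first.
    record = dict((k, v.decode('latin-1')) for k, v in row.items())

    # By default json.dumps escapes non-ASCII as \uXXXX, which is pure
    # ASCII and safe for any JSON consumer. With ensure_ascii=False
    # you get a unicode result; encode it as UTF-8 before writing.
    print json.dumps(record)
    print json.dumps(record, ensure_ascii=False).encode('utf-8')

So no, you don't need an extra .encode('utf-8') on each field; do it
once on the finished JSON text, and only if you ask for non-ASCII
output.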
Yep, this is sounding more and more like you need to go UTF-8 everywhere.

> If I changed my database tables to all be UTF-8, would this
> work cleanly without any decoding? Whatever people are doing
> to get these characters in, whether it's foreign keyboards
> or fancy escape sequences in the web forms, would their intended
> characters still go into the UTF-8 database as the proper
> characters? Or do I now have to do a conversion on the way in
> to the database?

The best way to do things is to let your Python-MySQL bridge do the
decoding for you; you'll simply store and get back Unicode strings.
That's how things happen by default in Python 3 (I believe; it's been
a while since I used MySQL, but it's like that with PostgreSQL). My
recommendation is to give it a try; most likely, things will just
work. There's a rough sketch of what that looks like at the end of
this message.

> We also get import data that often comes in .xlsx format. What
> encoding do I get when I dump a .csv from that? Do I have to
> ask the sender? I already know that they don't know.

Ah, now, that's a potential problem. A CSV file can't tell you what
encoding it's in. Fortunately, UTF-8 is designed to be fairly
dependable: if you attempt to decode something as UTF-8 and it works,
you can be confident that it really is UTF-8 (the second sketch below
shows that check). But ultimately, you have to just ask the person
who exports it: "please export it in UTF-8".

Generally, things should "just work" as long as you're consistent
with encodings, and the easiest way to be consistent is to use UTF-8
everywhere. It's a simple rule that everyone can follow.
(Hopefully. :) )
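Here's roughly what the driver-does-the-decoding approach looks like.
Treat it as a sketch rather than tested code: the connection details
and table are made up, and I'm assuming the pymysql driver (MySQLdb
takes similar arguments):

    import pymysql

    # charset='utf8mb4' makes the driver decode for you: query results
    # come back as unicode, and unicode parameters are encoded on the
    # way in, so you never call .decode()/.encode() yourself.
    conn = pymysql.connect(host='localhost', user='web',
                           password='secret', db='mailing',
                           charset='utf8mb4')
    with conn.cursor() as cur:
        cur.execute("SELECT name, city FROM contacts")
        for name, city in cur.fetchall():
            print name, city  # both are already unicode objects

(utf8mb4 rather than MySQL's "utf8", which is a three-byte subset
that can't store some characters, such as emoji.)

And the "try UTF-8 first" check for incoming CSV dumps, again just a
sketch:

    def decode_csv_bytes(raw):
        # An invalid byte sequence makes a UTF-8 decode raise
        # UnicodeDecodeError, so a decode that succeeds is strong
        # evidence the data really is UTF-8.
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # latin-1 maps every byte to a codepoint, so this never
            # fails; if the guess is wrong you get mojibake, not a
            # crash.
            return raw.decode('latin-1')

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list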