On Fri, 23 Mar 2018 07:46:16 -0700, Tobiah wrote:

> If I changed my database tables to all be UTF-8 would this work
> cleanly without any decoding?

Not reliably or safely. It will appear to work so long as you have only
pure ASCII strings from the database, and then crash when you don't:

py> text_from_database = u"hello wörld".encode('latin1')
py> print text_from_database
hello w�rld
py> json.dumps(text_from_database)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 195, in encode
    return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 7: invalid start byte

> Whatever people are doing to get these characters in, whether it's
> foreign keyboards, or fancy escape sequences in the web forms, would
> their intended characters still go into the UTF-8 database as the
> proper characters? Or now do I have to do a conversion on the way in
> to the database?

There is no way to answer that, because it depends on how you are
getting the characters, what you are doing to them, and how you put
them in the database.

In the best possible scenario, your process is:

- user input comes in as UTF-8;
- you store it in the database;
- which converts it to Latin-1 (sometimes losing data: see below);

in which case changing the database field to utf8mb4 should work nicely
(NOT plain utf8: thanks to a ludicrously idiotic design flaw in MySQL,
utf8 is not actually UTF-8). There's a sketch of the conversion further
down.

I mentioned losing data: if your user enters, let's say, the Greek
letters 'αβγ' (or emojis, or any of about a million other characters),
then Latin-1 cannot represent them. Presumably your database is
throwing them away:

py> s = u'αβγ'  # what the user wanted
py> db = s.encode('latin-1', errors='replace')  # what the database recorded
py> json.dumps(db.decode('latin-1'))  # what you end up with
'"???"'

Or, worse, you're getting moji-bake:

py> s = u'αβγ'  # what the user wanted
py> json.dumps(s.encode('utf-8').decode('latin-1'))
'"\\u00ce\\u00b1\\u00ce\\u00b2\\u00ce\\u00b3"'

> We also get import data that often comes in .xlsx format. What
> encoding do I get when I dump a .csv from that? Do I have to ask the
> sender? I already know that they don't know.

They never do :-(

In Python 2, I believe the csv module will assume
ASCII-plus-random-crap, and it will work fine so long as the data
actually is ASCII. Otherwise you'll get random crap: possibly an
exception, possibly moji-bake.
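Python 2's csv module only deals in byte strings, so once you've picked
(or guessed) an encoding you end up decoding every cell by hand. Here's
a minimal sketch (the function names are mine, nothing standard) that
tries UTF-8 first and falls back to Latin-1; Latin-1 maps every
possible byte, so the fallback can't raise, although it may hand you
moji-bake:

import csv

def decode_cell(cell):
    # Try UTF-8 first; fall back to Latin-1, which maps every possible
    # byte and so never raises, though the result may be moji-bake.
    try:
        return cell.decode('utf-8')
    except UnicodeDecodeError:
        return cell.decode('latin-1')

def read_csv(path):
    # Python 2's csv module wants a file opened in binary mode and
    # yields rows of byte strings; decode each cell explicitly.
    with open(path, 'rb') as f:
        for row in csv.reader(f):
            yield [decode_cell(cell) for cell in row]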
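Back on the database side: if all you need right now is to stop the
json crash in the first example above, decode the bytes yourself before
they go anywhere near json.dumps. A minimal sketch, assuming the column
really is handing you Latin-1 (row_to_json is a made-up name):

# -*- coding: utf-8 -*-
import json

def row_to_json(raw_bytes, encoding='latin-1'):
    # Decode the byte string to unicode ourselves; left to its own
    # devices, json.dumps tries UTF-8 and blows up on bytes like 0xf6.
    return json.dumps(raw_bytes.decode(encoding))

text_from_database = u"hello wörld".encode('latin-1')
print row_to_json(text_from_database)  # prints "hello w\u00f6rld"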
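And if you do take the utf8mb4 route, the conversion is roughly as
below. The connection details and table name are invented, and whether
your driver accepts charset='utf8mb4' depends on its vintage, so treat
this as a sketch to check against the MySQL docs, not a recipe to
paste in:

import MySQLdb

# Ask for utf8mb4 on the wire; use_unicode=True makes the driver hand
# back unicode objects instead of raw byte strings.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                       db='mydb', charset='utf8mb4', use_unicode=True)
cur = conn.cursor()

# One-off schema change on a hypothetical table.
cur.execute("ALTER TABLE comments"
            " CONVERT TO CHARACTER SET utf8mb4"
            " COLLATE utf8mb4_unicode_ci")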
The sad truth is that as soon as you leave the nice, clean world of
pure Unicode and start dealing with legacy encodings, everything turns
to quicksand.

If you haven't already done so, you really should start by reading Joel
Spolsky's introduction to Unicode:

http://global.joelonsoftware.com/English/Articles/Unicode.html

and Ned Batchelder's post on dealing with Unicode and Python:

https://nedbatchelder.com/text/unipain.html

--
Steve