On Fri, Mar 23, 2018 at 10:47 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Fri, 23 Mar 2018 07:09:50 +1100, Chris Angelico wrote:
>
>>> I was reading though, that JSON files must be encoded with UTF-8. So
>>> should I be doing string.decode('latin-1').encode('utf-8')? Or does
>>> the json module do that for me when I give it a unicode object?
>>
>> Reconfigure your MySQL database to use UTF-8. There is no reason to
>> use Latin-1 in the database.
>
> You don't know that. You don't know what technical, compatibility,
> policy or historical constraints are on the database.
Okay. Give me a good reason for the database itself to be locked to
Latin-1. Make sure you explain how potentially saving the occasional
byte of storage (compared to UTF-8) justifies limiting the available
character set to the ones that happen to be in Latin-1, yet it's
essential to NOT limit the character set to ASCII.

>> If that isn't an option, make sure your JSON files are pure ASCII,
>> which is the common subset of UTF-8 and Latin-1.
>
> And that's utterly unnecessary, since any character which can be
> stored in the Latin-1 MySQL database can be stored in the Unicode
> JSON.

Irrelevant; if you fetch eight-bit data out of the database, it isn't
going to be a valid JSON file unless (1) it's really ASCII, like I
suggest; (2) you re-encode it to UTF-8; or (3) it was actually UTF-8
all along, despite being declared as Latin-1.

Restricting JSON to ASCII is a very easy and common thing to do. It
just means that every non-ASCII character gets represented as a \u
escape sequence. In Python's JSON encoder, that's the ensure_ascii
parameter.

Utterly unnecessary? How about standards-compliant and entirely
effective, unlike the re-encoding that means that the database-stored
blob is invalid JSON and must be re-encoded again on the way out?

ChrisA
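To illustrate the ensure_ascii point above, here is a minimal Python 3
sketch; the `raw` bytes standing in for a value fetched from the
Latin-1 database column are hypothetical.

    import json

    # Hypothetical value fetched from a Latin-1 MySQL column: raw bytes.
    raw = b"caf\xe9"                      # "café" encoded as Latin-1

    text = raw.decode("latin-1")          # decode to a Python (unicode) str

    # ensure_ascii=True is the default: every non-ASCII character is
    # emitted as a \uXXXX escape, so the result is pure ASCII.
    doc = json.dumps({"name": text})
    print(doc)                            # {"name": "caf\u00e9"}

    # Pure ASCII is a subset of both UTF-8 and Latin-1, so either
    # encoding of the JSON text yields identical, valid bytes.
    assert doc.encode("utf-8") == doc.encode("latin-1")

    # Round trip: the escape decodes back to the original character.
    assert json.loads(doc)["name"] == "café"

The assertions show why the ASCII route sidesteps the encoding question
entirely: the serialized text is valid JSON whether it is later treated
as UTF-8 or as Latin-1.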