> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list@python.org> wrote:
>
> On 2022-08-17, Tobiah <t...@tobiah.org> wrote:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications. I find that I can do
>> some_string.decode('latin1')
>> to get unicode that I can use with xlsxwriter,
>> or put <meta charset="latin1"> in the header of a web page to display
>> European characters correctly. But normally UTF-8 is recommended as
>> the encoding to use today. latin1 works correctly more often when I
>> am using data from the wild. It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text. I'm just wondering
>> if there are any thoughts. What if we just globally decided to use utf-8?
>> Could that ever happen?
>
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)
>
> If you are finding that your specific sources are often encoded with
> latin-1 instead then you could always try something like:
>
>     try:
>         text = data.decode('utf-8')
>     except UnicodeDecodeError:
>         text = data.decode('latin-1')
>
> (I think latin-1 text will almost always fail to be decoded as utf-8,
> so this would work fairly reliably assuming those are the only two
> encodings you see.)

Only if the string contains non-ASCII bytes that do not form a valid utf-8 sequence. Plain ASCII text will decode the same way in either.

For web pages it cannot be assumed that markup saying it's utf-8 is correct. Many pages are in fact cp1252.
Usually you find out because of a smart quote, which is a single byte in the range 0x91 to 0x94 in cp1252 and is illegal on its own in utf-8.

Barry

>
> Or you could use something fancy like https://pypi.org/project/chardet/
>
> --
> https://mail.python.org/mailman/listinfo/python-list
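
PS: for what it's worth, here is a minimal sketch of the fallback idea extended to cp1252. It assumes utf-8, cp1252 and latin-1 are the only encodings you expect to see; `decode_with_fallback` is a made-up helper name, not something from the standard library or chardet. Python's cp1252 codec rejects the few bytes that cp1252 leaves undefined (0x81, for example), so latin-1 is kept as a last resort because it accepts every byte.

```python
from typing import Tuple


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), trying utf-8, then cp1252, then latin-1.

    Hypothetical helper for illustration only.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next candidate.
            continue
    # latin-1 maps every byte value to a code point, so this cannot fail.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    samples = [
        b"plain ascii",        # identical in all three encodings
        b"caf\xc3\xa9",        # valid utf-8 for "café"
        b"\x93quoted\x94",     # cp1252 smart quotes, invalid as utf-8
        b"\x81",               # undefined in cp1252, falls through to latin-1
    ]
    for raw in samples:
        text, used = decode_with_fallback(raw)
        print(f"{raw!r} -> {text!r} via {used}")
```

Note that this only tells you which decoder did not raise, not which encoding the sender actually used: the latin-1 bytes b"\xc3\xa9" ("Ã©") are also valid utf-8 for "é", so some guessing remains.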