Re: UTF-8 and latin1

Barry Wed, 17 Aug 2022 13:21:57 -0700


> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list 
> <[email protected]> wrote:
> 
> On 2022-08-17, Tobiah <[email protected]> wrote:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications.  I find that I can do 
>> some_string.decode('latin1')
>> to get unicode that I can use with xlsxwriter,
>> or put <meta charset="latin1"> in the header of a web page to display
>> European characters correctly.  But normally UTF-8 is recommended as
>> the encoding to use today.  latin1 works correctly more often when I
>> am using data from the wild.  It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text.   I'm just wondering
>> if there are any thoughts.  What if we just globally decided to use utf-8?
>> Could that ever happen?
> 
> That has already been decided, as much as it ever can be. UTF-8 is
> essentially always the correct encoding to use on output, and almost
> always the correct encoding to assume on input absent any explicit
> indication of another encoding. (e.g. the HTML "standard" says that
> all HTML files must be UTF-8.)
> 
> If you are finding that your specific sources are often encoded with
> latin-1 instead then you could always try something like:
> 
>    try:
>        text = data.decode('utf-8')
>    except UnicodeDecodeError:
>        text = data.decode('latin-1')
> 
> (I think latin-1 text will almost always fail to be decoded as utf-8,
> so this would work fairly reliably assuming those are the only two
> encodings you see.)


Only if the string contains a byte sequence that is not valid utf-8,
which takes at least one non-ASCII byte (0x80 or above).
Plain ASCII text will decode the same in either.
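
A rough sketch of that point, assuming Python 3 and bytes objects; the
helper name decodes_as_utf8 and the sample byte strings are made up for
illustration:

```python
def decodes_as_utf8(data):
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Illustrative samples (not from any real data source).
ascii_only = b"plain text"   # no byte >= 0x80
accented = b"caf\xe9"        # "café" encoded as latin-1
ambiguous = b"\xc3\xa9"      # "Ã©" in latin-1, but also "é" in utf-8

print(decodes_as_utf8(ascii_only))  # True: ASCII is valid in both
print(decodes_as_utf8(accented))    # False: a lone 0xe9 is not valid utf-8
print(decodes_as_utf8(ambiguous))   # True: valid utf-8, so no error is raised
```

The last case is the awkward one: the bytes decode without error under
both encodings, so a try/except cannot tell which one the sender meant.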

For web pages it cannot be assumed that markup saying it’s utf-8 is
correct. Many pages are in fact cp1252. Usually you find out because
of a smart quote, which is a single byte in the range 0x91-0x94 in
cp1252 and illegal on its own in utf-8.
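
One way to fold cp1252 into the try/except quoted above, as a sketch
only: the function name decode_with_fallback and the order of the
guesses are assumptions, not an established recipe. Python's cp1252
codec rejects the few bytes that encoding leaves undefined (0x81, for
example), so latin-1, which accepts every byte, is kept as a last resort.

```python
def decode_with_fallback(data):
    """Try utf-8, then cp1252, then latin-1.

    Returns a (text, encoding_used) pair. The order is a guess at what
    is most likely for data from the wild, not a guarantee.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return data.decode("latin-1"), "latin-1"


# cp1252 smart quotes around a word: 0x93 and 0x94 are invalid as
# standalone bytes in utf-8, so the utf-8 attempt fails first.
text, used = decode_with_fallback(b"\x93hi\x94")
print(used)
```

The returned encoding name is only a record of which guess did not
raise an error; it is not proof that the guess was right.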

Barry


> 
> Or you could use something fancy like https://pypi.org/project/chardet/
> 

-- 
https://mail.python.org/mailman/listinfo/python-list
