Decoding bytes to text strings in Python 2

2024-06-21 Thread Rayner Lucas via Python-list


I'm curious about something I've encountered while updating a very old 
Tk app (originally written in Python 1, but I've ported it to Python 2 
as a first step towards getting it running on modern systems). The app 
downloads emails from a POP server and displays them. At the moment, the 
code is completely unaware of character encodings (which is something I 
plan to fix), and I have found that I don't understand what Python is 
doing when no character encoding is specified.

To demonstrate, I have written this short example program that displays 
a variety of UTF-8 characters to check whether they are decoded 
properly:

 Example Code 
import Tkinter as tk

window = tk.Tk()

mytext = """
  \xc3\xa9 LATIN SMALL LETTER E WITH ACUTE
  \xc5\x99 LATIN SMALL LETTER R WITH CARON
  \xc4\xb1 LATIN SMALL LETTER DOTLESS I
  \xef\xac\x84 LATIN SMALL LIGATURE FFL
  \xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q
  \xc2\xbd VULGAR FRACTION ONE HALF
  \xe2\x82\xac EURO SIGN
  \xc2\xa5 YEN SIGN
  \xd0\x96 CYRILLIC CAPITAL LETTER ZHE
  \xea\xb8\x80 HANGUL SYLLABLE GEUL
  \xe0\xa4\x93 DEVANAGARI LETTER O
  \xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57
  \xe2\x99\xa9 QUARTER NOTE
  \xf0\x9f\x90\x8d SNAKE
  \xf0\x9f\x92\x96 SPARKLING HEART
"""

mytext = mytext.decode(encoding="utf-8")
greeting = tk.Label(text=mytext)
greeting.pack()

window.mainloop()
 End Example Code 

This works exactly as expected, with all the characters displaying 
correctly.

However, if I comment out the line 'mytext = mytext.decode
(encoding="utf-8")', the program still displays *almost* everything 
correctly. All of the characters appear correctly apart from the two 
four-byte emoji characters at the end, which instead display as four 
characters. For example, the "SNAKE" character actually displays as:
U+00F0 LATIN SMALL LETTER ETH
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
U+FF90 HALFWIDTH KATAKANA LETTER MI
U+FF8D HALFWIDTH KATAKANA LETTER HE

What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii', 
but it's clearly not attempting to display the bytes as ASCII (or 
cp1252, or ISO-8859-1). How is it deciding on some sort of almost-but-
not-quite UTF-8 decoding?

I am using Python 2.7.18 on a Windows 10 system. If there's any other 
relevant information I should provide please let me know.

Many thanks,
Rayner
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Decoding bytes to text strings in Python 2

2024-06-21 Thread Chris Angelico via Python-list
On Sat, 22 Jun 2024 at 03:28, Rayner Lucas via Python-list
 wrote:
> I'm curious about something I've encountered while updating a very old
> Tk app (originally written in Python 1, but I've ported it to Python 2
> as a first step towards getting it running on modern systems).
>
> I am using Python 2.7.18 on a Windows 10 system. If there's any other
> relevant information I should provide please let me know.

Unfortunately, you're running into one of the most annoying problems
from Python 2 and Windows: "narrow builds". You don't actually have
proper Unicode support. You have a broken implementation that works
for UCS-2 but doesn't actually support astral characters.

If you switch to a Linux system, it should work correctly, and you'll
be able to migrate the rest of the way onto Python 3. Once you achieve
that, you'll be able to operate on Windows or Linux equivalently,
since Python 3 solved this problem. At least, I *think* it will; my
current system has a Python 2 installed, but doesn't have tkinter
(because I never bothered to install it), and it's no longer available
from the upstream Debian repos, so I only tested it in the console.
But the decoding certainly worked.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list