On 6/27/06, Dennis Benzinger <[EMAIL PROTECTED]> wrote:
> Serge Orlov wrote:
> > On 6/27/06, Dennis Benzinger <[EMAIL PROTECTED]> wrote:
> >> Hi!
> >>
> >> The following program in an UTF-8 encoded file:
> >>
> >>
> >> # -*- coding: UTF-8 -*-
> >>
> >> FIELDS = ("Fächer", )
> >> FROZEN_FIELDS = frozenset(FIELDS)
> >> FIELDS_SET = set(FIELDS)
> >>
> >> print u"Fächer" in FROZEN_FIELDS
> >> print u"Fächer" in FIELDS_SET
> >> print u"Fächer" in FIELDS
> >>
> >>
> >> gives this output
> >>
> >>
> >> False
> >> False
> >> Traceback (most recent call last):
> >>   File "test.py", line 9, in ?
> >>     print u"Fächer" in FIELDS
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> >> ordinal not in range(128)
> >>
> >>
> >> Why do the first two print statements succeed and the third one fails
> >> with an exception?
> >
> > Actually all three statements fail to produce correct result.
>
> So this is a bug in Python?

No.

> > frozenset remove the exception?
> >
> > Because sets use hash algorithm to find matches, whereas the last
> > statement directly compares a unicode string with a byte string. Byte
> > strings can only contain ascii characters, that's why python raises an
> > exception. The problem is very easy to fix: use unicode strings for
> > all non-ascii strings.
>
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>.

Yes, but later it's written that non-ascii characters do not have a
universal meaning assigned to them. In other words, if you put byte 0xE4
into a byte string, all Python knows about it is that it's *some*
character. If you put character U+00E4 into a unicode string, Python
knows it's a "latin small letter a with diaeresis". Trying to compare
*some* character with a specific character is obviously undefined.

> But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

Because byte strings can come from different sources (network, files,
etc.), not only from the source code of your program, Python cannot
assume all of them are utf-8. It assumes they are ascii, because most
widespread text encodings are ascii-based. Actually it's a guess, since
there are utf-16, utf-32 and other non-ascii encodings. If you want to
experience life without guesses, put sys.setdefaultencoding("undefined")
into site.py.

--
http://mail.python.org/mailman/listinfo/python-list
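
A minimal sketch of the point about guessing encodings, for anyone reading this in the archive. It is an illustration under assumptions, not the code from the thread: it is written so that it also runs on Python 3, where a byte string is `bytes` and a unicode string is `str`, and `decodes_as` is a hypothetical helper introduced here, not a library function.

```python
# -*- coding: UTF-8 -*-
# Sketch only. In the Python 2 of this thread the two types are called
# `str` (byte string) and `unicode`; in Python 3 they are `bytes` and `str`.

text = u"F\u00e4cher"        # unicode string; U+00E4 is "ä"
raw = text.encode("utf-8")   # the bytes a UTF-8 encoded source file holds


def decodes_as(data, encoding):
    """Hypothetical helper: decoded text, or None if `data` is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Guessing ASCII fails on the first non-ASCII byte (0xc3 at position 1).
ascii_guess = decodes_as(raw, "ascii")

# Stating the real encoding recovers the original text.
utf8_text = decodes_as(raw, "utf-8")

# With unicode strings on both sides, membership behaves as expected.
fields = frozenset([utf8_text])

print(ascii_guess is None)
print(text in fields)
```

Decoding explicitly at the point where the bytes enter the program is the same fix as "use unicode strings for all non-ascii strings": the comparison then never has to guess an encoding.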