Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger <[EMAIL PROTECTED]> wrote:
>> Serge Orlov wrote:
>> > On 6/27/06, Dennis Benzinger <[EMAIL PROTECTED]> wrote:
>> >> Hi!
>> >>
>> >> The following program in an UTF-8 encoded file:
>> >>
>> >>
>> >> # -*- coding: UTF-8 -*-
>> >>
>> >> FIELDS = ("Fächer", )
>> >> FROZEN_FIELDS = frozenset(FIELDS)
>> >> FIELDS_SET = set(FIELDS)
>> >>
>> >> print u"Fächer" in FROZEN_FIELDS
>> >> print u"Fächer" in FIELDS_SET
>> >> print u"Fächer" in FIELDS
>> >>
>> >>
>> >> gives this output
>> >>
>> >>
>> >> False
>> >> False
>> >> Traceback (most recent call last):
>> >>   File "test.py", line 9, in ?
>> >>     print u"Fächer" in FIELDS
>> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> >> ordinal not in range(128)
>> >>
>> >>
>> >> Why do the first two print statements succeed and the third one fails
>> >> with an exception?
>> >
>> > Actually all three statements fail to produce correct result.
>>
>> So this is a bug in Python?
>
> No.
>
>> > frozenset remove the exception?
>> >
>> > Because sets use hash algorithm to find matches, whereas the last
>> > statement directly compares a unicode string with a byte string. Byte
>> > strings can only contain ascii characters, that's why python raises an
>> > exception. The problem is very easy to fix: use unicode strings for
>> > all non-ascii strings.
>>
>> No, byte strings contain characters which are at least 8-bit wide
>> <http://docs.python.org/ref/types.html>.
>
> Yes, but later it's written that non-ascii characters do not have
> universal meaning assigned to them. In other words if you put byte
> 0xE4 into a bytes string all python knows about it is that it's *some*
> character. If you put character U+00E4 into a unicode string python
> knows it's a "latin small letter a with diaeresis". Trying to compare
> *some* character with a specific character is obviously undefined.
> [...]

But <http://docs.python.org/ref/comparisons.html> says:

    Strings are compared lexicographically using the numeric equivalents
    (the result of the built-in function ord()) of their characters.
    Unicode and 8-bit strings are fully interoperable in this behavior.

Doesn't this mean that Unicode and 8-bit strings can be compared and that
this comparison is well defined (even if it is not meaningful)?

Thanks for your answers,
Dennis
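
For reference, here is a minimal sketch of the fix suggested in the quoted
reply (use unicode strings for all non-ASCII text). It is written for
Python 3, which is an assumption on my part since the program in this thread
uses Python 2 syntax. The names RAW_FIELDS and ascii_decode_error are
hypothetical and exist only for illustration. The explicit ASCII decode is
meant to mirror the implicit conversion that Python 2 attempts when a unicode
string is compared with a byte string, and it produces the same error message
as the traceback quoted above.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch (assumption: the thread itself targets Python 2).

# The UTF-8 bytes that the source file stores for the literal "Fächer".
RAW_FIELDS = (b"F\xc3\xa4cher",)


def ascii_decode_error(raw):
    """Return the error message from decoding raw as ASCII, or None on success."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decode once, at the boundary, using the encoding the bytes were written in.
FIELDS = tuple(raw.decode("utf-8") for raw in RAW_FIELDS)
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

if __name__ == "__main__":
    # Shows why the direct comparison failed: 0xc3 is not an ASCII byte.
    print(ascii_decode_error(RAW_FIELDS[0]))
    # With text on both sides, all three membership tests agree.
    print("Fächer" in FROZEN_FIELDS)
    print("Fächer" in FIELDS_SET)
    print("Fächer" in FIELDS)
```

Once every field is decoded to text, the set lookups and the tuple lookup
compare like with like, so none of the three tests depends on a default
encoding.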