On Jan 16, 5:38 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
> On Jan 16, 3:58 pm, Steven D'Aprano <st...@remove-this-cybersource.com.au> wrote:
> > On Sat, 16 Jan 2010 15:35:05 -0800, gizli wrote:
> > > Hi all,
> > >
> > > I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
> > > into this issue yesterday and wanted to check to see if this is a
> > > python bug. It seems that there is an inconsistency between lists and
> > > dictionaries in the way that unicode objects are handled. Take a look at
> > > the following example:
> > >
> > > >>> test_dict = {u'öğe':1}
> > > >>> u'öğe' in test_dict.keys()
> > > True
> > > >>> 'öğe' in test_dict.keys()
> > > True
> >
> > I can't reproduce your result, at least not in 2.6.1:
> >
> > >>> test_dict = {u'öğe':1}
> > >>> u'öğe' in test_dict.keys()
> > True
> > >>> 'öğe' in test_dict.keys()
> > __main__:1: UnicodeWarning: Unicode equal comparison failed to convert
> > both arguments to Unicode - interpreting them as being unequal
> > False
>
> The OP changed his default encoding. I was able to confirm the
> behavior after setting the default encoding to latin-1.
>
> This is most definitely a bug in Python.

I've thought it over and I'm not so sure it's a bug now, but it is highly questionable. Here is a more detailed explanation. The following session shows why; my terminal is UTF-8.

Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys) # get sys.setdefaultencoding back
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> u'öğe' == 'öğe'
True
>>> test_dict = {u'öğe':1}
>>> test_dict['öğe']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xb6\xc4\x9fe'

So the source encoding is UTF-8, and you see I've set the default encoding to UTF-8. You'll notice that u'öğe' and 'öğe' compare equal; this is entirely correct. Given that UTF-8 is the source encoding, the string 'öğe' will be read as a byte string holding the UTF-8 encoding of those Unicode characters. And, given that UTF-8 is also the default encoding, the byte string will be decoded using UTF-8 for the comparison, and so will be equal to the Unicode string.

Given that the two are equal, the correct behavior for dicts would be to treat the two as the same key. However, it doesn't. In fact the two objects don't even have the same hash code:

>>> hash(u'öğe')
1671320785
>>> hash('öğe')
-813744964

This ought to be a bug; objects that compare equal and are hashable must have the same hash code. However, given that it is crucially important to be as fast as possible when calculating the hash code of ASCII strings, I could imagine that this is deliberate. (And if it is, it should be documented as such; I looked briefly but did not see it.)

I can imagine another buggy possibility as well. test_dict['öğe'] = 2 will add a new key to the above example, but it could instead overwrite the existing key's value if the two hash codes happen to collide, because the objects compare equal.

All in all, it's a mighty mess. The best advice is to avoid it altogether and leave the default encoding alone.
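
The rule that equal objects must share a hash code can be shown without touching the default encoding at all. What follows is only a rough sketch in Python 3, not the Python 2 session above: LooseKey is a made-up class (an assumption for illustration, not part of any library) that compares equal to a str but deliberately reports a different hash. It reproduces the same split the OP saw between lists and dicts.

```python
class LooseKey:
    """Hypothetical key type: equal to a str, but with a mismatched hash."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, LooseKey):
            return self.text == other.text
        if isinstance(other, str):
            return self.text == other
        return NotImplemented

    def __hash__(self):
        # Deliberately break the rule: pick a value that is guaranteed
        # to differ from the hash of the equivalent str.
        return 0 if hash(self.text) != 0 else 1


text = "öğe"
key = LooseKey(text)
table = {key: 1}

equal = (key == text)                    # the two objects compare equal
same_hash = (hash(key) == hash(text))    # but their hashes differ

# List membership only uses ==, so the str is "found".
found_in_list = text in list(table.keys())

# Dict lookup compares hashes before it ever calls ==, so it misses.
found_in_dict = text in table

# Assigning through the str therefore adds a second key.
table[text] = 2

print(equal, same_hash, found_in_list, found_in_dict, len(table))
```

The dict never gets as far as calling == because the stored hash does not match, which is why the lookup misses even though the keys are "equal".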

Thankfully Python 3 does away with all this nonsense.

Carl Banks
--
http://mail.python.org/mailman/listinfo/python-list