[ python-Bugs-1564763 ] Unicode comparison change in 2.4 vs. 2.5

SourceForge.net Wed, 27 Sep 2006 03:23:19 -0700

Bugs item #1564763, was opened at 2006-09-25 01:43
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1564763&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Joe Wreschnig (piman)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Unicode comparison change in 2.4 vs. 2.5

Initial Comment:
Python 2.5 changed the behavior of unicode comparisons
in a significant way from Python 2.4, causing a test
case failure in a module of mine. All tests passed with
an earlier version of 2.5, though unfortunately I don't
know what version in particular it started failing with.

The following code prints out all True on Python 2.4;
the strings are compared case-insensitively, whether
they are my lowerstr class, real strs, or unicodes. On
Python 2.5, the comparison between lowerstr and unicode
is false, but only in one direction.

If I make lowerstr inherit from unicode rather than
str, all comparisons are true again. So at the very
least, this is internally inconsistent. I also think
changing the behavior between 2.4 and 2.5 constitutes a
serious bug.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-27 12:22

Message:
Logged In: YES 
user_id=38388

Agreed.

In Python 2.4, doing the u'baR' == l'Bar' comparison does
try l'Bar' == u'baR' due to the special case in
default_3way_compare() I removed for Python 2.5. 

In Python 2.5 it doesn't due to the new rich comparison code
for Unicode.

I don't see any way to make Joe's code work with Python 2.5
other than using unicode as the base class, which is probably the
right thing to do anyway in preparation for Python 3k.

Closing as won't fix.


----------------------------------------------------------------------

Comment By: Armin Rigo (arigo)
Date: 2006-09-27 10:58

Message:
Logged In: YES 
user_id=4771

Well, yes, that's what I tried to explain.  I also tried to
explain how the 2.5 behavior is the "right" one, and the
previous 2.4 behavior is a mere accident of convoluted
__eq__-vs-__cmp__ code paths in the comparison code.

In other words, there is no chance to get the 2.4 behavior
in, say, Python 3000, because the __cmp__-related
convolutions will be gone and we will only have the "right"
behavior left.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 13:13

Message:
Logged In: YES 
user_id=38388

In any case, the introduction of the Unicode tp_richcompare
slot is likely the cause for this behavior:

$ python2.5 lowerstr.py
u'baR' == l'Bar'?       False
$ python2.4 lowerstr.py
u'baR' == l'Bar'?       True

Note that in both Python 2.4 and 2.5, the lowerstr.__eq__()
method is not even called. This is probably due to the fact
that Unicode can compare itself to strings, so the
w.__eq__(v) part of the rich comparison is never tried.

Now, the Unicode .__eq__() converts the string to Unicode,
so the right hand side becomes u'Bar' in both cases.

I guess a debugger session is due...


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 12:55

Message:
Logged In: YES 
user_id=38388

Ah, wrong track: Py_TPFLAGS_HAVE_RICHCOMPARE is set via
Py_TPFLAGS_DEFAULT.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-09-26 12:39

Message:
Logged In: YES 
user_id=38388

Armin, is it possible that the missing
Py_TPFLAGS_HAVE_RICHCOMPARE type flag in the Unicode type is
causing this?

I just had a look at the code and it appears that the
comparison code checks the flag rather than just looking at
the slot itself (didn't even know there was such a type flag).


----------------------------------------------------------------------

Comment By: Armin Rigo (arigo)
Date: 2006-09-25 23:33

Message:
Logged In: YES 
user_id=4771

Sorry, I missed your comment: if lowerstr inherits from
unicode then it just works.  The reason is that
'abc'.__eq__(u'abc') returns NotImplemented, but
u'abc'.__eq__('abc') returns True.

This is only inconsistent because of the asymmetry between
strings and unicodes: strings can be transparently turned
into unicodes but not the other way around -- so
unicode.__eq__(x) can accept a string as the argument x
and convert it to a unicode transparently, but str.__eq__(x)
does not try to convert x to a string if it is a unicode.

It's not a completely convincing explanation, but I think it
shows at least how we arrived at the current situation in Python
2.5.
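
As a rough analogy only: the str/unicode pair from Python 2 cannot be
reproduced on Python 3, but int and float show the same kind of one-way
conversion there. The sketch below assumes a CPython 3 interpreter; the
variable names are made up for illustration.

```python
# Analogy, not the original case: int plays the role of str,
# float plays the role of unicode.

# int.__eq__ does not know how to handle a float, so it declines.
forward = (1).__eq__(1.0)

# float.__eq__ accepts an int and converts it transparently.
reverse = (1.0).__eq__(1)

# The == operator hides the asymmetry: after int declines,
# Python falls back to the reflected float.__eq__ call.
combined = (1 == 1.0)

print(forward is NotImplemented, reverse, combined)
```

The point is only that a direct `__eq__` call can decline in one direction
while the `==` operator still succeeds through the reflected call.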

----------------------------------------------------------------------

Comment By: Armin Rigo (arigo)
Date: 2006-09-25 23:11

Message:
Logged In: YES 
user_id=4771

This is an artifact of the change in the unicode class, which
now has the proper __eq__, __ne__, __lt__, etc. methods
instead of the semi-deprecated __cmp__.  The mixture of
__cmp__ and the other methods is not very well-defined.  This
is why your code worked in 2.4: a bit by chance.

Indeed, in theory it should not, according to the language
reference.  So what I am saying is that although it is a
behavior change from 2.4 to 2.5, I would argue that it is not
a bug but a bug fix...

The reason is that if we ignore the __eq__ vs __cmp__ issues,
the operation 'a == b' is defined as: Python tries
a.__eq__(b); if this returns NotImplemented, then Python
tries b.__eq__(a).  As an exception, if type(b) is a strict
subclass of type(a), then Python tries in the other order. 
This is why you get the 2.5 behavior: if lowerstr inherits
from str, it is not a subclass of unicode, so u'abc' ==
lowerstr() tries u'abc'.__eq__(), which works immediately. 
On the other hand, if lowerstr inherits from unicode, then
Python tries first lowerstr().__eq__(u'abc').
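
The dispatch rule described above can be sketched with plain classes. This
is written for Python 3 and is only a model of the situation: `Text`,
`LowerUnrelated` and `LowerSub` are hypothetical names, with `Text` standing
in for unicode and the two other classes standing in for a lowerstr that
does not, or does, derive from it.

```python
class Text:
    """Stand-in for unicode: compares case-sensitively with anything
    that carries a .value attribute, so it never returns NotImplemented
    for the other classes below."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        value = getattr(other, "value", None)
        if value is None:
            return NotImplemented
        return self.value == value


class LowerUnrelated:
    """Case-insensitive type that is NOT a subclass of Text."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        value = getattr(other, "value", None)
        if value is None:
            return NotImplemented
        return self.value.lower() == value.lower()


class LowerSub(Text):
    """Case-insensitive type that IS a strict subclass of Text."""

    def __eq__(self, other):
        value = getattr(other, "value", None)
        if value is None:
            return NotImplemented
        return self.value.lower() == value.lower()


# Text.__eq__ runs first and answers immediately, so the
# case-insensitive method on the right is never consulted.
forward_unrelated = Text("baR") == LowerUnrelated("Bar")

# With the operands swapped, the case-insensitive method runs first.
reverse_unrelated = LowerUnrelated("Bar") == Text("baR")

# The right operand is a strict subclass of the left operand's type,
# so its __eq__ gets priority even though it is on the right.
forward_sub = Text("baR") == LowerSub("Bar")
reverse_sub = LowerSub("Bar") == Text("baR")

print(forward_unrelated, reverse_unrelated, forward_sub, reverse_sub)
```

In this model only the first comparison is false, which mirrors the
one-directional failure reported for Python 2.5.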

This part of the Python object model - when to reverse the
order or not - is a bit obscure and not completely helpful...
Subclassing built-in types generally works only up to a point.  In
your situation you should use a regular class that behaves in
a string-like fashion, with an __eq__() method doing the
case-insensitive comparison... if you can at all - there are
places where you need a real string, so this "solution" might
not be one either, but I don't see a better one :-(
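
A minimal sketch of the "regular class" suggestion, assuming Python 3 (where
str is the only text type). The class name `CaseInsensitive` and its
attribute are made up for illustration, and, as noted above, such an object
is not accepted everywhere a real string is required.

```python
class CaseInsensitive:
    """String-like wrapper that compares case-insensitively.

    It deliberately does not inherit from str, so str.__eq__ returns
    NotImplemented for it and Python falls back to this class's __eq__
    regardless of operand order.
    """

    def __init__(self, text):
        self._text = text

    def __eq__(self, other):
        if isinstance(other, CaseInsensitive):
            other = other._text
        if isinstance(other, str):
            return self._text.lower() == other.lower()
        return NotImplemented

    def __hash__(self):
        # Keep hashing consistent with the case-insensitive equality.
        return hash(self._text.lower())

    def __str__(self):
        return self._text


wrapped = CaseInsensitive("Bar")

left = (wrapped == "baR")       # CaseInsensitive.__eq__ runs directly
right = ("baR" == wrapped)      # str declines, reflected __eq__ runs
different = (wrapped == "baz")  # genuinely different text

print(left, right, different)
```

Both operand orders agree here because the built-in type on the other side
declines the comparison instead of answering it.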

----------------------------------------------------------------------
