python_2_unicode_compatible pitfalls

Mikhail Korobov Thu, 27 Dec 2012 11:20:26 -0800

Hi there,

First of all, many kudos for the Python 3.x support in the upcoming Django 1.5, 
and for the way it is handled (the approach, the docs, etc.)!


I think there are some pitfalls with the @python_2_unicode_compatible decorator
as it is currently implemented in Django (and with __str__/__repr__ in general),
and I want to share my thoughts before the 1.5 release. I'm sorry that this
message is pretty vague: it points to some problems with the current
approach (some of them are real, some would occur very rarely), but it
doesn't propose a solution for Django other than "please review the code
once more".

1) @python_2_unicode_compatible doesn't handle __repr__.
   For example, this affects django.db.models.options.Options,
   django.core.files.base.File (and ContentFile),
   django.contrib.admin.models.LogEntry, django.template.base.Variable
   and probably many others (their __repr__ incorrectly returns unicode).

   This may also be the reason why django.db.models.Model.__repr__ doesn't
   follow Python conventions ("__repr__ should be information-rich and
   unambiguous"; unicode values are replaced with "[Bad Unicode data]").
   By the way, the way Django detects whether a value needs replacing
   is not correct and doesn't prevent all errors: for a bytestring,
   "u = six.text_type(self)" decodes the data using
   sys.getdefaultencoding(), while repr is (most?) often used in a console,
   where sys.stdout.encoding is what matters.

2) Under Python 2.x, __str__ is implemented as __unicode__
   encoded to utf8. This breaks 'print django_obj' when sys.stdout.encoding
   is not utf8, because print uses __str__ (not __unicode__) for custom
   objects, and the terminal expects the result to be encoded in
   sys.stdout.encoding (print encodes unicode strings to
   sys.stdout.encoding, but doesn't use __unicode__ of objects; this is
   hard-coded in Python 2.x).
   This may affect the REPL in Windows consoles and printing/writing to
   stdout in management commands.
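
A small sketch of the mismatch in (2). It is written for Python 3 so that the bytes are explicit; the cp866 console encoding and the sample text are assumptions picked for illustration, not something taken from Django:

```python
# What a Python 2 style __str__ (unicode text encoded to UTF-8) looks like
# on a console that interprets bytes as cp866. Both the cp866 encoding and
# the sample text are assumptions chosen for illustration.
text = u"\u043f\u0440\u0438\u0432\u0435\u0442"  # Russian for "hello"

utf8_bytes = text.encode("utf-8")    # what the patched __str__ would return
shown = utf8_bytes.decode("cp866")   # how a cp866 console would render it

# Encoding to the console encoding instead round-trips correctly.
console_bytes = text.encode("cp866")
round_trip = console_bytes.decode("cp866")

print(shown == text)       # False: the console shows mojibake
print(round_trip == text)  # True
```

The same bytes are fine on a utf8 terminal, which is probably why the problem is easy to miss during development.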

3) @python_2_unicode_compatible produces incorrect results
   when applied twice (__str__ has already been patched by the previous
   application of the decorator and returns a bytestring because of that).
   This is easy to overlook, e.g. when applying this decorator to a
   subclass of a class that is already wrapped with
   @python_2_unicode_compatible and deleting the overridden __str__
   afterwards.
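
One possible way to make the double application in (3) fail loudly is to have the decorator refuse classes that do not define their own, unpatched __str__. The sketch below is hypothetical: it is not Django's code, the name guarded_unicode_compatible and the _patched marker are made up, and the Python 2 branch is applied unconditionally so that the example also runs on Python 3 (where __str__ has to be called directly rather than via str()):

```python
def guarded_unicode_compatible(klass):
    """Hypothetical guarded variant; emulates the Python 2 branch only."""
    own_str = klass.__dict__.get("__str__")
    if own_str is None:
        # The class only inherits __str__, possibly an already patched one.
        raise ValueError("%s does not define its own __str__" % klass.__name__)
    if getattr(own_str, "_patched", False):
        raise ValueError("%s is already decorated" % klass.__name__)

    def __str__(self):
        # Python 2 style: bytes produced from the text version.
        return self.__unicode__().encode("utf-8")
    __str__._patched = True  # made-up marker checked by the guard above

    klass.__unicode__ = own_str
    klass.__str__ = __str__
    return klass


@guarded_unicode_compatible
class Base(object):
    def __str__(self):
        return u"caf\xe9"


class Child(Base):
    pass  # the overriding __str__ has been deleted


try:
    guarded_unicode_compatible(Child)
    second_application = "accepted"
except ValueError:
    second_application = "rejected"

print(second_application)  # rejected
```

Raising an error is only one design choice; silently skipping an already patched class would be another.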

4) __str__ is not always properly implemented for this decorator in Django
   code. To work properly with @python_2_unicode_compatible,
   __str__ must return a unicode string. This is quite subtle.
   For example, take a look at django.contrib.gis.maps.google.GEvent.
   __str__ is implemented as
   "return mark_safe('"%s", %s' %(self.event, self.action))",
   but "from __future__ import unicode_literals" is not applied to the file.
   This means that if event and action are Python objects with both __str__
   and __unicode__ methods defined (e.g. objects of a class wrapped with
   python_2_unicode_compatible), then __str__ would be called for these
   objects, not __unicode__ (because the format string is a bytestring).
   Generally, "%s" % something is a good and correct pattern for a __str__
   implementation (it does the right thing under both Python 2.x and 3.x
   when the unicode_literals future import is there), but it is incorrect
   under Python 2.x if unicode_literals is not imported.

5) %r is very tricky. If unicode_literals is in effect, or some
   arguments for string formatting are unicode,
   "%r" % obj would trigger decoding of bytes using
   sys.getdefaultencoding() under Python 2.x (unless obj is a unicode
   string), and if obj.__repr__ returns non-ascii text or obj is a
   bytestring, an exception would be raised (because
   sys.getdefaultencoding() is usually ascii).
   This format specifier is used, for example, in default_error_messages
   of django.db.models.fields.Field; after switching to unicode_literals
   this may start raising UnicodeDecodeError for non-ascii choices
   if they are custom objects (not unicode strings).
   Another example is
   django.http.response.HttpResponseBase._convert_to_charset,
   where the BadHeaderError exception is raised: after switching to
   unicode_literals, the %r format specifier starts triggering decoding of
   "value" using sys.getdefaultencoding(), which is incorrect because
   "value" is a bytestring in the 'charset' encoding under Python 2.x.
   Another example is django.utils.datastructures.SortedDict:
   its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self),
   which may fail if the key is a unicode string and the value is a
   bytestring or an object whose __repr__ returns non-ascii text. Another
   example is django.utils.encoding.DjangoUnicodeDecodeError
   (it has an incorrect __str__, by the way, because it returns unicode):
   it uses "%r" for self.obj with a unicode format string,
   and this would blow up if the __repr__ of obj returns non-ascii text.
   There are other places where %r is used, and they are all fragile.
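
To make the failure mode in (5) more concrete, here is a rough emulation written for Python 3, where bytes and text are distinct types. The helper name py2_unicode_format_r and the sample reprs are made up; the helper only mimics the step where Python 2 decodes the bytestring returned by __repr__ with the default encoding (assumed to be ascii here) when the format string is unicode:

```python
def py2_unicode_format_r(repr_bytes, default_encoding="ascii"):
    """Mimic u"%r" % obj on Python 2, given the bytes obj.__repr__ returned."""
    return repr_bytes.decode(default_encoding)


ascii_repr = b"<Choice: cafe>"                    # ascii-only __repr__
utf8_repr = u"<Choice: caf\xe9>".encode("utf-8")  # non-ascii __repr__

ok = py2_unicode_format_r(ascii_repr)

try:
    py2_unicode_format_r(utf8_repr)
    failed = False
except UnicodeDecodeError:
    failed = True

print(ok)      # <Choice: cafe>
print(failed)  # True
```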

I've implemented another python_2_unicode_compatible decorator (inspired
by Django's; the idea is cool) for
NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which
resolves some of the issues above (it handles __repr__, limits __str__ and
__repr__ to ascii and supports subclassing better). The article (rather
lengthy, with some Django bashing :) that provides the motivation for the
decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the
code in the article is a bit outdated and is not the code used in NLTK;
the NLTK version was improved, but I haven't updated the article yet).
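
The "limit __repr__ to ascii" idea can be sketched roughly as follows. This is not the NLTK code; the helper name ascii_only and the Token class are made up, and the example is written for Python 3 (on Python 2 one would return the encoded bytestring without decoding it back):

```python
def ascii_only(text):
    """Replace non-ascii characters with backslash escapes; returns text."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


class Token(object):  # hypothetical example class
    def __init__(self, word):
        self.word = word

    def __repr__(self):
        # The result contains ascii characters only, so it does not
        # depend on the console or default encoding.
        return ascii_only(u"<Token: %s>" % self.word)


r = repr(Token(u"caf\xe9"))
print(r)  # <Token: caf\xe9>
```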

