Hi there,
First of all, many kudos for the Python 3.x support in upcoming django 1.5,
and for the way it is handled (the approach, the docs, etc)!
I think there are some pitfalls with the @python_2_unicode_compatible
decorator as it is currently implemented in django (and with
__str__/__repr__ in general), and I want to share my thoughts before the
1.5 release. I'm sorry that this message is pretty vague; it points to
some problems with the current approach (some of them are real, some
would occur very rarely) but it doesn't propose a solution for django
other than "please review the code once more".
1) @python_2_unicode_compatible doesn't handle __repr__.
For example, this affects django.db.models.options.Options,
django.core.files.base.File (and ContentFile),
django.contrib.admin.models.LogEntry, django.template.base.Variable
and probably many others (their __repr__ incorrectly returns unicode).
This may also be the reason why django.db.models.Model.__repr__ doesn't
follow Python conventions ("__repr__ should be information-rich and
unambiguous": unicode values are replaced with "[Bad Unicode data]").
By the way, the way django detects whether the value needs replacing
is not correct and doesn't prevent all errors: for a bytestring,
"u = six.text_type(self)" decodes the data using
sys.getdefaultencoding(), while repr is (most?) often used in the console,
where sys.stdout.encoding matters.
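A minimal sketch of one way to keep __repr__ output independent of the console encoding: escape everything outside ASCII. The helper name ascii_only and the Tag class are made up for illustration and are not part of django; the snippet is written for Python 3 and only shows the escaping step.

```python
def ascii_only(text):
    """Escape every non-ASCII character in `text` with a backslash sequence."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


class Tag(object):
    """Hypothetical model-like class, used only for this illustration."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # The result contains only ASCII characters, so it can be written
        # to a console regardless of the console's encoding.
        return ascii_only("<Tag: %s>" % self.name)


tag_repr = repr(Tag("caf\u00e9"))
```

The value stays recoverable from the escaped form, which is closer to the "information-rich and unambiguous" convention than a placeholder string.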
2) Under Python 2.x, __str__ is implemented as __unicode__ encoded to
utf8. This breaks 'print django_obj' when sys.stdout.encoding is not utf8,
because print uses __str__ (not __unicode__) for custom objects, and the
terminal expects the result to be encoded in sys.stdout.encoding (print
encodes unicode strings to sys.stdout.encoding, but doesn't use
__unicode__ of objects; this is hard-coded in Python 2.x).
This may affect the REPL in Windows consoles and printing/writing to
stdout in management commands.
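The mismatch can be simulated without a Python 2 interpreter. The sketch below (Python 3) assumes a console that uses cp866, which is only an example of a non-utf8 console encoding, and treats the bytes as a utf8-encoding __str__ would produce them.

```python
# "Privet" in Cyrillic, written with escapes to keep the source ASCII-only.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# What a __str__ that always encodes to utf8 would hand to print under Python 2.
utf8_bytes = text.encode("utf-8")

# How a console that expects cp866 (an assumed example encoding) shows those bytes.
seen_on_console = utf8_bytes.decode("cp866")

# What the console would need to receive to display the text correctly.
expected_bytes = text.encode("cp866")
```

Here seen_on_console is twelve unrelated characters instead of the six original letters, because each letter takes two bytes in utf8 and the console reads every byte as a separate character.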
3) @python_2_unicode_compatible produces incorrect results
when applied twice (__str__ has already been patched by the previous
application of the decorator, so it returns a bytestring).
This is easy to overlook, e.g. when applying this decorator to a
subclass of a class that is already wrapped with
@python_2_unicode_compatible and deleting the overridden __str__
afterwards.
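The double-application problem can be sketched with a simplified stand-in for the decorator. Both decorators below are hypothetical: they are neither django's nor NLTK's code, they always take the "Python 2" branch so the effect is visible under Python 3, and the _compat_patched marker is an invented name. The methods are called directly because str() rejects bytes under Python 3.

```python
def naive_compat(cls):
    """Expose the text method as __unicode__ and make __str__ return utf8 bytes."""
    original = cls.__str__  # may be an already patched method inherited from a parent
    cls.__unicode__ = original
    cls.__str__ = lambda self: original(self).encode("utf-8")
    return cls


def guarded_compat(cls):
    """Same idea, but skip classes that do not define a fresh __str__ themselves."""
    own = cls.__dict__.get("__str__")
    if own is None or getattr(own, "_compat_patched", False):
        return cls  # inherited or already patched: nothing new to wrap

    def bytes_str(self):
        return own(self).encode("utf-8")

    bytes_str._compat_patched = True
    cls.__unicode__ = own
    cls.__str__ = bytes_str
    return cls


@naive_compat
class NaiveBase(object):
    def __str__(self):
        return "caf\u00e9"


@naive_compat
class NaiveChild(NaiveBase):  # no __str__ of its own
    pass


@guarded_compat
class GuardedBase(object):
    def __str__(self):
        return "caf\u00e9"


@guarded_compat
class GuardedChild(GuardedBase):  # no __str__ of its own
    pass


naive_result = NaiveChild().__unicode__()      # bytes: the parent's patched __str__
guarded_result = GuardedChild().__unicode__()  # text: the parent's original __str__
```

Checking the class's own __dict__ is only one possible guard; the point is that the decorator needs some way to tell an original __str__ from one it has already replaced.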
4) __str__ is not always properly implemented for this decorator in django
code. To work properly with @python_2_unicode_compatible,
__str__ must return a unicode string. This is quite subtle.
For example, take a look at django.contrib.gis.maps.google.GEvent.
__str__ is implemented as
"return mark_safe('"%s", %s' %(self.event, self.action))",
but "from __future__ import unicode_literals" is not applied to the file.
This means that if event and action are Python objects with both __str__
and __unicode__ methods defined (e.g. objects of a class wrapped with
python_2_unicode_compatible), then __str__ would be called for these
objects, not __unicode__ (because the format string is a bytestring).
Generally, "%s" % something is a good and correct pattern for a __str__
implementation (it does the right thing under both Python 2.x and 3.x
when the unicode_literals future import is there), but it is incorrect
under Python 2.x if unicode_literals is not imported.
5) %r is very tricky. If unicode_literals is in effect, or some
arguments for string formatting are unicode, "%r" % obj triggers
decoding of bytes using sys.getdefaultencoding() under Python 2.x
(unless obj is a unicode string), and if obj.__repr__ returns
non-ascii text or obj is a bytestring, an exception would be raised
(because sys.getdefaultencoding() is usually ascii).
This format specifier is used, for example, in default_error_messages
for django.db.models.fields.Field; after switching to unicode_literals
this may start raising UnicodeDecodeError for non-ascii choices
if they are custom objects (not unicode strings).
Another example is
django.http.response.HttpResponseBase._convert_to_charset,
where the BadHeaderError exception is raised: after switching to
unicode_literals, the %r format specifier starts triggering decoding of
"value" using sys.getdefaultencoding(), which is incorrect because
"value" is a bytestring in the 'charset' encoding under Python 2.x.
Another example is django.utils.datastructures.SortedDict:
its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self),
which may fail if the key is a unicode string and the value is a
bytestring or an object with __repr__ returning non-ascii text. Another
example is django.utils.encoding.DjangoUnicodeDecodeError
(it has an incorrect __str__, by the way, because it returns unicode):
it uses "%r" for self.obj with a unicode format string,
and this would blow up if __repr__ of obj returns non-ascii text.
There are other places where %r is used and they are all fragile.
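The failing step can be reproduced in isolation. The sketch below (Python 3) performs by hand the ascii decode that Python 2.x applies implicitly when a bytestring repr is mixed into a unicode format string; the Choice repr and the explicit utf8 decode are assumptions made for the example, not django code.

```python
# Bytes such as a custom object's __repr__ might return under Python 2.
raw_repr = "<Choice: caf\u00e9>".encode("utf-8")

# The implicit coercion uses sys.getdefaultencoding(), which is usually ascii.
try:
    raw_repr.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Decoding explicitly with a known encoding avoids the implicit step;
# "replace" keeps the formatting from raising on unexpected bytes.
safe_repr = raw_repr.decode("utf-8", "replace")
message = "Value %s is not a valid choice." % safe_repr
```

This only works when the encoding of the bytes is actually known, which is exactly what a generic "%r" cannot assume.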
I've implemented another python_2_unicode_compatible decorator (inspired
by django's, the idea is cool) for
NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which
resolves some of the issues above (it handles __repr__, limits __str__ and
__repr__ to ascii and supports subclassing better). The article (rather
lengthy, with some django bashing :) that provides the motivation for the
decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the
code in the article is a bit outdated, it is not the code used in NLTK;
the NLTK version was improved, but I haven't updated the article yet).