New submission from Daniel Stutzbach <dan...@stutzbachenterprises.com>:

Currently, Python can be built with an internal Unicode representation of UCS2 
or UCS4.  To prevent extension modules compiled with the wrong Unicode 
representation from linking, unicodeobject.h #defines many of the Unicode 
functions.  For example, PyUnicode_FromString becomes either 
PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
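The renaming can be sketched in a few lines of standalone C. This is only an illustration of the technique, not the contents of unicodeobject.h: DEMO_UNICODE_SIZE, Demo_FromString, and demo_linked_symbol are hypothetical names made up for the example, and the real header selects on its own configuration macros.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the build-time choice of Unicode width. */
#ifndef DEMO_UNICODE_SIZE
#define DEMO_UNICODE_SIZE 4
#endif

/* Rename the public name to a width-specific symbol. */
#if DEMO_UNICODE_SIZE == 4
#define Demo_FromString DemoUCS4_FromString
#else
#define Demo_FromString DemoUCS2_FromString
#endif

/* Two-step stringification so the macro is expanded before it is quoted. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Compiled under whichever width-specific name the macro selected. */
size_t Demo_FromString(const char *s)
{
    return strlen(s);
}

/* Name of the symbol that callers of Demo_FromString actually link against. */
const char *demo_linked_symbol(void)
{
    return DEMO_STR(Demo_FromString);
}
```

An extension compiled with DEMO_UNICODE_SIZE=2 would reference DemoUCS2_FromString, which a library compiled with the default of 4 does not export, so the mismatch surfaces as an undefined symbol at link or load time.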

Consequently, if one installs a binary egg (e.g., with easy_install), there's a 
good chance one will get an error such as the following when trying to use it:

        undefined symbol: PyUnicodeUCS2_FromString

In Python 2, only some extension modules were stung by this problem.  For 
Python 3, virtually every extension type will need to call a PyUnicode_* 
function, since __repr__ must return a Unicode object.  It's basically 
fruitless to upload a binary egg for Python 3 to PyPI, since it will generate 
link errors for a large fraction of downloaders (I discovered this the hard 
way).

Right now, nearly all the functions in unicodeobject.h are wrapped, but several 
are not.  Many of the unwrapped functions also have no documentation, so I'm 
guessing they are newer functions that were simply not wrapped when they were 
added.

Most extensions treat PyUnicodeObjects as opaque and do not care if the 
internal representation is UCS2 or UCS4.  We can improve ABI compatibility by 
only wrapping functions where the representation matters from the caller's 
point of view.

For example, PyUnicode_FromUnicode creates a Unicode object from an array of 
Py_UNICODE objects.  It will interpret the data differently on UCS2 vs UCS4, so 
the function should be wrapped.

On the other hand, PyUnicode_FromString creates a Unicode object from a char *.
The caller can treat the returned object as opaque, so the function should not 
be wrapped.
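To make the distinction concrete, here is a small standalone sketch that assumes nothing about CPython itself. The types demo_ucs2_t and demo_ucs4_t are hypothetical stand-ins for the two possible widths of Py_UNICODE, and demo_units_seen is an illustrative helper rather than a real API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two possible Py_UNICODE widths. */
typedef uint16_t demo_ucs2_t;
typedef uint32_t demo_ucs4_t;

/* What an extension built for UCS2 would hand over: 2 code units, 4 bytes. */
const demo_ucs2_t demo_text[2] = { 'h', 'i' };

/* Number of code units a callee finds in nbytes of data, given the unit
   width the callee itself was compiled with. */
size_t demo_units_seen(size_t nbytes, size_t callee_unit_size)
{
    return nbytes / callee_unit_size;
}
```

A callee using 2-byte units sees two code units in demo_text, while a callee using 4-byte units sees only one, so a function taking such an array cannot be shared between builds. A char * argument has the same layout in both builds, which is why a function like PyUnicode_FromString can be left unwrapped.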

The attached patch implements that rule.  It unwraps 64 opaque functions that 
were previously wrapped, and wraps 11 non-opaque functions that were previously 
unwrapped.  "make test" works with both UCS2 and UCS4 builds.

I previously brought this issue up on python-ideas; see:
http://mail.python.org/pipermail/python-ideas/2009-November/006543.html

Here's a summary of that discussion:

Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his 
proposal to standardize on UCS4, to reduce the risk of extension modules built 
with a mismatched encoding.

Stefan Behnel pointed out that easy_install should allow eggs to specify the 
encoding they require.  PJE's proposed implementation of that feature 
(http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".  
My proposal greatly increases the number of eggs that could label themselves 
"Don't Care", reducing maintenance work for package maintainers.  In other 
words, they are complementary fixes.

Guido liked the idea but expressed concern about the possibility of extension 
modules that link successfully, but later crash because they actually do depend 
on the UCS2/UCS4 distinction.

With my current patch, there are still two ways for that to happen:

1) The extension uses only opaque functions, but casts the returned PyObject * 
to PyUnicodeObject * and accesses the str member, or

2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE 
or PyUnicode_AS_DATA macros.
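Both cases come down to reading the raw buffer directly, which references no function symbol and so gives the linker nothing to reject. The following standalone sketch shows the effect under stated assumptions: demo_ucs4_text stands in for a buffer owned by a UCS4 interpreter, and the two demo_read_* helpers are hypothetical, standing in for code compiled against each width.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Buffer as a UCS4 build would store "hi": two 4-byte code units. */
const uint32_t demo_ucs4_text[2] = { 'h', 'i' };

/* Read code unit i the way code compiled for UCS4 would. */
uint32_t demo_read_ucs4(const void *buf, size_t i)
{
    uint32_t unit;
    memcpy(&unit, (const unsigned char *)buf + i * sizeof unit, sizeof unit);
    return unit;
}

/* Read code unit i the way code compiled for UCS2 would: same bytes,
   but a 2-byte stride, so it lands in the middle of a 4-byte unit. */
uint16_t demo_read_ucs2(const void *buf, size_t i)
{
    uint16_t unit;
    memcpy(&unit, (const unsigned char *)buf + i * sizeof unit, sizeof unit);
    return unit;
}
```

Reading index 1 with the 4-byte stride yields 'i', while the 2-byte stride yields part of the first code unit instead. The mismatched reader compiles and links cleanly and only misbehaves at run time, which is the failure mode Guido was concerned about.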

Most packages that poke into the internals of PyUnicodeObject also call 
non-opaque functions.  Consequently, they will still generate a linker error if 
the encoding is mismatched, as desired.

I'm trying to come up with a way to 100% guarantee that any extension poking 
into the internals will generate a linker error if the encoding is mismatched, 
even if they don't call any non-opaque functions.  I'll post about that in a 
separate comment to this bug.

----------
assignee: stutzbach
components: Interpreter Core, Unicode
messages: 105222
nosy: stutzbach
priority: normal
severity: normal
stage: needs patch
status: open
title: Improve ABI compatibility between UCS2 and UCS4 builds
type: behavior
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8654>
_______________________________________
