Terry J. Reedy <tjre...@udel.edu> added the comment:

Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They 
support non-BMP chars but only partially, because, BY DESIGN*, indexing and len 
are by code units, not codepoints. They are documented as being UCS-2 because 
that is what M-A Lemburg, the original designer and writer of Python's unicode 
type and the unicode-capable re module, wants them to be called. The link to 
msg142037, which is one of 50+ in the thread (and many or most others disagree), 
pretty well explains his viewpoint. The positive side is that we deliver more 
than we promise. The negative side is that not promising what perhaps we 
should allows us not to deliver what perhaps we should.
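
To make the code-unit versus codepoint distinction concrete, here is a small illustration. It is only a sketch: it runs on a current Python 3 (3.3 and later have no narrow builds), so it simulates the narrow-build count by encoding to UTF-16. The helper name utf16_code_units is made up for this example.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed for text (hypothetical helper).

    This is what len() reported on a narrow build.
    """
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


text = "a\U0001D11Eb"        # 'a', MUSICAL SYMBOL G CLEF (non-BMP), 'b'
print(len(text))              # codepoints: 3
print(utf16_code_units(text)) # code units: 4, the clef is a surrogate pair
```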

*While I think this design decision may have been OK a decade ago for a first 
implementation of an *optional* text type, I do not think it so for the future 
for revised implementations of what is now *the* text type. I think narrow 
builds can and should be revised and upgraded to index, slice, and measure by 
codepoints. Here is my current idea:

If the code unit stream contains any non-BMP characters (i.e., surrogate pairs of 
16-bit code units), construct a sequence of *indexes* of such characters 
(pairs). The fixed length of the string in codepoints is n-k, where n is the 
number of code units (the current length) and k is the length of the auxiliary 
sequence and the number of pairs. For indexing, look up the character index in 
the list of indexes by binary search and increment the codepoint index by the 
index of the index found to get the corresponding code unit index. (I have 
omitted the details needed to avoid off-by-1 errors.)
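
The lookup described above can be sketched roughly as follows. This is one possible reading of the idea, not the prototype itself: the class name NarrowString, the attribute names, and the choice to store codepoint indexes (rather than code unit indexes) in the auxiliary list are all assumptions made for illustration. It handles integer indexes only and assumes well-formed input.

```python
import struct
from bisect import bisect_left


class NarrowString:
    """Hypothetical sketch: UTF-16 code units plus an index of surrogate pairs."""

    def __init__(self, text):
        raw = text.encode("utf-16-le")
        # The n code units, as a narrow build would store them.
        self.units = list(struct.unpack("<%dH" % (len(raw) // 2), raw))
        # Codepoint indexes of the k non-BMP characters, in increasing order.
        self.pair_indexes = []
        cp = i = 0
        while i < len(self.units):
            is_pair = (0xD800 <= self.units[i] <= 0xDBFF
                       and i + 1 < len(self.units)
                       and 0xDC00 <= self.units[i + 1] <= 0xDFFF)
            if is_pair:
                self.pair_indexes.append(cp)
            i += 2 if is_pair else 1
            cp += 1

    def __len__(self):
        # Length in codepoints is n - k.
        return len(self.units) - len(self.pair_indexes)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # Number of pairs strictly before this codepoint, found in O(log k).
        k = bisect_left(self.pair_indexes, index)
        u = index + k  # corresponding code unit index
        if k < len(self.pair_indexes) and self.pair_indexes[k] == index:
            # This codepoint is itself a surrogate pair: combine both units.
            hi, lo = self.units[u], self.units[u + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[u])


s = NarrowString("a\U0001D11Eb\U0001F600c")
print(len(s.units), len(s))   # 7 code units, 5 codepoints
print(s.pair_indexes)         # [1, 3]
```

Using bisect_left counts only the pairs strictly before the requested codepoint, which is where the off-by-1 care is needed. For all-BMP text the auxiliary list is empty and the lookup reduces to plain indexing.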

This would make indexing O(log(k)) when there are surrogates. If that is really 
a problem because k is a substantial fraction of a 'large' n, then one should 
use a wide build. By using a separate internal class, there would be no time or 
space penalty for all-BMP text. I will work on a prototype in Python.

PS: The OSCON link in msg142036 currently gives me a 404 Not Found.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________