On 8/5/2012 5:28 PM, Greg Hellings wrote:
On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <chris...@crosswire.org> wrote:


On Aug 5, 2012, at 11:37 AM, David Haslam <dfh...@googlemail.com> wrote:

FWIW, I just came across this Python Regular Expression Testing Tool:
http://www.pythonregex.com/

Does Python support the full 21-bit Unicode range?

cf. Many other regular expression engines only support the Basic
Multilingual Plane.


Yes, Python regex supports non-BMP characters. The language tags are Plane 14, 
I believe. An engine that supports only the BMP can't be said to support 
Unicode and is probably just processing bytes.


As further explanation, Python differentiates between the "string"
object, which is a sequence of 8-bit bytes holding text in whatever
encoding was selected, and the "unicode" object, which is a string of
Unicode characters. The exact internal representation probably differs
between CPython and Jython. CPython used to use UCS-2, but now that
Unicode extends beyond the BMP it can be built with either UCS-2 or
UCS-4.

To read more details see
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
under the heading "Internal Representation".

Oh. Well, that's annoying.

To see whether your Python interpreter is compiled with UCS-2 or UCS-4, you can run this from the interpreter:

import sys
sys.maxunicode

If it returns 65535, it's using UCS-2. If 1114111, then UCS-4.
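
A minimal sketch of the same check wrapped in a function, in case a script wants to branch on it. The name is_narrow_build is hypothetical, and the optional argument is there only so the logic can be tried without a second interpreter:

```python
import sys

def is_narrow_build(maxunicode=None):
    """Return True if the interpreter stores unicode as UCS-2 (narrow build).

    maxunicode defaults to sys.maxunicode; passing a value explicitly is
    only meant for experimenting with the two known results.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF (65535) means UCS-2; 0x10FFFF (1114111) means UCS-4.
    return maxunicode == 0xFFFF

if __name__ == "__main__":
    kind = "UCS-2 (narrow)" if is_narrow_build() else "UCS-4 (wide)"
    print("This interpreter is a %s build" % kind)
```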

Linux packagers apparently go the UCS-4 route, so I didn't notice any issue with using the Language Tags. But trying the above on Windows shows that the Cygwin build and the builds from python.org (2.7 & 3.2) all use UCS-2. So my script won't work correctly on Windows.

Not to worry, though. I'll just replace the Language Tags with Noncharacters in the range U+FDD0..U+FDEF. They're safe on UCS-2 builds since they're BMP code points, and they're specifically designated as "intended for process-internal uses, but are not permitted for interchange." So in the unlikely event that they appear in input, it's the fault of the USFM encoder if anything goes awry.
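
As a rough illustration of the idea (not the actual script), the noncharacter block could be handed out as internal markers, with a guard that rejects input already containing one. This sketch assumes Python 3 string literals; make_markers and contains_noncharacter are hypothetical names:

```python
import re

# The 32 noncharacters U+FDD0..U+FDEF, all inside the BMP.
NONCHAR_FIRST = 0xFDD0
NONCHAR_LAST = 0xFDEF
NONCHAR_RE = re.compile("[\uFDD0-\uFDEF]")

def make_markers(count):
    """Return `count` distinct noncharacters to use as internal markers."""
    available = NONCHAR_LAST - NONCHAR_FIRST + 1
    if count > available:
        raise ValueError("only %d noncharacters in this block" % available)
    return [chr(NONCHAR_FIRST + i) for i in range(count)]

def contains_noncharacter(text):
    """Return True if the input already uses a code point from the block."""
    return NONCHAR_RE.search(text) is not None
```

Running contains_noncharacter over the input before processing would catch the unlikely case where an encoder let one of these code points through.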

We'll have to watch for input outside of the BMP on UCS-2 Python, though, as that could cause problems.
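
One possible way to watch for this, sketched under the assumption that flagging the offending positions is enough. The function name find_non_bmp is hypothetical. On a UCS-2 build a non-BMP character shows up as two surrogate code units, so the sketch flags surrogates as well as code points above U+FFFF:

```python
def find_non_bmp(text):
    """Return the indexes in `text` that hold non-BMP data.

    Wide (UCS-4) builds expose such characters as one code point above
    0xFFFF; narrow (UCS-2) builds expose them as surrogates in the range
    0xD800..0xDFFF, so both cases are checked.
    """
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            hits.append(index)
    return hits
```

An empty result means the text stays within the BMP and should behave the same on either kind of build.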

--Chris


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
