On 8/5/2012 5:28 PM, Greg Hellings wrote:
On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <chris...@crosswire.org> wrote:
On Aug 5, 2012, at 11:37 AM, David Haslam <dfh...@googlemail.com> wrote:
FWIW, I just came across this Python Regular Expression Testing Tool:
http://www.pythonregex.com/
Does Python support the full 21-bit Unicode range?
cf. Many other regular expression engines only support the Basic
Multilingual Plane.
Yes, Python regex supports non-BMP characters. The language tags are Plane 14,
I believe. An engine that supports only the BMP can't be said to support
Unicode and is probably just processing bytes.
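A quick way to check the claim above is to match Plane 14 tag characters with the standard re module. This is only a sketch: it assumes a Python 3 interpreter (on Python 2 it would need unicode literals and a UCS-4 build), and the names TAG_RE and find_tags are made up for illustration.

```python
import re

# Plane 14 "tag" characters occupy U+E0000..U+E007F.
TAG_RE = re.compile("[\U000E0000-\U000E007F]+")

def find_tags(text):
    """Return every run of Plane 14 tag characters found in text."""
    return TAG_RE.findall(text)

# LANGUAGE TAG followed by tag letters "e" and "n", embedded in ASCII text.
sample = "abc\U000E0001\U000E0065\U000E006Edef"
runs = find_tags(sample)

# Print the code points of each run so the result is readable.
print([[hex(ord(ch)) for ch in run] for run in runs])
```

If the engine handled only the BMP, the character class would either fail to compile or match individual surrogates instead of whole characters.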
As further explanation, Python differentiates between the "string"
object, which is a sequence of 8-bit bytes holding text in some
selected encoding, and the "unicode" object, which is a string of Unicode
characters. The exact internal representation probably differs between
CPython and Jython. CPython used to use UCS-2 but can now be built with
either UCS-2 or UCS-4, since Unicode was extended beyond the BMP.
To read more details see
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
under the heading "Internal Representation".
Oh. Well, that's annoying.
To see whether your Python interpreter is compiled with UCS-2 or UCS-4,
you can run this from the interpreter:
import sys
sys.maxunicode
If it returns 65535, it's using UCS-2. If 1114111, then UCS-4.
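The same check can be wrapped in a small helper, for example to warn at script start-up. This is a sketch, and describe_build is a hypothetical name, not part of any library. Note that Python 3.3 and later (PEP 393) always report 1114111, so the UCS-2 branch only matters on older interpreters.

```python
import sys

def describe_build(maxunicode=None):
    """Classify the interpreter build from sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:        # 65535: narrow (UCS-2) build
        return "UCS-2"
    if maxunicode == 0x10FFFF:      # 1114111: wide (UCS-4) build
        return "UCS-4"
    return "unknown"

print(describe_build())
```

Passing the value in as a parameter is only there so the two cases can be exercised without needing both kinds of interpreter.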
Linux packagers apparently go the UCS-4 route, so I didn't notice any
issue with using the Language Tags. But trying the above on Windows
shows that the cygwin build and the builds from python.org (2.7 & 3.2)
all use UCS-2. So my script won't work correctly on Windows.
Not to worry, though. I'll just replace the Language Tags with
Noncharacters in the range U+FDD0-U+FDEF. They're UCS-2-safe since
they're BMP codepoints and they're specifically designated as "intended
for process-internal uses, but are not permitted for interchange." So in
the unlikely event that they appear in input, it's the fault of the
USFM-encoder if anything goes awry.
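Roughly, the idea could look like the sketch below. It is not the actual script: the marker assignments and the function names (check_input, wrap, strip_markers) are hypothetical, and the literals assume Python 3 (Python 2 would need the u"" prefix).

```python
import re

# Hypothetical marker assignments; any code points in U+FDD0..U+FDEF would do.
MARK_OPEN = "\uFDD0"
MARK_CLOSE = "\uFDD1"

# Matches any of the 32 noncharacters in the block.
NONCHAR_RE = re.compile("[\uFDD0-\uFDEF]")

def check_input(text):
    """Raise if the input already contains a reserved noncharacter."""
    if NONCHAR_RE.search(text):
        raise ValueError("input contains a noncharacter from U+FDD0..U+FDEF")
    return text

def wrap(text):
    """Surround text with process-internal markers."""
    return MARK_OPEN + text + MARK_CLOSE

def strip_markers(text):
    """Remove all internal markers before the text leaves the process."""
    return NONCHAR_RE.sub("", text)

marked = wrap(check_input("In the beginning"))
print(strip_markers(marked))
```

Since every marker is a single BMP code point, the regular expression behaves the same on UCS-2 and UCS-4 builds.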
We'll have to watch for input outside of the BMP on UCS-2 Python,
though, as that could cause problems.
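One way to watch for that is to scan the input up front. The sketch below uses a hypothetical helper, non_bmp_positions; it flags both real supplementary characters (as seen on a UCS-4 build) and surrogate code units (as a UCS-2 build would expose them).

```python
def non_bmp_positions(text):
    """Return indexes of items that are outside the BMP or are surrogates."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # cp > 0xFFFF: supplementary character on a wide build.
        # 0xD800..0xDFFF: half of a surrogate pair on a narrow build.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            hits.append(i)
    return hits

print(non_bmp_positions("abc"))
print(non_bmp_positions("a\U000E0001b"))
```

What to do with a hit (reject the input, warn, or handle pairs explicitly) is left open here.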
--Chris