On 8/19/2012 4:04 AM, Paul Rubin wrote:
> Meanwhile, an example of the 393 approach failing:
I am completely baffled by this, as this example is one where the 393 approach potentially wins.
> I was involved in a project that dealt with terabytes of OCR data of
> mostly English text. So the chars were mostly ascii,

3.3 stores ascii pages 1 byte/char rather than 2 or 4.

> but there would be occasional non-ascii chars including supplementary
> plane characters, either because of special symbols that were really in
> the text, or the typical OCR confusion emitting those symbols due to
> printing imprecision.

I doubt that there are really any non-BMP chars. As Steven said, reject such false identifications.
> That's a natural for UTF-8

3.3 would convert to utf-8 for storage on disk.

> but the PEP-393 approach would bloat up the memory requirements by a
> factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always?
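To make the numbers concrete, here is a quick check one can run on 3.3 with sys.getsizeof. The exact per-string header overhead is an implementation detail and varies a bit by version and platform, so the comments give only rough sizes; the snippet itself is mine, added for illustration.

import sys

ascii_page = 'a' * 10000                # pure ascii: stored 1 byte/char in 3.3
mixed_page = ascii_page + '\U0001D11E'  # one supplementary char: whole string goes to 4 bytes/char

print(sys.getsizeof(ascii_page))        # roughly 10000 bytes plus a small fixed header
print(sys.getsizeof(mixed_page))        # roughly 40000 bytes plus a small fixed header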
> py> s = chr(0xFFFF + 1)
> py> a, b = s
>
> That looks like Python 3.2 is buggy and that sample should just throw
> an error. s is a one-character string and should not be unpackable.
That looks like a 3.2- narrow build. Such builds treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but a compromise design that goes back about a decade to when unicode was added to Python. At that time, there were only a few defined non-BMP chars and their usage was extremely rare. There are now more extended chars than BMP chars, and their usage will become more common even in English text.
Pre 3.3, there are really 2 sub-versions of every Python version: a narrow build and a wide build, with not very well documented differences in behavior for any string with extended chars. That is an increasing problem, and would have become more of one, as extended chars are increasingly used. If you want to say that what was once a practical compromise has become a design bug, I would not argue. In any case, 3.3 fixes that split and returns Python to being one cross-platform language.
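To see the split concretely (this snippet is mine, added for illustration; the behavior described for 3.3 is what CPython 3.3 does):

s = chr(0xFFFF + 1)     # U+10000, a supplementary plane character

print(len(s))           # 1 on 3.3 and on 3.2- wide builds; 2 on 3.2- narrow builds

try:
    a, b = s            # on 3.3 this raises ValueError, as Paul expected
except ValueError as err:
    print(err)
else:
    print(ascii(a), ascii(b))   # on a 3.2- narrow build: the two surrogate code units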
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?
Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already.
> Ropes of UTF-8 segments seem like the most obvious approach and I wonder
> if it was considered. By that I mean pick some implementation constant k
> (say k=128) and represent the string as a UTF-8 encoded byte array,
> accompanied by a vector of n//k pointers into the byte array, where n is
> the number of codepoints in the string. Then you can reach any offset
> analogously to reading a random byte on a disk, by seeking to the
> appropriate block, and then reading the block and getting the char you
> want within it. Random access is then O(1) though the constant is higher
> than it would be with fixed width encoding.
I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. For strings shorter than k, access is O(len). I believe slicing would require re-indexing.
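For reference, here is how I would sketch that block-index idea in Python. The class and names are made up for illustration and are not anything from the thread; it only shows where the O(k) scan comes from, and it supports item access only, which is part of the slicing concern above.

class Utf8Blocked:
    """Sketch of the rope-of-UTF-8 idea: a UTF-8 buffer plus the byte
    offset of every k-th codepoint.  Illustrative only."""

    def __init__(self, text, k=128):
        self.k = k
        self.length = len(text)
        self.data = text.encode('utf-8')
        self.index = []                      # byte offsets of codepoints 0, k, 2k, ...
        offset = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.index.append(offset)
            offset += len(ch.encode('utf-8'))

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        blk = i // self.k                    # find the block: O(1)
        start = self.index[blk]
        end = self.index[blk + 1] if blk + 1 < len(self.index) else len(self.data)
        block = self.data[start:end].decode('utf-8')   # at most k codepoints
        return block[i % self.k]             # scan within the block: O(k)

# e.g. Utf8Blocked('x' * 300 + '\U0001D11E')[300] == '\U0001D11E'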
As 393 was nearing adoption, I proposed a scheme using utf-16 (as in narrow builds) with a supplementary index of extended chars when there are any. That makes access O(1) if there are none, and O(log(k)), where k is the number of extended chars in the string, if there are some.
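Roughly, the idea looks like this (an illustrative sketch only, not code from that proposal): keep the UTF-16 code units plus a sorted list of the codepoint positions of the extended chars, and use a binary search to translate a codepoint index into a code-unit index.

import bisect

class Utf16Indexed:
    """Sketch of utf-16 storage plus a supplementary index.  Illustrative only."""

    def __init__(self, text):
        self.length = len(text)
        self.units = []     # UTF-16 code units, as a narrow build would store them
        self.astral = []    # codepoint indices of non-BMP chars, kept sorted
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                self.astral.append(i)
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                self.units.append(cp)

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Every extended char before codepoint i adds one extra code unit,
        # so the translation is O(1) with no extended chars, O(log(k)) with k.
        shift = bisect.bisect_left(self.astral, i)
        hi = self.units[i + shift]
        if 0xD800 <= hi < 0xDC00:                         # i is itself an extended char
            lo = self.units[i + shift + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(hi)

# e.g. Utf16Indexed('abc\U0001D11Edef')[3] == '\U0001D11E'
#      Utf16Indexed('abc\U0001D11Edef')[4] == 'd'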
--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list