On 8/19/2012 4:04 AM, Paul Rubin wrote:
> Meanwhile, an example of the 393 approach failing:
I am completely baffled by this, as this example is one where the 393 approach potentially wins.
> I was involved in a project that dealt with terabytes of OCR data of
> mostly English text. So the chars were mostly ascii,

3.3 stores ascii pages 1 byte/char rather than 2 or 4.

> but there would be occasional non-ascii chars including supplementary
> plane characters, either because of special symbols that were really in
> the text, or the typical OCR confusion emitting those symbols due to
> printing imprecision.

I doubt that there are really any non-BMP chars. As Steven said, reject such false identifications.
> That's a natural for UTF-8

3.3 would convert to utf-8 for storage on disk.

> but the PEP-393 approach would bloat up the memory requirements by a
> factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always?
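To make the numbers concrete, here is a quick check one can run on 3.3 with sys.getsizeof. The exact per-string header overhead is an implementation detail and varies a bit by version and platform, so the comments give only rough sizes; the snippet itself is mine, added for illustration.

import sys

ascii_page = 'a' * 10000                # pure ascii: stored 1 byte/char in 3.3
mixed_page = ascii_page + '\U0001D11E'  # one supplementary char: whole string goes to 4 bytes/char

print(sys.getsizeof(ascii_page))        # roughly 10000 bytes plus a small fixed header
print(sys.getsizeof(mixed_page))        # roughly 40000 bytes plus a small fixed header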
> py> s = chr(0xFFFF + 1)
> py> a, b = s
>
> That looks like Python 3.2 is buggy and that sample should just throw
> an error. s is a one-character string and should not be unpackable.
That looks like a 3.2- narrow build. Such builds treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but a compromise design that goes back about a decade to when unicode was added to Python. At that time, there were only a few defined non-BMP chars and their usage was extremely rare. There are now more extended chars than BMP chars, and their usage will become more common even in English text.
Pre 3.3, there are really 2 sub-versions of every Python version: a narrow build and a wide build, with not very well documented differences in behavior for any string with extended chars. That is an increasing problem, and would have become more of one, as extended chars are increasingly used. If you want to say that what was once a practical compromise has become a design bug, I would not argue. In any case, 3.3 fixes that split and returns Python to being one cross-platform language.
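To see the split concretely (this snippet is mine, added for illustration; the behavior described for 3.3 is what CPython 3.3 does):

s = chr(0xFFFF + 1)     # U+10000, a supplementary plane character

print(len(s))           # 1 on 3.3 and on 3.2- wide builds; 2 on 3.2- narrow builds

try:
    a, b = s            # on 3.3 this raises ValueError, as Paul expected
except ValueError as err:
    print(err)
else:
    print(ascii(a), ascii(b))   # on a 3.2- narrow build: the two surrogate code units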
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?
Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already.
> Ropes of UTF-8 segments seem like the most obvious approach and I wonder
> if it was considered. By that I mean pick some implementation constant k
> (say k=128) and represent the string as a UTF-8 encoded byte array,
> accompanied by a vector of n//k pointers into the byte array, where n is
> the number of codepoints in the string. Then you can reach any offset
> analogously to reading a random byte on a disk, by seeking to the
> appropriate block, and then reading the block and getting the char you
> want within it. Random access is then O(1) though the constant is higher
> than it would be with fixed width encoding.
I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. For strings shorter than k, access is O(len). I believe slicing would require re-indexing.
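For reference, here is how I would sketch that block-index idea in Python. The class and names are made up for illustration and are not anything from the thread; it only shows where the O(k) scan comes from, and it supports item access only, which is part of the slicing concern above.

class Utf8Blocked:
    """Sketch of the rope-of-UTF-8 idea: a UTF-8 buffer plus the byte
    offset of every k-th codepoint.  Illustrative only."""

    def __init__(self, text, k=128):
        self.k = k
        self.length = len(text)
        self.data = text.encode('utf-8')
        self.index = []                      # byte offsets of codepoints 0, k, 2k, ...
        offset = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.index.append(offset)
            offset += len(ch.encode('utf-8'))

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        blk = i // self.k                    # find the block: O(1)
        start = self.index[blk]
        end = self.index[blk + 1] if blk + 1 < len(self.index) else len(self.data)
        block = self.data[start:end].decode('utf-8')   # at most k codepoints
        return block[i % self.k]             # scan within the block: O(k)

# e.g. Utf8Blocked('x' * 300 + '\U0001D11E')[300] == '\U0001D11E'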
As 393 was nearing adoption, I proposed a scheme using utf-16 (as in narrow builds) with a supplementary index of extended chars when there are any. That makes access O(1) if there are none, and O(log(k)), where k is the number of extended chars in the string, if there are some.
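Roughly, the idea looks like this (an illustrative sketch only, not code from that proposal): keep the UTF-16 code units plus a sorted list of the codepoint positions of the extended chars, and use a binary search to translate a codepoint index into a code-unit index.

import bisect

class Utf16Indexed:
    """Sketch of utf-16 storage plus a supplementary index.  Illustrative only."""

    def __init__(self, text):
        self.length = len(text)
        self.units = []     # UTF-16 code units, as a narrow build would store them
        self.astral = []    # codepoint indices of non-BMP chars, kept sorted
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                self.astral.append(i)
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                self.units.append(cp)

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Every extended char before codepoint i adds one extra code unit,
        # so the translation is O(1) with no extended chars, O(log(k)) with k.
        shift = bisect.bisect_left(self.astral, i)
        hi = self.units[i + shift]
        if 0xD800 <= hi < 0xDC00:                         # i is itself an extended char
            lo = self.units[i + shift + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(hi)

# e.g. Utf16Indexed('abc\U0001D11Edef')[3] == '\U0001D11E'
#      Utf16Indexed('abc\U0001D11Edef')[4] == 'd'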
--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list