Steven D'Aprano:

Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.

It does add to implementation complexity but should only add a small amount of time.
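
For concreteness, here is a minimal sketch of a single UTF-8 decode step (hypothetical code, not CPython's or glib's): the lead byte selects the sequence length, then a few masks and shifts assemble the code point. It assumes the input is valid UTF-8.

#include <cstddef>
#include <cstdint>

// Decode the code point starting at s[i] and advance i past it.
uint32_t DecodeUTF8(const unsigned char *s, size_t &i) {
    const unsigned char lead = s[i++];
    if (lead < 0x80)                        // 1 byte: ASCII
        return lead;
    if ((lead & 0xE0) == 0xC0)              // 2 bytes: U+0080..U+07FF
        return ((lead & 0x1F) << 6) | (s[i++] & 0x3F);
    if ((lead & 0xF0) == 0xE0) {            // 3 bytes: rest of the BMP
        uint32_t ch = ((lead & 0x0F) << 6) | (s[i++] & 0x3F);
        return (ch << 6) | (s[i++] & 0x3F);
    }
    uint32_t ch = ((lead & 0x07) << 6) | (s[i++] & 0x3F);  // 4 bytes: beyond the BMP
    ch = (ch << 6) | (s[i++] & 0x3F);
    return (ch << 6) | (s[i++] & 0x3F);
}

A fixed-width loop replaces all of that with a single indexed load; the masking and branching above is the extra per-character work under discussion.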

To compare costs, I am using the text of the web site http://www.mofa.go.jp/mofaj/ since it has a reasonable proportion (10%) of multi-byte characters. Since the document fits in the BMP, Python would choose a 2-byte-wide representation, so I am emulating that choice with a very simple 16-bit table-based upper-caser. Real Unicode case-conversion code is more concerned with edge cases such as the Turkic and Lithuanian locales and Greek combining characters, and with measuring and reallocating for the cases where the result is smaller or larger. See, for example, glib's real_toupper in https://git.gnome.org/browse/glib/tree/glib/guniprop.c

Here is some simplified example code that implements upper-casing over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:
http://www.scintilla.org/UTF8Up.cxx
Since I didn't want to spend too much time writing code, it only handles the BMP and, for now, has no upper-case table entries outside ASCII. If this were developed further to be maintainable, most of the masking and so forth would go into macros similar to UTF8_COMPUTE/UTF8_GET in glib. The UTF-8 case ranges from around 5% slower on average in a 32-bit release build (VC2012 on an i7 870) to a little faster on average in a 64-bit build. Both run at around a billion characters per second.

C:\u\hg\UpUTF\UpUTF>..\x64\Release\UpUTF.exe
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581
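
For reference, a stripped-down loop in the same spirit as utf8_up (a sketch with assumed names, not the code at the link above): upper-case ASCII letters in place while stepping over multi-byte sequences whole. A fuller version would use a BMP-wide case table and allow for length changes.

#include <cstddef>

void Utf8UpperAscii(unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        const unsigned char lead = s[i];
        if (lead < 0x80) {                  // ASCII: a simple range test here
            if (lead >= 'a' && lead <= 'z')
                s[i] = static_cast<unsigned char>(lead - ('a' - 'A'));
            i += 1;
        } else if ((lead & 0xE0) == 0xC0) { // 2-byte sequence
            i += 2;
        } else if ((lead & 0xF0) == 0xE0) { // 3-byte sequence (rest of the BMP)
            i += 3;
        } else {                            // 4-byte sequence beyond the BMP
            i += 4;
        }
    }
}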

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.

It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural. You can wrap access in iterators that hide the byte offsets if you like. That ensures all operations on those iterators are safe, only ever allowing the iterator to point at the start or end of a valid character.
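
A minimal sketch of such an iterator (hypothetical, not Scintilla's actual code): the byte offset stays private, and ++ always advances to the lead byte of the next character.

#include <cstddef>
#include <string>

class UTF8Iterator {
    const unsigned char *buf;
    size_t pos;                             // byte offset, not exposed by default
    static size_t SeqLen(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }
public:
    explicit UTF8Iterator(const std::string &s, size_t byteStart = 0) :
        buf(reinterpret_cast<const unsigned char *>(s.data())), pos(byteStart) {}
    UTF8Iterator &operator++() {            // step one character, not one byte
        pos += SeqLen(buf[pos]);
        return *this;
    }
    bool operator!=(const UTF8Iterator &other) const { return pos != other.pos; }
    size_t ByteOffset() const { return pos; }   // for the few calls that need bytes
};

Constructed at a character boundary and advanced only through ++, it can never end up pointing into the middle of a sequence.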

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.

I think you mean writing a lot of Latin-1 characters outside ASCII. However, even people writing texts in, say, French will find that only a small proportion of their text is outside ASCII, so the cost of UTF-8 is correspondingly small.

The counter-problem is that a French document that needs to include a single mathematical symbol (or emoji) outside Latin-1 will at least double in size as a Python string - and quadruple if, like an emoji, that character lies outside the BMP.
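
As a rough illustration with made-up but representative numbers (1000 characters of mostly-ASCII French-like text plus one emoji; Python picks 1, 2 or 4 bytes per character from the widest code point present):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

size_t Utf8Bytes(uint32_t cp) {             // bytes to encode one code point
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

int main() {
    // 969 ASCII letters, 30 accented Latin-1 letters and one emoji beyond the BMP.
    std::vector<uint32_t> doc(1000, 'e');
    for (int i = 0; i < 30; i++) doc[i] = 0x00E9;   // é
    doc[999] = 0x1F600;                             // emoji

    size_t utf8 = 0;
    uint32_t widest = 0;
    for (uint32_t cp : doc) {
        utf8 += Utf8Bytes(cp);
        if (cp > widest) widest = cp;
    }
    const size_t perChar = widest < 0x100 ? 1 : widest < 0x10000 ? 2 : 4;
    std::printf("UTF-8: %zu bytes, fixed-width: %zu bytes\n",
                utf8, perChar * doc.size());        // UTF-8: 1033, fixed-width: 4000
}

Swap the emoji for a BMP symbol such as U+2211 and the fixed-width figure drops to 2000 bytes; drop it entirely and the string fits in 1000 bytes at one byte per character, while UTF-8 still needs about 1030.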

   Neil