Steven D'Aprano:

Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.

It does add to implementation complexity but should only add a small amount of time.
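
For concreteness, here is a minimal sketch of a single UTF-8 decode step (hypothetical code, not CPython's or glib's): the lead byte selects the sequence length, then a few masks and shifts assemble the code point. It assumes the input is valid UTF-8.

#include <cstddef>
#include <cstdint>

// Decode the code point starting at s[i] and advance i past it.
uint32_t DecodeUTF8(const unsigned char *s, size_t &i) {
    const unsigned char lead = s[i++];
    if (lead < 0x80)                        // 1 byte: ASCII
        return lead;
    if ((lead & 0xE0) == 0xC0)              // 2 bytes: U+0080..U+07FF
        return ((lead & 0x1F) << 6) | (s[i++] & 0x3F);
    if ((lead & 0xF0) == 0xE0) {            // 3 bytes: rest of the BMP
        uint32_t ch = ((lead & 0x0F) << 6) | (s[i++] & 0x3F);
        return (ch << 6) | (s[i++] & 0x3F);
    }
    uint32_t ch = ((lead & 0x07) << 6) | (s[i++] & 0x3F);  // 4 bytes: beyond the BMP
    ch = (ch << 6) | (s[i++] & 0x3F);
    return (ch << 6) | (s[i++] & 0x3F);
}

A fixed-width loop replaces all of that with a single indexed load; the masking and branching above is the extra per-character work under discussion.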

To compare costs, I am using the text of the web site http://www.mofa.go.jp/mofaj/ since it has a reasonable proportion (10%) of multi-byte characters. Since the document fits in the BMP, Python would choose a 2-byte-wide representation, so I am emulating that choice with a very simple 16-bit table-based upper-caser. Real Unicode case-conversion code is more concerned with edge cases such as the Turkic and Lithuanian locales and Greek combining characters, and with measuring and reallocating for the cases where the result is smaller or larger. See, for example, glib's real_toupper in https://git.gnome.org/browse/glib/tree/glib/guniprop.c

Here is some simplified example code that implements upper-casing over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:
http://www.scintilla.org/UTF8Up.cxx
Since I didn't want to spend too much time writing code, it only handles the BMP and, for now, has no upper-case table entries outside ASCII. If this were developed further to be maintainable, most of the masking and so forth would go into macros similar to UTF8_COMPUTE/UTF8_GET in glib. The UTF-8 case ranges from around 5% slower on average in a 32-bit release build (VC2012 on an i7 870) to a little faster on average in a 64-bit build. Both run at around a billion characters per second.

C:\u\hg\UpUTF\UpUTF>..\x64\Release\UpUTF.exe
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581
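
For reference, a stripped-down loop in the same spirit as utf8_up (a sketch with assumed names, not the code at the link above): upper-case ASCII letters in place while stepping over multi-byte sequences whole. A fuller version would use a BMP-wide case table and allow for length changes.

#include <cstddef>

void Utf8UpperAscii(unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        const unsigned char lead = s[i];
        if (lead < 0x80) {                  // ASCII: a simple range test here
            if (lead >= 'a' && lead <= 'z')
                s[i] = static_cast<unsigned char>(lead - ('a' - 'A'));
            i += 1;
        } else if ((lead & 0xE0) == 0xC0) { // 2-byte sequence
            i += 2;
        } else if ((lead & 0xF0) == 0xE0) { // 3-byte sequence (rest of the BMP)
            i += 3;
        } else {                            // 4-byte sequence beyond the BMP
            i += 4;
        }
    }
}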

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.

It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural. You can wrap access in iterators that hide the byte offsets if you like. That ensures all operations on those iterators are safe, only ever allowing the iterator to point at the start or end of a valid character.
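
A minimal sketch of such an iterator (hypothetical, not Scintilla's actual code): the byte offset stays private, and ++ always advances to the lead byte of the next character.

#include <cstddef>
#include <string>

class UTF8Iterator {
    const unsigned char *buf;
    size_t pos;                             // byte offset, not exposed by default
    static size_t SeqLen(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }
public:
    explicit UTF8Iterator(const std::string &s, size_t byteStart = 0) :
        buf(reinterpret_cast<const unsigned char *>(s.data())), pos(byteStart) {}
    UTF8Iterator &operator++() {            // step one character, not one byte
        pos += SeqLen(buf[pos]);
        return *this;
    }
    bool operator!=(const UTF8Iterator &other) const { return pos != other.pos; }
    size_t ByteOffset() const { return pos; }   // for the few calls that need bytes
};

Constructed at a character boundary and advanced only through ++, it can never end up pointing into the middle of a sequence.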

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.

I think you mean writing a lot of Latin-1 characters outside ASCII. However, even people writing texts in, say, French will find that only a small proportion of their text is outside ASCII, so the cost of UTF-8 is correspondingly small.

The counter-problem is that a French document that needs to include a single mathematical symbol (or emoji) outside Latin-1 will at least double in size as a Python string - and quadruple if, like an emoji, that character lies outside the BMP.
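
As a rough illustration with made-up but representative numbers (1000 characters of mostly-ASCII French-like text plus one emoji; Python picks 1, 2 or 4 bytes per character from the widest code point present):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

size_t Utf8Bytes(uint32_t cp) {             // bytes to encode one code point
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

int main() {
    // 969 ASCII letters, 30 accented Latin-1 letters and one emoji beyond the BMP.
    std::vector<uint32_t> doc(1000, 'e');
    for (int i = 0; i < 30; i++) doc[i] = 0x00E9;   // é
    doc[999] = 0x1F600;                             // emoji

    size_t utf8 = 0;
    uint32_t widest = 0;
    for (uint32_t cp : doc) {
        utf8 += Utf8Bytes(cp);
        if (cp > widest) widest = cp;
    }
    const size_t perChar = widest < 0x100 ? 1 : widest < 0x10000 ? 2 : 4;
    std::printf("UTF-8: %zu bytes, fixed-width: %zu bytes\n",
                utf8, perChar * doc.size());        // UTF-8: 1033, fixed-width: 4000
}

Swap the emoji for a BMP symbol such as U+2211 and the fixed-width figure drops to 2000 bytes; drop it entirely and the string fits in 1000 bytes at one byte per character, while UTF-8 still needs about 1030.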

   Neil