On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.
It happens all the time. Let's say you've got a bunch of text, and you use a regex to scan through it looking for a match. Let's ignore the regular expression engine, since it has to look at every character anyway. But you've done your search and found your matching text and now want everything *after* it. That's not exactly an unusual use-case.

    mo = re.search(pattern, text)
    if mo:
        start, end = mo.span()
        result = text[end:]

Easy-peasy, right? But behind the scenes, you have a problem: how does Python know where text[end:] starts?

With fixed-size characters, that's O(1): Python just moves forward end*width bytes into the string. Nice and fast. With variable-sized characters, Python has to start from the beginning again, and inspect each byte or pair of bytes. That turns the slice operation into O(N), and a repeated search-and-slice -- the way any tokenizer or scanner works through its input -- into O(N**2), and that starts getting *horrible*. As always, "everything is fast for small enough N", but you *really* don't want O(N**2) operations when dealing with large amounts of data. (There's a rough sketch of that indexing cost at the end of this post.)

Insisting that the regex functions only ever return offsets to valid character boundaries doesn't help you, because the string slice method cannot know where the indexes came from. I suppose you could have a "fast slice" and a "slow slice" method, but really, that sucks, and besides, all that does is pass responsibility for tracking character boundaries to the developer instead of the language, and you know damn well that they will get it wrong and their code will silently do the wrong thing and they'll say that Python sucks and we never used to have this problem back in the good old days with ASCII. Boo sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For typical users, you end up wasting memory. That is the complaint driving PEP 393 -- memory is cheap, but it's not so cheap that you can afford to multiply your string memory by four just in case somebody someday gives you a character in one of the supplementary planes. If you have oodles of memory and small data sets, then UCS-4 is probably all you'll ever need. I hear that the club for people who have all the memory they'll ever need is holding their annual general meeting in a phone-booth this year. (There's a sketch of the memory difference at the end of this post too.)

You could say "Screw the full Unicode standard, who needs more than 64K different characters anyway?" Well, apart from Asians, and historians, and a bunch of other people. If you can control your data and make sure no non-BMP characters are used, UCS-2 is fine -- except Python doesn't actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up to the individual programmer to track character boundaries, and we know how well that works. Luckily the supplementary planes are only rarely used, and people who need them tend to buy more memory and use wide builds. People who only need a few non-BMP characters in a narrow build generally just cross their fingers and hope for the best. (The surrogate-pair problem is sketched at the end of this post as well.)

You could add a whole lot more heavyweight infrastructure to strings, turn them into souped-up ropes-on-steroids. All those extra indexes mean that you don't save any memory. Because the objects are so much bigger and more complex, your CPU cache goes to the dogs and your code still runs slow.

Which leaves us right back where we started: PEP 393.

> Obviously one can concoct hypothetical examples that would suffer.

If you think "slicing at arbitrary indexes" is a hypothetical example, I don't know what to say.

--
Steven

--
http://mail.python.org/mailman/listinfo/python-list
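
To make the indexing cost concrete, here is a toy sketch. This is *not* how CPython stores strings; it only illustrates why finding character number N in a variable-width encoding such as UTF-8 means walking every character before it, while a fixed-width encoding can jump straight there:

    # Toy illustration only: find the byte offset where character number
    # `index` starts inside valid UTF-8 bytes. Every lookup has to walk
    # the characters before it, so it is O(N) in the index.
    def utf8_char_offset(data, index):
        offset = 0
        for _ in range(index):
            lead = data[offset]
            if lead < 0x80:        # 1-byte (ASCII) character
                offset += 1
            elif lead < 0xE0:      # 2-byte sequence
                offset += 2
            elif lead < 0xF0:      # 3-byte sequence
                offset += 3
            else:                  # 4-byte sequence
                offset += 4
        return offset

    text = "a\u00e9\u6f22\U0001F600z"   # 1-, 2-, 3- and 4-byte characters in UTF-8
    data = text.encode('utf-8')
    offset = utf8_char_offset(data, 4)  # character 4 is 'z'
    print(data[offset:].decode('utf-8'))  # prints 'z'
    # With a fixed-width encoding (say UCS-4), the same lookup is just
    # offset = 4 * index -- no scanning, O(1).

That scan is exactly the work text[end:] would have to repeat on every slice if strings were stored as plain UTF-8 with no extra index.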
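
For the "UCS-4 is bulky" point, here is a rough way to see what PEP 393 buys on a Python that has it (3.3 or later). The exact byte counts are implementation details and will vary; the roughly 1x/2x/4x ratio is the interesting part:

    import sys

    # Same length, but the widest character in each string forces a
    # different internal width under PEP 393:
    ascii_only = "x" * 1000             # stored 1 byte per character
    bmp        = "\u0101" * 1000        # stored 2 bytes per character
    astral     = "\U0001F600" * 1000    # stored 4 bytes per character

    for name, s in [("ascii", ascii_only), ("bmp", bmp), ("astral", astral)]:
        print(name, len(s), sys.getsizeof(s), "bytes")

A pure UCS-4 build pays the four-bytes-per-character price for all three, even though the first string is plain ASCII.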
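
And the narrow-build problem in miniature. On a wide build or 3.3+, len() already gives the right answer, but encoding to UTF-16 shows the surrogate pair that a 3.2 narrow build exposed directly to the programmer:

    s = "a\U0001F600b"     # 'a', one supplementary-plane character, 'b'
    print(len(s))           # 3 on a wide build or 3.3+; a narrow build said 4
    print(s.encode('utf-16-be').hex())
    # -> '0061d83dde000062': the single non-BMP character becomes the
    # surrogate pair d83d de00, and a narrow build indexed and sliced in
    # those 16-bit units, not in characters.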