On Thu, Sep 27, 2012 at 2:52 AM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> When you compare against a wide build, semantics of 3.2 and 3.3 are
>> identical, and then - and ONLY then - can you sanely compare
>> performance. And 3.3 stacks up much better.
>
> I like to have seen real world benchmarks against a pure UTF-8
> implementation. That means O(n) access to the n'th character of a
> string which could theoretically slow some programs down terribly, but I
> wonder how often that actually matters in ways that can't easily be
> worked around.
That's pretty much what we have with the PHP parts of our web site. We've decreed that everything should be UTF-8 byte streams (actually, it took some major campaigning from me to get rid of the underlying thinking that "binary-safe" and "UTF-8" and "characters" and so on were all equivalent), but there are very few places where we actually index strings in PHP.

There's a small amount of parsing, but it's all done by splitting on particular strings, and all of our structural elements fit inside ASCII. If you search for 0x0A in a UTF-8 bytestream and split at that index, it's the same as searching for U+000A in a Unicode string and splitting there, because ASCII byte values never occur inside a multi-byte UTF-8 sequence.

The few times we actually care about character length (e.g. limiting user-specified rule names to N characters), we don't much care about performance, because those checks are rare.

So I don't actually have any stats for you, because it's really easy to just not index strings at all.

ChrisA
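P.S. For anyone who wants to check the splitting claim, here is a rough sketch in Python rather than PHP. The sample text, the rule name and the char_length helper are all made up for illustration; none of it is code from our site.

```python
# Hypothetical sample data: any text containing non-ASCII characters will do.
text = "naïve café\n日本語 テキスト\nplain ascii"
data = text.encode("utf-8")

# Split the raw UTF-8 bytes on 0x0A, then decode each piece separately.
byte_parts = [part.decode("utf-8") for part in data.split(b"\n")]

# Split the decoded Unicode string on U+000A.
str_parts = text.split("\n")

# Both routes give the same pieces, without ever indexing by character.
same_split = byte_parts == str_parts


def char_length(utf8_bytes: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point. This is a single O(n) pass over the bytes.
    """
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)


# Hypothetical user-specified rule name, stored as UTF-8 bytes.
name = "règle-β".encode("utf-8")
name_chars = char_length(name)   # code points
name_bytes = len(name)           # bytes; larger when non-ASCII is present
```

The length check is the only place that walks the whole string, which is fine for the rare "limit to N characters" case.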