On Thu, 16 Jan 2014 10:51:42 +0000, Robin Becker wrote:

> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> >Or are you saying that www.unicode.org is wrong about the definitions
>>> >of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what
you don't know, I'm afraid that you're misrepresenting me. You deleted
the part of my post that made it clear that I was referring to our
resident Unicode crank, JMF <wxjmfa...@gmail.com>.

> Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string
  with a u to get Unicode);

- it is missing some functionality, e.g. casefold;

- there are two distinct implementations, narrow builds and wide builds;

- wide builds take up to four times as much memory per string as needed;

- narrow builds take up to twice as much memory per string as needed;

- worse, narrow builds have very naive (possibly even "broken")
  handling of code points in the supplementary planes, that is, the
  "astral" characters outside the Basic Multilingual Plane.

The unicode string type in Python 3 is better because:

- it is the default string type;

- it includes more functionality;

- starting in Python 3.3, it gets rid of the distinction between
  narrow and wide builds;

- which reduces the memory overhead of strings by up to a factor
  of four in many cases;

- and fixes the handling of supplementary-plane code points.

> I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
> don't understand why the latter seems slower especially since we try to
> convert early to unicode/str as a desirable internal form.

*shrug*

Who knows? Is it slower or does it only *seem* slower? Is the
performance regression platform specific? Have you traded correctness
for speed, that is, does the 2.7 version break when given astral
characters on a narrow build?

Earlier in January, you commented in another thread that "I'm not sure
if we have any non-bmp characters in the tests." If you don't, you
should have some.
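
For what it's worth, here is a minimal sketch of the sort of checks I
have in mind, assuming Python 3.3 or later. The sample strings and the
helper names has_astral and per_char_bytes are made up for illustration;
they are not anything from reportlab. The memory measurement relies on
sys.getsizeof, so treat the exact numbers as a CPython implementation
detail rather than a language guarantee.

```python
import sys

def has_astral(s):
    """Return True if s contains any code point outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in s)

def per_char_bytes(ch, n=1000):
    """Estimate the storage cost of one more copy of ch (CPython detail).

    Compares two strings that differ only by one repetition of ch, so
    the fixed per-object overhead cancels out.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

bmp_text = "caf\u00e9"
astral_text = "a\U0001F600b"   # one astral code point between two letters

# On 3.3+ every code point is a single item, so indexing, len() and
# slicing behave sensibly. A 2.x narrow build would store the astral
# character as a surrogate pair and report a length of 4 instead.
assert len(astral_text) == 3
assert astral_text[1] == "\U0001F600"
assert astral_text[::-1] == "b\U0001F600a"

# str.casefold() was added in 3.3; Python 2's unicode type lacks it.
assert "Stra\u00dfe".casefold() == "strasse"

if __name__ == "__main__":
    print("bmp_text has astral chars:   ", has_astral(bmp_text))
    print("astral_text has astral chars:", has_astral(astral_text))
    # Flexible string storage: the cost per character depends on the
    # widest code point in the string.
    for label, ch in [("ASCII", "a"), ("BMP", "\u20ac"),
                      ("astral", "\U0001F600")]:
        print(label, per_char_bytes(ch), "byte(s) per character")
```

Feeding strings like astral_text through the same code paths under 2.7
and 3.3 is one way to find out whether the faster version is also the
correct one.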

There are all sorts of reasons why your code might be slower under 3.3,
including the possibility of a non-trivial performance regression. If
you can demonstrate a test case with a significant slowdown for
real-world code, I'm sure that a bug report will be treated seriously.

> Probably I
> have some horrible error going on(eg one of the C extensions is working
> in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single-byte strings to
up to four-byte strings will have *some* cost. It's exchanging
functionality for time. The same thing happened years ago: people used
to be extremely opposed to using floating point doubles instead of
singles because of performance. And, I suppose it is true that back
when 64K was considered a lot of memory, using eight whole bytes per
floating point number (let alone ten like the IEEE Extended format)
might have seemed the height of extravagance. But today we use doubles
by default, and even if singles would be a tiny bit faster, who wants
to go back to the bad old days of single precision?

I believe the same applies to Unicode versus single-byte strings.


-- 
Steven