>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764
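
For anyone who wants to repeat the measurement, here is a minimal sketch that runs the same statements from a script, so the identical file can be executed on each interpreter. The name run_benchmarks and the third statement (a control using 'é', which exists in Latin-1) are assumptions added for illustration and are not part of the original measurements. The sketch also uses a smaller number of loops than the timeit default, so its absolute figures are not comparable with the ones quoted in this message; only the ratios between interpreters are of interest.

```python
import sys
import timeit

# The first two statements are the ones timed in the session above.
# The last one is an added control (an assumption, not in the original
# measurements) using 'é', a character that exists in Latin-1.
STATEMENTS = [
    "('ab…' * 1000).replace('…', '……')",
    "('ab…' * 10).replace('…', 'œ…')",
    "('abé' * 10).replace('é', 'éé')",
]


def run_benchmarks(statements=STATEMENTS, number=10000, repeat=3):
    """Return {statement: best time in seconds} for the running interpreter."""
    results = {}
    for stmt in statements:
        # Take the minimum of several runs to reduce timing noise.
        results[stmt] = min(timeit.repeat(stmt, number=number, repeat=repeat))
    return results


if __name__ == "__main__":
    print(sys.version)
    for stmt, seconds in run_benchmarks().items():
        print("%-45s %.4f s" % (stmt, seconds))
```

Running the file once under each interpreter and comparing the lines pairwise gives the slowdown factor per statement.
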
>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044
>>> timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I noticed, intuitively and empirically, that this happens with cp1252 or mac-roman characters and not with characters that belong to the latin-1 coding scheme.
* Bad luck: such characters are common in French text (and in some other European languages).
* I do not recall the extreme cases I found. Believe me, when I speak about a few hundred percent, I do not lie.

My take on the subject.

This is a typical Python disease: do not solve the problem, but find a workaround which is expected to solve it and which, in the end, solves nothing. As far as I know, the tools to break the "BMP limit" already exist. They are called utf-8 or ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from the time the unicode type was introduced in Python 2. If I recall correctly, it was from Victor Stinner. He wrote something like this: "Let's go with ucs-4, and the problems are solved for ever". He was so right.

I have been following the dev list for years, and my feeling is that there is a latent and permanent conflict between "ascii users" and "non ascii users" (see the reintroduction of the unicode literal).

Please, do not get me wrong. As a non-computer scientist, I'm very happy with Python. But when I step back and take a more distant view, I become more and more sceptical.

PS: Py3.3b2 is still crashing (silently exiting) with cp65001.

jmf

--
http://mail.python.org/mailman/listinfo/python-list