On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

> As I understand (I think) the underlying mechanism, I can only say it
> is not a surprise that it happens.
>
> Imagine an editor. I type an "a", and internally the text is saved as
> ascii; then I type an "é", and the text can only be saved in at least
> latin-1. Then I enter an "€", and the text becomes an internal ucs-4
> "string". Then remove the "€", and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are Asian
or a historian using some obscure ancient script. NONE of the examples you
have shown in your emails have included 4-byte characters; they have all
been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen. In *both* Python 3.2 and 3.3, both é
and € are represented by two bytes. That will not change. There is a tiny
amount of fixed overhead for strings, and that overhead is slightly
different between the versions, but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
would be slow, slow, SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.
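To illustrate the buffer idea, here is a toy sketch only (ParagraphBuffer
is a made-up name; no real editor, and certainly not Python's str, works
exactly like this), showing that keystrokes can mutate a buffer in place
and an immutable string only gets built when the text is actually needed:

    # Toy sketch: a per-paragraph buffer that mutates in place,
    # so no new immutable str is created on every key press.
    class ParagraphBuffer:
        def __init__(self, text=""):
            self._chars = list(text)        # mutable storage for one paragraph

        def insert(self, index, char):
            self._chars.insert(index, char) # list insert, no str rebuilt

        def delete(self, index):
            del self._chars[index]

        def text(self):
            # An immutable str is built only when actually needed,
            # e.g. for display or saving.
            return "".join(self._chars)

    buf = ParagraphBuffer()
    buf.insert(0, "a")   # type "a"
    buf.insert(1, "é")   # type "é"
    buf.insert(2, "€")   # type "€"
    buf.delete(2)        # remove the "€" again
    print(buf.text())    # -> "aé"

The point is simply that the (relatively) expensive creation of an
immutable string happens once, when you need the text, not on every
keystroke.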
> Intuitively I expect there is some kind of slow-down between all these
> "string" conversions.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to
UCS-4 on the fly; they are converted once, when the string is created.

The tests we ran earlier, e.g.:

    ('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all we
do is create new strings. First we create a string 'ab…', then we create
another string 'ab…'*1000, then we create two new strings '…' and 'œ…',
and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.

> When I tested this flexible representation a few months ago, at the
> first alpha release, this is precisely what I tested: string
> manipulations which force this internal change, and I concluded the
> result is not brilliant. Really, a factor of 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.

> Does anybody know a way to get the size of the internal "string" in
> bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030

steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038

As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48

steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty
string.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list