On Tuesday 21 August 2012 09:52:09 UTC+2, Peter Otten wrote:
> wxjmfa...@gmail.com wrote:
>
> > By chance and luckily, first attempt.
> >
> > c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
> > , 'œ')"
> > 1000000 loops, best of 3: 1.48 usec per loop
> > c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
> > , 'œ')"
> > 100000 loops, best of 3: 7.62 usec per loop
>
> OK, that is roughly factor 5. Let's see what I get:
>
> $ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 100000 loops, best of 3: 1.8 usec per loop
> $ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 10000 loops, best of 3: 9.11 usec per loop
>
> That is factor 5, too. So I can replicate your measurement on an AMD64 Linux
> system with self-built 3.3 versus system 3.2.
>
> > Note
> > The used characters are not members of the latin-1 coding
> > scheme (btw an *unusable* coding).
> > They are however characters in cp1252 and mac-roman.
>
> You seem to imply that the slowdown is connected to the inability of latin-1
> to encode "œ" and "€" (to take the examples relevant to the above
> microbench). So let's repeat with latin-1 characters:
>
> $ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 100000 loops, best of 3: 1.76 usec per loop
> $ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 10000 loops, best of 3: 10.3 usec per loop
>
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory
> that an unfortunate choice of the 8 bit encoding is causing it. Do you
> agree?
- I do not care too much about the exact numbers. It is an attempt to show the
principle.

- Calling latin-1 a bad coding rests on the fact that it is simply unusable
for some scripts / languages. That has mainly to do with the coding of
source/text files and is not really the point here.

- Now, the technical aspect. This "coding" (latin-1) may be seen, somehow, as
the pseudo-coding covering the Unicode code point range 128..255.
Unfortunately, this "coding" is not very optimal (or can be seen as such) when
you work with the full range of Unicode, but it is fine when one works only in
pure latin-1, with only 256 characters. This range 128..255 is always the
critical part (whatever coding is considered), and it probably covers the most
used characters.

I hope that was not too confusing. I have no proof for my theory; with my
experience in that field, I strongly suspect this is the bottleneck.

Same OS as before.

Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.5384088242603358, 1.532421642233382, 1.5327445924545433]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[1.561762063667686, 1.5443503206462594, 1.5458670051605168]

3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.701523104134512, 7.720358191179441, 7.614549852683501]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[4.887939423990709, 4.868787294350611, 4.865697999795991]

Quite mysterious! In any case, it is a regression.

jmf
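PS. For anyone who wants to poke at this locally, here is a minimal sketch
(standard library only, CPython 3.3+ with PEP 393; the absolute byte counts
are build-dependent, so only the relative growth is meaningful). It does not
prove anything about the timing regression by itself; it only makes the
128..255 boundary concrete by showing which internal width 3.3 picks for the
strings used above:

# Minimal sketch: under PEP 393 a str is stored in 1 byte per character when
# every code point is <= U+00FF (the latin-1 range) and widens otherwise.
# 'ä' (U+00E4) and 'ß' (U+00DF) stay in that range; '€' (U+20AC) and
# 'œ' (U+0153) do not, so those strings need 2 bytes per character.
import sys

latin1_range = "ä" * 200     # widest code point <= U+00FF -> 1 byte/char
above_latin1 = "€" * 200     # U+20AC > U+00FF             -> 2 bytes/char

print("'ä'*200 :", sys.getsizeof(latin1_range), "bytes")
print("'€'*200 :", sys.getsizeof(above_latin1), "bytes")

# The replacements from the benchmark keep each result in its width class.
print("'ä'->'ß':", sys.getsizeof(latin1_range.replace("ä", "ß")), "bytes")
print("'€'->'œ':", sys.getsizeof(above_latin1.replace("€", "œ")), "bytes")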