On Wednesday, August 29, 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompm...@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn).
> > If only a is worse going from 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with. In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web. Even though this
> program is still mostly string processing, Python 3.3 wins. Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing. Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins. And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.
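The wc module timed above is only linked, not reproduced, in the thread. Purely as a sketch of what it plausibly contains (the function names match the calls in the timeit commands; the bodies are guesses):

import io
from collections import Counter

def wc_str(text):
    # Count words by streaming the already-decoded text line by line
    # through a StringIO, as described in the message above.
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts

def wc_split(text):
    # Count words by splitting the whole string in one go.
    return Counter(text.split())

def wc(filename):
    # Read and decode the UTF-8 file on every call, then count words.
    with open(filename, 'r', encoding='utf-8') as f:
        return wc_str(f.read())

With definitions along these lines, wc() pays file I/O and decoding on every pass, while wc_str() and wc_split() work on a string prepared once in the timeit setup, which matches the pattern of the timings quoted above.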
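As for rusi's question about whether a slowdown sits in the constant term a or the per-character term b of O(a + bn): one rough way to probe that (a sketch, not something posted in the thread) is to time the same string construction at several sizes under each interpreter and compare how the cost grows with n:

import timeit

# Time building a string of n non-ASCII characters. The cost at small n
# is dominated by the fixed overhead (a); how the cost grows with n
# reflects the per-character cost (b). Run the same script under
# python3.2 and python3.3 and compare.
for n in (10, 100, 1000, 10000):
    per_call = timeit.timeit(stmt="s = '\u20ac' * %d" % n, number=10000) / 10000
    print("n = %5d: %8.3f usec per construction" % (n, per_call * 1e6))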
Forget Python and all these benchmarks. The problem is on another
level: coding schemes, typography, usage of characters, ...

For a given coding scheme, all code points/characters are equivalent.
Expecting to handle a sub-range of a coding scheme without breaking
that coding scheme is impossible. If a coding scheme is not
satisfactory, the only valid solution is to create a new coding scheme:
cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case, where the
"internal" coding depends on the fonts! Unicode (utf***), as just
another coding scheme, does not escape this rule.

This "Flexible String Representation" fails. Not only is it unable to
stick to one coding scheme, it is a mixing of coding schemes, the worst
of all possible implementations.

jmf
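For readers who want to see concretely what is being objected to here: under CPython 3.3's PEP 393 representation, each str is stored with 1, 2 or 4 bytes per character depending on the widest code point it contains, and that per-string switching is the "mixing" criticised above. A quick way to observe it (illustrative only, not code from the thread):

import sys

# Under CPython 3.3+ (PEP 393), per-character storage width depends on
# the widest code point in the string: 1 byte (latin-1 range), 2 bytes
# (other BMP characters), or 4 bytes (astral characters).
for s in ('a' * 1000,             # U+0061, 1 byte per character
          '\u20ac' * 1000,        # U+20AC euro sign, 2 bytes per character
          '\U0001d11e' * 1000):   # U+1D11E musical symbol, 4 bytes per character
    print(len(s), sys.getsizeof(s))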