On Thu, Dec 20, 2012 at 12:19 PM, <wxjmfa...@gmail.com> wrote:
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, a FSR model has no sense. An
> "ascii" user will use, per definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also a non
> sense, because you are per definition a "non-ascii" user of
> "non-ascii" character. Any optimisation for "ascii" user just
> become irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.
As Terry and Steven have already pointed out, there is no such thing as a
"non-ascii" user. Here I will take the complementary approach and point out
that there is also no such thing as an "ascii" user. There are only users
whose strings are 99.99% (or more) ASCII. A user may think that his program
will never be given any non-ASCII input to deal with, but experience tells
us that this thought is probably wrong.

Suppose you were to split the Unicode representation into separate
"ASCII-only" and "wide" data types. Which data type would then be the
correct one for an "ascii" user to choose? The correct answer is *always*
the wide data type, for the reason stated above. If the user chooses the
ASCII-only data type, then as soon as his program encounters non-ASCII
data, it breaks. The only users of the ASCII-only data type would be the
authors of buggy programs. The same issue applies to narrow (UTF-16) data
types.

So there really are only two viable, non-buggy options for Unicode
representations: the FSR, or always wide (UTF-32). The latter is wildly
inefficient in many cases, so Python went with the FSR.

A third option might be proposed: a build switch between the FSR and
always wide, with the promise that the two would be indistinguishable at
the Python level (apart from the amount of memory used). This is probably
not on the table, however, as it would have a non-negligible maintenance
cost, and it's not clear that anybody other than you would actually want
it.

> A solution à la FSR can not work or not work in a optimized way.
> It is not a coding scheme, it is a composite of coding schemes
> handling several characters sets. Hard to imagine something worse.

It is not a composite of coding schemes. The str type deals with exactly
*one* character set -- the UCS. The different representations are not
different coding schemes. They are *all* UTF-32.
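You can see this directly from the interpreter. The sketch below (CPython
3.3+ only; the exact byte counts are implementation details that vary by
build and version, so only the relative sizes matter) shows the FSR picking
1, 2, or 4 bytes per character based on the widest code point present:

```python
import sys

# CPython 3.3+ (PEP 393) stores a str with 1, 2, or 4 bytes per
# character, chosen by the widest code point in the string.
n = 1000
ascii_size = sys.getsizeof("a" * n)            # code points < 256  -> 1 byte/char
bmp_size = sys.getsizeof("\u20ac" * n)         # EURO SIGN < 0x10000 -> 2 bytes/char
astral_size = sys.getsizeof("\U0001f600" * n)  # emoji >= 0x10000   -> 4 bytes/char

# Same length, same character set (the UCS), same logical content model --
# only the amount of implicit leading-zero padding differs.
print(ascii_size, bmp_size, astral_size)
```

All three strings have length 1000 and behave identically at the Python
level; only the storage width differs.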
The only significant difference between the representations is that the
leading zero bytes of each character are made implicit (i.e. truncated)
if the nature of the string allows it.

> Contrary to what has been said, the bad cases I presented here are
> not corner cases.

The only significantly regressive case that you've presented here has been
str.replace on inputs engineered for bad performance. That's why people
characterize them as corner cases -- because that's exactly what they are.

> There is practically and systematically a regression
> in Py33 compared to Py32.
> That's very easy to test. I did all my tests at the light of what
> I explained above. I was not a suprise for me to this expectidly
> bad behaviour.

Have you run stringbench.py yet? When I ran it on my system, the full set
of Unicode benchmarks ran in 268.15 seconds for Python 3.2 versus 198.77
seconds for Python 3.3. That's a 26% overall speedup for the covered
benchmarks, which seem reasonably thorough. That does not demonstrate a
"systematic regression". If anything, that shows a systematic improvement.

Your cherry-picking of benchmarks is like a driver who has two routes to
their destination: one takes ten minutes on average but has one annoyingly
long traffic light, while the second takes fifteen minutes on average but
has no traffic lights (and a correspondingly higher accident rate). Yet
for some reason you insist that the second route is better, because the
traffic light makes the first route "systematically" slower.

-- 
http://mail.python.org/mailman/listinfo/python-list