On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfa...@gmail.com> wrote:
> > > Back to utf. utfs are not only elements of a unique set of encoded
> > > code points. They have an interesting feature. Each "utf chunk"
> > > holds intrinsically the character (in fact the code point) it is
> > > supposed to represent. In utf-32, the obvious case, it is just
> > > the code point. In utf-8, it is the first chunk which helps, and
> > > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > > implementation using bytes, for any pointer position it is always
> > > possible to find the corresponding encoded code point and, from this,
> > > the corresponding character, without any "programmed" information. See
> > > my editor example: how to find the char under the caret? In fact,
> > > a silly example: how can the caret be positioned or moved if
> > > the underlying encoded code point cannot be discerned?
> >
> > Yes, given a pointer location into a utf-8 or utf-16 string, it is
> > easy to determine the identity of the code point at that location.
> > But this is not often a useful operation, save for resynchronization
> > in the case that the string data is corrupted. The caret of an editor
> > does not conceptually correspond to a pointer location, but to a
> > character index. Given a particular character index (e.g. 127504), an
> > editor must be able to determine the identity and/or the memory
> > location of the character at that index, and for UTF-8 and UTF-16,
> > without an auxiliary data structure, that is an O(n) operation.
> >
> > > 2) Take a look at this. Get rid of the overhead.
> > >
> > > >>> sys.getsizeof('b'*1000000 + 'c')
> > > 1000026
> > > >>> sys.getsizeof('b'*1000000 + '€')
> > > 2000040
> > >
> > > What does it mean? It means that Python has to
> > > re-encode a str every time it is necessary because
> > > it works with multiple codings.
> >
> > Large strings in practical usage do not need to be resized like this
> > often. Python 3.3 has been in production use for months now, and you
> > still have yet to produce any real-world application code that
> > demonstrates a performance regression. If there is no real-world
> > regression, then there is no problem.
> >
> > > 3) Unicode compliance. We know retrospectively that latin-1
> > > was a bad choice: unusable for 17 European languages.
> > > Believe it or not, 20 years of Unicode incubation is not
> > > long enough to learn it. When discussing once with a French
> > > Python core dev, one with commit access, he did not know one
> > > cannot use latin-1 for the French language!
> >
> > Probably because for many French strings, one can. As far as I am
> > aware, the only characters that are missing from Latin-1 are the Euro
> > sign (an unfortunate victim of history), the ligature œ (I have no
> > doubt that many users just type oe anyway), and the rare capital Ÿ
> > (the minuscule version is present in Latin-1). All French strings
> > fortunate enough to lack these characters can be represented in
> > Latin-1 and so will have a 1-byte width in the FSR.
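To make the quoted point concrete, here is a small sketch (added purely for
illustration; the helpers code_point_at and byte_offset_of are made-up names,
not from any post in this thread). From any byte offset in a UTF-8 buffer one
can back up over continuation bytes to the lead byte and decode the code point
locally, whereas mapping a character index to a byte offset still means
scanning from the start of the buffer, which is the O(n) cost described above.

def code_point_at(data, pos):
    """Decode the character whose UTF-8 sequence covers byte offset pos."""
    # Back up over continuation bytes (10xxxxxx) to the lead byte.
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    lead = data[pos]
    # The lead byte tells us how many bytes the sequence occupies.
    length = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[pos:pos + length].decode('utf-8')

def byte_offset_of(data, index):
    """Byte offset of character number `index` -- an O(n) scan in UTF-8."""
    offset = 0
    for _ in range(index):
        lead = data[offset]
        offset += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return offset

text = 'caf\u00e9 \u20ac'.encode('utf-8')
print(code_point_at(text, 4))    # 'é' -- byte 4 is a continuation byte of U+00E9
print(byte_offset_of(text, 5))   # 6  -- the euro sign starts at byte offset 6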
------

latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39

jmf
--
http://mail.python.org/mailman/listinfo/python-list
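For what it is worth, a short follow-up snippet (added for illustration, not
part of the original exchange) that separates the fixed per-object overhead
visible in the getsizeof figures above from the per-character storage cost of
CPython's flexible string representation (PEP 393). The exact byte counts vary
between CPython versions and builds; only the per-character slope is stable.

import sys

def bytes_per_char(ch, n=1000):
    # Difference two string lengths so the fixed object overhead cancels out.
    small = sys.getsizeof('b' * n + ch)
    large = sys.getsizeof('b' * (2 * n) + ch)
    return (large - small) / n

print(bytes_per_char('c'))        # ~1.0 -- pure ASCII
print(bytes_per_char('\u00fc'))   # ~1.0 -- U+00FC still fits the 1-byte form
print(bytes_per_char('\u20ac'))   # ~2.0 -- U+20AC forces 2 bytes per character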