On Sunday, 28 July 2013 17:52:47 UTC+2, Michael Torrie wrote:
> On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
> > Good point. FSR, nice tool for those who wish to teach
> > Unicode. It is not every day, one has such an opportunity.
>
> I had a long e-mail composed, but decided to chop it down. It was still
> too long, so I ditched a lot of the context, which jmf also seems to do.
> Apologies.
>
> 1. FSR *is* UTF-32, so it is as Unicode compliant as UTF-32, since UTF-32
> is an official encoding. FSR only differs from UTF-32 in that the
> padding zeros are stripped off such that it is stored in the most
> compact form that can handle all the characters in the string, which is
> always known at string creation time. Now you can argue many things,
> but to say FSR is not Unicode compliant is quite a stretch! What
> Unicode entities or characters cannot be stored in strings using FSR?
> What sequences of bytes in FSR result in invalid Unicode entities?
>
> 2. Strings in Python *never change*. They are immutable. The +
> operator always copies strings character by character into a new string
> object, and would do so even if Python had used UTF-8 internally. If
> you're doing a lot of string concatenations, perhaps you're using the
> wrong data type. A byte buffer might be better for you, where you can
> stuff UTF-8 sequences into it to your heart's content.
>
> 3. UTF-8 and UTF-16 encodings, being variable-width encodings, mean that
> slicing a string would be very, very slow, and that's unacceptable for
> the use cases of Python strings. I'm assuming you understand big O
> notation, as you talk of experience in many languages over the years.
> FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, UTF-16 and
> any variable-width encoding are always O(n). A lot slower!
>
> 4. Unicode is, well, Unicode. You seem to hop all over the place from
> talking about code points to bytes to bits, using them all
> interchangeably. And now you seem to be claiming that a particular byte
> encoding standard (UTF-8) is by definition Unicode. Or at least that's
> how it sounds. You also claim FSR is not compliant with Unicode
> standards, which appears to me to be completely false.
>
> Is my understanding of these things wrong?
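
For anyone who wants to see why point 3 of the quoted message treats lookups in a variable-width encoding as O(n), here is a minimal sketch. The helper name `utf8_char_at` is made up for illustration and is not part of Python or of any library; it only shows that finding the n-th code point in raw UTF-8 bytes requires scanning from the start, whereas a fixed-width layout can compute the offset directly.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` in UTF-8 `data`.

    Hypothetical helper for illustration only: it has to walk the bytes
    from the beginning, so the cost grows with `index`.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1      # index of the code point currently being scanned
    start = None    # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes;
        # every other byte starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    # The requested code point was the last one in the buffer.
    return data[start:].decode("utf-8")


text = "hundre€x"
encoded = text.encode("utf-8")
print(utf8_char_at(encoded, 6), text[6])
```

With a fixed width per character, the same lookup is a multiplication and a read, which is the property the quoted message attributes to both UTF-32 and FSR.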
------

Compare these (a BDFL example, where I'm using a non-ASCII char):

Py 3.2 (narrow build)
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09897159682121348
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.09079501961732461
>>> sys.getsizeof('d')
32
>>> sys.getsizeof('€')
32
>>> sys.getsizeof('dd')
34
>>> sys.getsizeof('d€')
34

Py 3.3
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12183182740848858
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2365732969632326
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('dd')
27
>>> sys.getsizeof('d€')
42

Tell me which one seems to be more "unicode compliant"?
The goal of Unicode is to handle every char "equally".

Now, the problem: memory. Do not forget that an "FSR"-style
mechanism is *irrelevant* for a non-ASCII user. As soon as one
uses a single non-ASCII character, the ASCII advantage is lost.
(That is why we have all these dedicated coding schemes, the
UTFs included.)

>>> sys.getsizeof('abc' * 1000 + 'z')
3026
>>> sys.getsizeof('abc' * 1000 + '\U00010010')
12044

A little secret: the larger a repertoire of characters is, the
more bits you need.
Secret #2: you cannot escape from this.

jmf
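
The sizes above mix the fixed object header with the per-character storage. A rough way to separate the two, assuming CPython 3.3 or later (other implementations may store strings differently), is to compare a string with its doubled copy: the header cancels out and only the bytes per character remain. The helper name `storage_width` is made up for this sketch.

```python
import sys


def storage_width(s: str) -> int:
    """Estimate the bytes used per character of a non-empty string `s`.

    Assumption: CPython 3.3+ (PEP 393), where the width is chosen per
    string from its widest character. `s + s` has the same widest
    character as `s`, so the size difference is width * len(s).
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


ascii_only = 'abc' * 1000 + 'z'
with_euro = 'abc' * 1000 + '€'
with_astral = 'abc' * 1000 + '\U00010010'

for label, value in [('ascii', ascii_only),
                     ('euro', with_euro),
                     ('astral', with_astral)]:
    print(label, storage_width(value))
```

Under that assumption the three strings report different widths even though they differ by a single character, which is the effect the `sys.getsizeof` figures above are showing.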