On Wednesday, 24 July 2013 16:47:36 UTC+2, Michael Torrie wrote:
> On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote:
> > Sorry, you are not understanding Unicode. What is a Unicode
> > Transformation Format (UTF), what is the goal of a UTF and
> > why it is important for an implementation to work with a UTF.
>
> Really? Enlighten me.
>
> Personally, I would never use UTF as a representation *in memory* for a
> unicode string if it were up to me. Why? Because UTF characters are
> not uniform in byte width so accessing positions within the string is
> terribly slow and has to always be done by starting at the beginning of
> the string. That's at minimum O(n) compared to FSR's O(1). Surely you
> understand this. Do you dispute this fact?
>
> UTF is a great choice for interchange, though, and indeed that's what it
> was designed for.
>
> Are you calling for UTF to be adopted as the internal, in-memory
> representation of unicode? Or would you simply settle for UCS-4?
> Please be clear here. What are you saying?
>
> > Short example. Writing an editor with something like the
> > FSR is simply impossible (properly).
>
> How? FSR is just an implementation detail. It could be UCS-4 and it
> would also work.

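For readers following the quoted O(n) versus O(1) argument, here is a minimal sketch of what indexing looks like on a variable-width encoding (UTF-8) compared with a fixed-width one (UTF-32). The function names are made up for this illustration, the input is assumed to be valid, and none of this is how CPython itself is implemented.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 data.

    The bytes must be scanned from the start, because the width of
    each code point is only known once its lead byte is read: O(n).
    """
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, belongs to the previous code point
        count += 1
        if count == n:
            # The lead byte announces the length of its own sequence.
            if byte < 0x80:
                length = 1
            elif byte >> 5 == 0b110:
                length = 2
            elif byte >> 4 == 0b1110:
                length = 3
            else:
                length = 4
            return data[i:i + length].decode("utf-8")
    raise IndexError(n)


def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE data (no BOM).

    Every code point occupies 4 bytes, so the position is a simple
    multiplication: O(1).
    """
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "a\u20ac\U0001F600z"  # 1-, 3-, 4- and 1-byte sequences in UTF-8
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")
print([nth_code_point_utf8(utf8, i) for i in range(len(text))] == list(text))
print([nth_code_point_utf32(utf32, i) for i in range(len(text))] == list(text))
```

Both functions return the same characters; only the amount of work needed to reach position n differs.
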
---------

A coding scheme works with a unique set of characters (the
repertoire), and the implementation (the programming) works
with a unique set of encoded code points. The critical step
is the path

{unique set of characters} <--> {unique set of encoded code points}

Fact: there is no other way to do it properly. (This explains
why we have to live today with all these coding schemes, and
why so many coding schemes had to be created.)

How to understand it? With a sheet of paper and a pencil.

In the byte string world, this step is a no-op.

In Unicode, achieving this step is exactly the purpose of a "utf".
"utf": a confusing name covering at the same time the process
and the result of the process.
A "utf chunk", a series of bits (not bytes), intrinsically holds
the information about the character it represents.

Other "exotic" coding schemes like iso6937 or "CID-fonts" work
in the same way.

"Unicode", with the help of the "utf(s)", does not depart from
this basic rule.

-----

ucs-2: ucs-2 is a perfectly and correctly working coding scheme.
ucs-2 is not different from the other coding schemes (cp... or
iso-... or ...) and does not behave differently. It only covers
a smaller repertoire.

-----

utf32: as I have pointed out many times, you are already using it
(maybe without knowing it). Where? In fonts (OpenType technology),
rendering engines, pdf files. Why? Because there is no better
way to do it.

------

The Unicode table (its construction) is a problem per se.
It is not a technical problem, but a very important
"linguistic aspect" of Unicode.
See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

------

If you do not understand my "editor" analogy, here is another
proposed exercise. Build/create a flexible iso-8859-X coding
scheme. You will quickly understand where the bottleneck is.
Two ways to work:
- stupidly, with an editor and your fingers.
- lazily, with a sheet of paper and your head.

----

About my benchmarks: No offense.
You do not understand them, because you do not understand
what this FSR does or how characters are coded.
It is something of a vicious circle.

Conceptually, this FSR spends its time solving the
problem it creates itself, with plenty of side effects.

-----

There is a clear difference between FSR and ucs-4/utf32.

-----

See also:
http://www.unicode.org/reports/tr17/

(In my opinion, quite "dry" and not easy to understand on a
first reading.)

jmf
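
Two points above can be tried at a Python prompt: the path between a character and its encoded form under different coding schemes, and the difference between the FSR and a fixed ucs-4/utf32 layout. The following is a minimal sketch, not a benchmark. The helper names are invented for this illustration, and the storage measurement assumes CPython 3.3 or later (PEP 393), where `sys.getsizeof` reflects the per-string width chosen by the FSR.

```python
import sys


def encoded_forms(ch, codecs=("cp1252", "iso-8859-15", "utf-8",
                              "utf-16-le", "utf-32-le")):
    """Map one character to its encoded form (as hex) in each coding scheme."""
    forms = {}
    for name in codecs:
        data = ch.encode(name)
        # The path must work in both directions: the encoded form
        # leads back to exactly the same character.
        assert data.decode(name) == ch
        forms[name] = data.hex()
    return forms


def bytes_per_char(ch, n=1000):
    """Storage CPython uses per character in a string made only of `ch`.

    Comparing two strings of the same kind cancels out the fixed
    object header, leaving n times the per-character width.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


# One character, one encoded form per coding scheme.
for name, form in encoded_forms("\u20ac").items():  # EURO SIGN
    print(f"{name:12} {form}")

# FSR: the width depends on the widest character in the string.
for ch in ("a", "\u20ac", "\U0001F600"):
    fsr = bytes_per_char(ch)
    utf32 = len(ch.encode("utf-32-le"))  # always 4 bytes per code point
    print(f"U+{ord(ch):06X}  FSR: {fsr} byte(s)/char  utf-32: {utf32} bytes/char")
```

On CPython the second loop reports 1, 2 and 4 bytes per character for the FSR, against a constant 4 for utf-32. Whether that variation is an advantage or a source of side effects is the point under discussion in this thread; the sketch only makes the difference visible.
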