On Thu, Jul 25, 2013 at 7:27 PM, <wxjmfa...@gmail.com> wrote:
> A coding scheme works with a unique set of characters (the repertoire),
> and the implementation (the programming) works with a unique set
> of encoded code points. The critical step is the path
> {unique set of characters} <--> {unique set of encoded code points}

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

> In the byte string world, this step is a no-op.
>
> In Unicode, it is exactly the purpose of a "utf" to achieve this
> step. "utf": a confusing name covering at the same time the
> process and the result of the process.
> A "utf chunk", a series of bits (not bytes), hold intrisically
> the information about the character it is representing.

No, now you're looking at another level: how to store codepoints in
memory. That demands that they be stored as bits and bytes, because PC
memory works that way.

> utf32: as a pointed many times. You are already using it (maybe
> without knowing it). Where? in fonts (OpenType technology),
> rendering engines, pdf files. Why? Because there is not other
> way to do it better.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.

> See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

I refuse to click this link. Give us a link to the
python-list@python.org archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

> If you are not understanding my "editor" analogy. One other
> proposed exercise. Build/create a flexible iso-8859-X coding
> scheme. You will quickly understand where the bottleneck
> is.
> Two working ways:
> - stupidly with an editor and your fingers.
> - lazily with a sheet of paper and you head.

What has this to do with the editor?

> There is a clear difference between FSR and ucs-4/utf32.

Yes. Memory usage. PEP 393 strings might take up half or even a
quarter of what they'd take up in fixed UTF-32. Other than that,
there's no difference.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list
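
P.S. For anyone who wants to see the "half or even a quarter" figure for
themselves, here is a rough sketch. It assumes CPython 3.3 or later
(where PEP 393 applies); other implementations may store strings
differently. The helper name bytes_per_char is made up for this
illustration, and the measurement works by differencing two string
sizes so the fixed object header cancels out.

```python
import sys

def bytes_per_char(sample, n=1000):
    """Estimate storage per character (hypothetical helper, CPython-specific).

    Builds two strings of n and 2*n copies of a one-character sample and
    divides the size difference by n, so the fixed header cancels out.
    """
    small = sample * n
    large = sample * (2 * n)
    return (sys.getsizeof(large) - sys.getsizeof(small)) // n

# Widest code point in the string decides the per-character width.
ascii_cost = bytes_per_char('a')             # code point below 256
bmp_cost = bytes_per_char('\u20ac')          # code point below 65536
astral_cost = bytes_per_char('\U0001e240')   # code point above 65535

# Fixed-width UTF-32 spends the same amount on every character.
utf32_cost = len('a'.encode('utf-32-le'))

print(ascii_cost, bmp_cost, astral_cost, utf32_cost)
```

Note that the width is chosen per string, not per character: one astral
character in an otherwise ASCII string pushes the whole string to the
widest representation.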