On Thursday, 25 July 2013 12:14:46 UTC+2, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 7:27 PM, <wxjmfa...@gmail.com> wrote:
> > A coding scheme works with a unique set of characters (the repertoire),
> > and the implementation (the programming) works with a unique set
> > of encoded code points. The critical step is the path
> > {unique set of characters} <--> {unique set of encoded code points}
>
> That's called Unicode. It maps the character 'A' to the code point
> U+0041 and so on. Code points are integers. In fact, they are very
> well represented in Python that way (also in Pike, fwiw):
>
> >>> ord('A')
> 65
> >>> chr(65)
> 'A'
> >>> chr(123456)
> '\U0001e240'
> >>> ord(_)
> 123456
>
> > In the byte string world, this step is a no-op.
> >
> > In Unicode, it is exactly the purpose of a "utf" to achieve this
> > step. "utf": a confusing name covering at the same time the
> > process and the result of the process.
> > A "utf chunk", a series of bits (not bytes), hold intrisically
> > the information about the character it is representing.
>
> No, now you're looking at another level: how to store codepoints in
> memory. That demands that they be stored as bits and bytes, because PC
> memory works that way.
>
> > utf32: as a pointed many times. You are already using it (maybe
> > without knowing it). Where? in fonts (OpenType technology),
> > rendering engines, pdf files. Why? Because there is not other
> > way to do it better.
>
> And UTF-32 is an excellent system... as long as you're okay with
> spending four bytes for every character.
>
> > See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0
>
> I refuse to click this link. Give us a link to the
> python-list@python.org archive, or gmane, or something else more
> suited to the audience. I'm not going to Google Groups just to figure
> out what you're saying.
>
> > If you are not understanding my "editor" analogy. One other
> > proposed exercise. Build/create a flexible iso-8859-X coding
> > scheme. You will quickly understand where the bottleneck
> > is.
> > Two working ways:
> > - stupidly with an editor and your fingers.
> > - lazily with a sheet of paper and you head.
>
> What has this to do with the editor?
>
> > There is a clear difference between FSR and ucs-4/utf32.
>
> Yes. Memory usage. PEP 393 strings might take up half or even a
> quarter of what they'd take up in fixed UTF-32. Other than that,
> there's no difference.
>
> ChrisA
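
To make the quoted memory comparison concrete, here is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 applies). The helper name compare and the sample strings are illustrative only and do not come from the thread. Note that sys.getsizeof includes the fixed object header, so the ratios are approximate.

```python
import sys

def compare(text):
    """Return (PEP 393 object size, UTF-32 payload size) in bytes.

    Hypothetical helper, for illustration only.
    """
    fsr_size = sys.getsizeof(text)              # CPython str object, header included
    utf32_size = len(text.encode('utf-32-le'))  # 4 bytes per code point, no BOM
    return fsr_size, utf32_size

samples = {
    'ASCII': 'a' * 1000,             # code points below U+0080
    'BMP': '\u2013' * 1000,          # EN DASH, stored with 2 bytes per character
    'astral': '\U0001e240' * 1000,   # outside the BMP, stored with 4 bytes
}

for label, text in samples.items():
    fsr_size, utf32_size = compare(text)
    print(label, fsr_size, utf32_size, round(fsr_size / utf32_size, 2))
```

On a long string the header becomes negligible, so the ratio should come out near a quarter for the ASCII sample, near a half for the BMP sample, and slightly above one for the astral sample. The exact byte counts depend on the Python version and platform.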
--------

Let's start with a simple string, \textemdash or \textendash:

>>> import sys
>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list
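
The two sizes reported in the post mix a fixed per-object header with the cost of the character itself. The following is a minimal sketch, assuming CPython 3.3 or later, that separates the two by measuring how much one extra character adds to a string. The helper name bytes_per_char is hypothetical, and the absolute sizes it prints vary between Python versions and platforms.

```python
import sys

def bytes_per_char(ch, n=100):
    """Marginal storage cost, in bytes, of one more copy of ch.

    Hypothetical helper: the fixed header cancels out in the difference.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

for ch in ('a', '\u2013', '\U0001e240'):
    # code point, size of the one-character string, cost of each extra character
    print('U+%04X' % ord(ch), sys.getsizeof(ch), bytes_per_char(ch))
```

Under these assumptions, the gap between the one-character sizes comes mostly from the larger header and wider storage unit used for non-ASCII strings, while the per-character cost is what grows with the length of the text.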