On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
> On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
> >
> > > But they are not ascii pages, they are (as stated) MOSTLY ascii.
> > > E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
> > > a much more memory-expensive encoding than UTF-8.
> >
> Well, it seems some software producers know what they
> are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
>
> Traceback (most recent call last):
>   File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

<facetious>
You want the Euro sign in iso-8859-1?? I object. I want the rupee sign (₹):
http://en.wikipedia.org/wiki/Indian_rupee_sign
And while we are at it, why not move it (both?) into ASCII?
</facetious>

The problems are:

1. We don't really understand what you are objecting to.

2. UTF-8, like Huffman coding, is a prefix code:
   http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
   Like Huffman coding, it saves space on a statistical argument: the
   symbols assumed to be most common get the shortest codes.

3. Unlike Huffman coding, the statistics are very political: "Is the
   Euro more important, or Chinese ideograms?" depends on whom you ask.
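
Points 2 and 3 can be checked at the interpreter. What follows is only a minimal sketch, assuming Python 3; the helper names utf8_lengths and is_prefix_free and the sample characters are made up for illustration. It shows how many bytes UTF-8 spends on a few characters, and that no codeword is the start of another one.

```python
# Sketch only: the sample characters and helper names are illustrative
# assumptions, not part of any library.
samples = ["a", "é", "€", "₹", "中", "😀"]


def utf8_lengths(chars):
    """Map each character to the number of bytes UTF-8 spends on it."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another one."""
    words = set(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)


lengths = utf8_lengths(samples)
for ch, n in lengths.items():
    # ASCII costs 1 byte; the Euro, the rupee and the ideogram cost 3 each.
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")

codewords = [ch.encode("utf-8") for ch in samples]
print("prefix-free:", is_prefix_free(codewords))
```

So ASCII gets the 1-byte codes, while the Euro sign, the rupee sign and a Chinese ideogram all pay 3 bytes each. Which symbols deserve the short codes is exactly the political part.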