On Sun, 16 Sep 2007 21:58:09 -0300, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

>> I'm eagerly awaiting publication of your professional specification
>> for correctly detecting the encoding of an arbitrary stream of
>> bytes
>
> The very presence of an algorithm to detect encoding is a bug.
> Files with the .txt extension should always be treated as ANSI
> even if they contain binary data.

Why ANSI? Because it's convenient to *you*? What about the rest of the
world, who don't speak English or, even worse, don't use the Latin
alphabet?

What do you mean by "binary data"? Notepad is not interpreting the file
as "binary"; it's text, but interpreted using the wrong encoding.

If you want to understand what happens here:

The Unicode block 'CJK Unified Ideographs' goes from U+4E00 to U+9FFF
and is the largest block in the basic plane, with more than 20000 code
points. The block before it contains the famous 64 hexagrams (U+4DC0 to
U+4DFF), and the one before that is 'CJK Unified Ideographs Extension A',
ranging from U+3400 to U+4DBF.

Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
0x4100-0x7AFF is totally contained inside the above Unicode blocks.
Reading a small phrase containing only ASCII letters as if it were
UTF-16 (little-endian) collapses each pair of letters into a single
character, each character being part of 'CJK Unified Ideographs'.
(Space and punctuation are allowed in odd positions only, else the
character would not belong to the Han blocks.) As every character falls
into the same code block, the heuristic concludes that the text is in
some Eastern language encoded in UTF-16.

This is the "Well you are speed" phrase interpreted as UTF-16:
u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'

> Notepad should never be
> allowed to try to decide what the encoding is if the open
> dialog has the encoding set to ANSI.

I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
that's exactly what happens. I have to explicitly select Unicode in
order to see those Han characters.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list
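
The misreading described above can be reproduced with a few lines of Python 3. This is only a sketch of the byte-level effect, not of Notepad's actual detection logic; the helper name block_name and the second test string "a bc" are made up for illustration, and the block ranges are the ones quoted above.

```python
# Reproduce the "ASCII read as UTF-16" effect described in the post.
# block_name is a hypothetical helper, not part of any library.

def block_name(ch):
    """Return the Unicode block of ch, limited to the blocks discussed above."""
    cp = ord(ch)
    if 0x3400 <= cp <= 0x4DBF:
        return "CJK Unified Ideographs Extension A"
    if 0x4DC0 <= cp <= 0x4DFF:
        return "Yijing Hexagram Symbols"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideographs"
    return "other"

phrase = "Well you are speed"
raw = phrase.encode("ascii")        # 18 bytes, one per character
misread = raw.decode("utf-16-le")   # each byte pair becomes one code unit

print(ascii(misread))
for ch in misread:
    print("U+%04X" % ord(ch), block_name(ch))

# A space in an even (1-based) position ends up as the high byte,
# so the resulting character falls outside the Han blocks.
shifted = "a bc".encode("ascii").decode("utf-16-le")
print("U+%04X" % ord(shifted[0]), block_name(shifted[0]))
```

Every pair in "Well you are speed" lands in 'CJK Unified Ideographs' because each space sits in an odd position and so becomes the low byte of its pair; in "a bc" the space becomes the high byte, giving U+2061, which is outside those blocks.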