On Sep 16, 9:27 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> On Sun, 16 Sep 2007 21:58:09 -0300, [EMAIL PROTECTED]
> <[EMAIL PROTECTED]> wrote:
>
> >> I'm eagerly awaiting publication of your professional specification
> >> for correctly detecting the encoding of an arbitrary stream of
> >> bytes
>
> > The very presence of an algorithm to detect encoding is a bug.
> > Files with the .txt extension should always be treated as ANSI
> > even if they contain binary data.
>
> Why ANSI?

Because that's the absence of encoding?

> Because it's convenient to *you*?

No, it's ANSI unless told otherwise.

> What about the rest of the world that doesn't speak
> English or, even worse, doesn't use the Latin alphabet?

When the rest of the world creates the next generation of computers,
THEY can choose the defaults.

> What do you mean by "binary data"?

8-bit; ASCII is only 7-bit.

> Notepad is not interpreting the file as
> "binary", it's text,

And will treat non-ASCII data as if it were ASCII.

> but interpreted using the wrong encoding.

So that's not a serious bug? To decide that a file is Unicode
despite the absence of the appropriate markers?

>
> If you want to understand what happens here: The Unicode block for 'CJK
> Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
> basic plane, with more than 20000 code points. The block before it contains
> the famous 64 hexagrams, and the block before that is 'CJK Unified Han
> Extension A', ranging from U+3400 to U+4DBF.
> Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
> 0x4100-0x7AFF is totally contained inside the above Unicode blocks.
> Reading a small phrase containing only ASCII letters as if it were in UTF16
> would collapse each two letters into a single character, each character
> being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
> positions only, else the character would not belong to the Han blocks).
> As every character goes into the same code block, the heuristic concludes
> that the text is some Eastern language encoded in UTF16.

But...but...Notepad doesn't have a UTF16 option.

> This is the "Well you are speed" phrase interpreted as UTF16:
> u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'

How can you tell from that that it's UTF16? If there's something
stored in addition to those 18 bytes, you're being misleading.
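
For anyone who wants to check the quoted code points themselves, here is a minimal Python 3 sketch. It assumes the file holds exactly the 18 ASCII bytes of the phrase with no byte-order mark, and that the misreading is little-endian UTF-16; the helper name in_cjk_unified is made up for illustration. It does not reproduce whatever heuristic Notepad actually runs; it only shows that the same bytes decode cleanly both ways.

```python
import codecs

# The 18 ASCII bytes of the phrase from this thread, with no BOM prepended
# (assumption: this is all that is stored in the file).
raw = b"Well you are speed"

# Reinterpret the same bytes as little-endian UTF-16: each pair of ASCII
# bytes (low byte first) fuses into a single 16-bit code unit.
as_utf16 = raw.decode("utf-16-le")

def in_cjk_unified(ch):
    """Hypothetical helper: True if ch lies in U+4E00..U+9FFF."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

# A UTF-16 byte-order mark would be FF FE (little-endian) or FE FF (big-endian).
has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

print(len(raw), len(as_utf16))                      # byte count, character count
print([hex(ord(ch)) for ch in as_utf16])            # the code points quoted above
print(all(in_cjk_unified(ch) for ch in as_utf16))   # every one is in the Han block
print(has_bom)                                      # no marker in the raw bytes
```

Nothing in the bytes themselves marks them as UTF-16; a guess based only on where the resulting code points fall is a heuristic, not a detection.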

> > Notepad should never be
> > allowed to try to decide what the encoding is if the open
> > dialog has the encoding set to ANSI.
>
> I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
> that's exactly what happens. I have to explicitly select Unicode in order
> to see those Han characters.

So which is worse: you having to tell it that it's Unicode, or
Notepad deciding on its own that a file is Unicode when it isn't?

>
> --
> Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list